Open ianfore opened 4 years ago
ISB-CGC bioinformaticians will examine SRA data in BigQuery. The goal is to compare how data is indexed in SRA to how NCI's ISB-CGC Cloud Resource is presenting similar information to its end-users in BigQuery tables at ISB-CGC. Insights into how researchers could combine information from both systems will be provided.
8/10/20 Kurt outlined the approach in progress for making SRA data available via DRS. There are three components:
- RAS Clearing House - for auth and authz
- DRS service
- ID Exchange
The ID Exchange would be passed an accession (e.g. an SRR) and return a DRS id. As there are multiple files for each accession, the plan is to return a DRS id for what would be a bundle. For example, the accession SRR1999478 currently has four files which would have to be bundled. The following JSON is not a proposed format but serves to illustrate what would need to be bundled and where it exists.
{
"accession": "SRR1999478",
"files": [
{
"name": "14_DN.unmap.bam", "type": "bam",
"locality": [
{ "service": "gs", "region": "us", "rehydrationRequired": true },
{ "service": "s3", "region": "us-east-1", "rehydrationRequired": true }
]
},
{
"name": "14_DN.BWA.MARK.bam", "type": "bam",
"locality": [
{ "service": "gs", "region": "us", "rehydrationRequired": true },
{ "service": "s3", "region": "us-east-1", "rehydrationRequired": true }
]
},
{
"name": "SRR1999478.pileup","type": "sra",
"locality": [
{ "service": "sra-ncbi", "region": "dbgap" }
]
},
{
"name": "SRR1999478", "type": "sra",
"locality": [
{ "service": "sra-ncbi", "region": "dbgap" },
{ "service": "gs", "region": "us" },
{ "service": "s3", "region": "us-east-1"}
]
}
]
}
There is no existing externally usable id/accession for the individual files for this SRR. DRS ids would also be generated for each of the files which could then be used to retrieve those which the user requires.
The challenge for code that reads the bundle is to work out what the types of the individual files are and which to use for the purpose at hand. This amounts to understanding the semantics of the bundle. In the example above there are two files of type bam, and nothing in the bundle conveys the significance (semantics) of either. A human with the right knowledge might infer some meaning from the file names, but there is no consistency in file naming, and even if there were, encoding structured meaning in a filename is not a sound approach. For the two files listed with a type of sra one might infer that the file named pileup is a different type of file.
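To make the ambiguity concrete, here is a minimal Python sketch that groups the files of the SRR1999478 example by their declared type. The bundle is the illustrative JSON above trimmed to name and type; the function names `files_by_type` and `ambiguous_types` are hypothetical helpers, not part of any DRS client.

```python
from collections import defaultdict

# Illustrative bundle taken from the SRR1999478 example above,
# trimmed to name and type. It is not a proposed format.
bundle = {
    "accession": "SRR1999478",
    "files": [
        {"name": "14_DN.unmap.bam", "type": "bam"},
        {"name": "14_DN.BWA.MARK.bam", "type": "bam"},
        {"name": "SRR1999478.pileup", "type": "sra"},
        {"name": "SRR1999478", "type": "sra"},
    ],
}


def files_by_type(bundle):
    """Group the file names in a bundle by their declared type."""
    grouped = defaultdict(list)
    for entry in bundle["files"]:
        grouped[entry["type"]].append(entry["name"])
    return dict(grouped)


def ambiguous_types(bundle):
    """Return the types for which the bundle offers more than one file."""
    grouped = files_by_type(bundle)
    return sorted(t for t, names in grouped.items() if len(names) > 1)


if __name__ == "__main__":
    for file_type, names in files_by_type(bundle).items():
        print(file_type, names)
    print("ambiguous:", ambiguous_types(bundle))
```

Both types come back as ambiguous, so code asking for "the bam" has nothing but the filename to choose between the two candidates.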
A second example of SRA content that would need to be represented.
{
"accession": "SRR7274638",
"files": [
{
"name": "95436.recal.cram", "type": "cram",
"locality": [
{ "service": "s3", "region": "us-east-1" },
{ "service": "gs", "region": "us" }
]
},
{
"name": "95436.recal.cram.crai", "type": "crai",
"locality": [
{ "service": "s3", "region": "us-east-1" },
{ "service": "gs", "region": "us" }
]
}
]
}
In this case the type attribute is more informative. The filename also conveys the type through the conventional use of the file extension, but having a distinct type attribute is still convenient.
A key question is whether the semantics of multiple objects should be represented within the bundle in a machine-actionable way, or whether those objects should be referenced in an external queryable schema which is used to obtain the precise ids needed for a particular purpose.
The SRA public DRS server is now available. This gives the opportunity to work through two approaches to the problem with real examples.
Approach 1 - unpacking a bundle
The following DRS call uses a drs_id which corresponds to the SRA run accession number (SRR1599287).
https://locate.ncbi.nlm.nih.gov/ga4gh/drs/v1/objects/99b71bee00f3dbc6d583887b91ea9a2f
For convenience, see attached response SRR1599287_drs.txt
The response is a bundle describing three files and the individual drs_ids for each. In order to determine which file is relevant for a given purpose it is necessary to parse the filename. There is no convention for file naming in DRS. It is not suggested here that there should be.
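A rough Python sketch of Approach 1 follows. It assumes the response follows the DRS v1 object schema as I understand it, with a `contents` array whose entries carry `name` and `id`. The `sample` object uses placeholder names and ids and is not the actual response for SRR1599287. `pick_by_suffix` is a hypothetical helper that shows the filename parsing this approach forces on the client.

```python
import json
import urllib.request

# Base URL taken from the DRS call quoted above.
DRS_BASE = "https://locate.ncbi.nlm.nih.gov/ga4gh/drs/v1/objects/"


def fetch_drs_object(drs_id, base=DRS_BASE):
    """GET a DRS object and return the decoded JSON (makes a network call)."""
    with urllib.request.urlopen(base + drs_id) as resp:
        return json.load(resp)


def list_contents(drs_object):
    """Return (name, id) pairs for the entries of a DRS bundle."""
    return [(c.get("name"), c.get("id")) for c in drs_object.get("contents", [])]


def pick_by_suffix(drs_object, suffix):
    """Select entries by filename suffix, the only handle the bundle offers."""
    return [
        drs_id
        for name, drs_id in list_contents(drs_object)
        if name and name.endswith(suffix)
    ]


# Placeholder bundle for illustration only (hypothetical names and ids).
sample = {
    "id": "bundle-id-placeholder",
    "contents": [
        {"name": "example.bam", "id": "file-id-1"},
        {"name": "example.bam.bai", "id": "file-id-2"},
        {"name": "example.sra", "id": "file-id-3"},
    ],
}

if __name__ == "__main__":
    print(list_contents(sample))
    print(pick_by_suffix(sample, ".bam"))
```

To try it against the live server, pass the drs_id from the call above to `fetch_drs_object` and hand the result to `list_contents`.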
Approach 2 - identify the specific file through Search
The Discovery Search reference implementation contains a table onek_genomes.sra_drs_files which may be queried as follows to obtain the drs_id for a specific file of interest.
SELECT drs_id, filename
FROM thousand_genomes.onek_genomes.sra_drs_files
WHERE filetype = 'bam' AND mapped = 'mapped'
More realistically, a search for files will be based on broader criteria including sample and subject attributes. See FASPScript14.py for a fully worked example.
Under this approach the specific file of interest can be identified through a mechanism (GA4GH Search) which provides a machine readable schema.
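As a hedged sketch of how a client might submit the query above, the following Python assumes the server follows the GA4GH Search specification as I understand it: a `POST` to a `/search` route with a JSON body `{"query": ...}`, returning pages with a `data` array and a `pagination.next_page_url`. The base URL is left to the caller because the address of the reference implementation is not given here. `run_search` and `http_json` are hypothetical names, and FASPScript14.py remains the worked example to consult.

```python
import json
import urllib.request

# The query from the text above.
QUERY = """
SELECT drs_id, filename
FROM thousand_genomes.onek_genomes.sra_drs_files
WHERE filetype = 'bam' AND mapped = 'mapped'
"""


def http_json(url, body=None):
    """POST body as JSON if one is given, otherwise GET; return decoded JSON."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_search(base_url, query, transport=http_json):
    """Run a query and collect the rows from every page of the result.

    Assumes each page looks like
    {"data": [...], "pagination": {"next_page_url": ...}}.
    """
    page = transport(base_url.rstrip("/") + "/search", {"query": query})
    rows = list(page.get("data", []))
    next_url = (page.get("pagination") or {}).get("next_page_url")
    while next_url:
        page = transport(next_url)
        rows.extend(page.get("data", []))
        next_url = (page.get("pagination") or {}).get("next_page_url")
    return rows
```

The `transport` parameter is only there so the paging logic can be exercised without a live server. Each returned row would carry the `drs_id` to pass on to a DRS service.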
The practice of making tables available to query for specific files is widespread among GA4GH Driver Projects. See FASPScript2, which illustrates this for both the Cancer Research Data Commons and BioDataCatalyst.
An additional approach worth exploring is how the Research Objects initiative has handled the issue of describing the contents of an object like a bundle.
Jim Vlasblom provided the following information
We have some data that might help with the item "Resolving accessions - SRA use case".
We've ingested public SRA metadata into bigquery (directly from the NCBI) and created DRS records that we've linked to this metadata. There are three tables in the striking-effort-817:ncbi_sra dataset, to which I've given you access:
'drs' - contains copies of the SRA subset of DRS records served by our drs server. The DRS Server actually uses a different database with both SRA and other DRS records, so changes here will not affect the server.
'meta' - Metadata scraped from the SRA. At this time we're missing a few key columns (e.g. run identifiers) but we'll be adding them soon.
'drsmeta' - the two tables above joined together. Prejoining improves performance in Presto, which would otherwise read in both tables and try to join them through our single Presto node.
You can use these to look up a DRS record by SRA metadata -- either by directly querying for the full DRS record (since we happen to mirror them in bigquery), or by doing a more "typical" workflow of looking up a DRS identifier and then querying the DRS server.
To look up a DRS record by id, you can use something like:
https://drs-server.staging.dnastack.com/ga4gh/drs/v1/objects/0bf7f02b-f334-4060-a402-40281cd8e2be
Where 0bf7f02b... is the DRS id. In our tables, we mostly just record this part of the DRS identifier right now. The full proper DRS identifier would be drs://drs-server.staging.dnastack.com/0bf7f02b... and is reported in the DRS record's "self URI".
Note: the DRS server requires some basic auth credentials. I will send these to you shortly.
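The relationship between the bare id, the full DRS identifier and the lookup URL described above can be sketched in Python as follows. This assumes the hostname-based `drs://<host>/<id>` form shown in the message and the `/ga4gh/drs/v1/objects/` path used in the example URL. The function names are hypothetical, and the sketch does not handle the basic auth credentials the server requires.

```python
from urllib.parse import urlparse


def drs_uri(host, object_id):
    """Build the full DRS identifier from a host and the bare id."""
    return f"drs://{host}/{object_id}"


def drs_uri_to_url(uri):
    """Turn a hostname-based drs:// identifier into the HTTPS lookup URL."""
    parsed = urlparse(uri)
    object_id = parsed.path.lstrip("/")
    if parsed.scheme != "drs" or not parsed.netloc or not object_id:
        raise ValueError(f"not a hostname-based DRS URI: {uri!r}")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"


if __name__ == "__main__":
    uri = drs_uri(
        "drs-server.staging.dnastack.com",
        "0bf7f02b-f334-4060-a402-40281cd8e2be",
    )
    print(uri)
    print(drs_uri_to_url(uri))
```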
It would make sense to consider these resources/approaches alongside the SRA DRS Service and ID Exchange. Note that data available on the DNAStack Public Search instance through the dbgap_demo.scr_gecco_susceptibility tables, such as sb_drs_index, are also a route to sequence data represented in SRA, albeit with the data in separate cloud storage. See this code example.
Relevant dialog about the DNAStack tables containing SRA data
On Thu, Jan 7, 2021 at 12:30 PM Fore, Ian wrote:
When you refer to the public SRA metadata did you use their own BigQuery tables? Or did you import into striking-effort-817:ncbi_sra from somewhere else?
https://www.ncbi.nlm.nih.gov/sra/docs/sra-bigquery-examples/
On Thu, Jan 7, 2021 at 12:45 PM Ayman Al Baz wrote:
While our metadata does share many similar fields as the bigquery table provided by NCBI, the metadata we have wasn't collected from the link you provided. We collected the metadata directly from NCBI using NCBI's Entrez API as the Entrez API is more comprehensive than the linked bigquery table.
On Thu, Jan 7, 2021 at 12:53 PM Jim Vlasblom wrote:
Thanks Ayman. I'll also add that this data contains publicly available metadata. Some of the data itself is publicly accessible (if it has a non null access URL), and some of it is not (null/missing access URL).
We've updated our script to grab more metadata, and have created an updated striking-effort-817.ncbi_sra.january2021 table joining metadata to DRS records in the DNAstack DRS server. Some notes on this are here: https://docs.google.com/document/d/17SFjBmr5WyA9WJsGIdubM4FhcGQAY68gbdx2rGD6Xk0/view
In this hackathon exercise SRA would be used as a test case to explore how biological entities (logical level) are handled in relation to the immutable physical objects in DRS.
INSDC ids used in SRA identify logical level/biological entities such as sequencing runs (SRRnnnnnn). The mapping to immutable digital objects (DRS ids) is not as simple as might be expected, for two reasons:
a) SRRs map to more than one digital object (e.g. a cram and a crai file)
b) the immutable digital object(s) to which they map may change (e.g. after alignment to a different reference sequence)
The attached example sdl_example1.txt shows a response for SRR7274638 from the SRA Data Locator as an example of the use case.
Schemas which define the biological entities would provide the data model defining the relationships between the objects.
The exercise would be to test out schemas, searchable via the Discovery Search prototype, which map logical level ids to immutable DRS objects. In the SRA case this could potentially be as simple as using the NCBI implementation of SRA in BigQuery.
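One way to picture such a mapping is the Python sketch below, which uses an in-memory SQLite table as a stand-in for a searchable schema. Everything in it is an assumption for illustration: the table name `run_files`, its columns, the reference labels and the DRS ids are all hypothetical placeholders, not values from SRA or the SRA Data Locator. It shows a run accession mapping to more than one object, and the set of objects differing per reference alignment.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table mapping run accessions to DRS objects.

    All rows are hypothetical placeholders.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE run_files ("
        "run_accession TEXT, drs_id TEXT, filetype TEXT, reference TEXT)"
    )
    conn.executemany(
        "INSERT INTO run_files VALUES (?, ?, ?, ?)",
        [
            ("SRR7274638", "drs-id-cram-v1", "cram", "reference-A"),
            ("SRR7274638", "drs-id-crai-v1", "crai", "reference-A"),
            ("SRR7274638", "drs-id-cram-v2", "cram", "reference-B"),
            ("SRR7274638", "drs-id-crai-v2", "crai", "reference-B"),
        ],
    )
    return conn


def drs_ids_for_run(conn, run_accession, reference):
    """Return {filetype: drs_id} for one run aligned to one reference."""
    rows = conn.execute(
        "SELECT filetype, drs_id FROM run_files "
        "WHERE run_accession = ? AND reference = ? ORDER BY filetype",
        (run_accession, reference),
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    conn = build_demo_db()
    print(drs_ids_for_run(conn, "SRR7274638", "reference-A"))
    print(drs_ids_for_run(conn, "SRR7274638", "reference-B"))
```

The logical id (the SRR) stays stable while the immutable objects behind it are selected by explicit, queryable attributes rather than by parsing filenames.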