add sequencing_data_pid and index file

schristley commented 3 years ago

closes #435

Also added CURIE as an x-airr format.

bcorrie commented 2 years ago

@schristley I don't know what the index_file is, I am not familiar with this file, can you explain?

schristley commented 2 years ago

It's a third read spot, e.g. in this SRA record. It's called an index read and some tools may use it when processing the raw data. I don't think it's necessary for more recent versions of CellRanger, for example, but if the user uploads into the SRA record, I figured we might want to annotate it.

schristley commented 2 years ago

I changed sequencing_data_pid to sequence_data_ref for consistency with the object name and the semantics of _ref in the germline objects.

Is this consistent with the intended meaning of _pid, _gid, and _ref. We are only using _ref (for external/global persistent identifiers), yes?

I don't know. Do we plan to change the names of study_id, pub_ids, etc to be _ref?

javh commented 2 years ago

As far as I know, we don't plan to change study_id and pub_id. I guess I'm also asking if _pid/_gid should be introduced or if that's already covered by _ref.

schristley commented 2 years ago

As far as I know, we don't plan to change study_id and pub_id. I guess I'm also asking if _pid/_gid should be introduced or if that's already covered by _ref.

I'm not aware of specific decisions on which to use when. Not sure why I used _pid initially... Should it be sequencing_data_ref? Or just sequencing_data_id which matches other fields? Either way, the test data will need to be modified with the right name for the tests to pass.

javh commented 2 years ago

_ref or _id makes sense to me. In the absence of any specific reason not to, I prefer _id. But, either seems fine given the language of AlleleDescription::allele_description_ref and AlleleDescription::allele_description_id in #593.

schristley commented 2 years ago

_ref or _id makes sense to me. In the absence of any specific reason not to, I prefer _id. But, either seems fine given the language of AlleleDescription::allele_description_ref and AlleleDescription:: allele_description_id in #593.

I'm not completely sure I understand _ref, or why there needs to both _ref and _id for AlleleDescription. I too would probably prefer _id, just to be consistent with existing fields in the core schema, until there were specific rules about naming.

javh commented 2 years ago

That's a chorus of 2 for _id. I'll take it. Swapped it to raw_sequence_data_id.

I have question though. Should this object be SequencingData instead of RawSequenceData? It would be more consistent with the SequencingRun object and I'm skeptical of the word "raw".

schristley commented 2 years ago

I have question though. Should this object be SequencingData instead of RawSequenceData? It would be more consistent with the SequencingRun object and I'm skeptical of the word "raw".

That's just a reflection of MiAIRR wording which mentions "raw sequencing data".

javh commented 2 years ago

All right, I removed the "raw" bit. I'm good to merge this if we're all agreed.

bussec commented 2 years ago

@schristley My thoughts on this:

The SRA CURIE prefix resolves to string containing a query component ?run= which has two issues: a. In my interpretation of the W3C CURIE recommendations, the query component is only allowed in the reference, not the prefix b. The query searches a run along with an SRR (=run) identifier. However,SRA also contains other objects than run, so the SRA prefix could be misleading
Please note that the CURIE resolver section in the Schema has now changed due to the merge of the OntoVoc 04/21 PR (#524). This make the spilt between the identifier function of the URI and the data retrieval function is now split.
I agree that we should use different suffixes (_id,_ref, _gid, _pid) consistently and only if the distinction is meaningful. From my perspective an _id is a catch-all term, that does not imply anything in terms of persistence, global uniqueness or resolvability. In contrast, a _gid MUST be globally unique and a _pid MUST be a "real" PID, i.e., persistent, globally unique and resolvable (the various INSDC ID's fulfill these criteria). IIRC we introduced _ref to indicate that fact that these are external identifiers for which we cannot guarantee persistence, i.e., more like with URLs.

schristley commented 2 years ago

So I guess this is the first field with trying to use CURIE for an external identifier instead of for an ontology? Should we punt on the CURIE stuff and just add it to #465 with the other ids?

javh commented 2 years ago

If we go with _id, then I think punting the problem to future-us (v2) should be fine (for this particular object at least).

bcorrie commented 2 years ago

I am not sure we can't get this done in v1.4. I think the main question is:

The SRA CURIE prefix resolves to string containing a query component ?run= which has two issues: a. In my interpretation of the W3C CURIE recommendations, the query component is only allowed in the reference, not the prefix value b. The query searches a run along with an SRR (=run) identifier. However,SRA also contains other objects than run, so the SRA prefix could be misleading

For part b) I think as long as SRA:SRR11610494 resolves using the CURIE this should be OK. The fact that the URL the prefix resolves to can be used with some other ID shouldn't prevent us from using it should it?

For part a) @bussec can you elaborate on your concern - I don't quite follow...

javh commented 2 years ago

Any objections to merging this mostly as is, but with a flexible definition for sequencing_data_id, and then bumping the CURIE topic to V2?

bussec commented 2 years ago

@bcorrie @schristley

For part b) I think as long as SRA:SRR11610494 resolves using the CURIE this should be OK. The fact that the URL the prefix resolves to can be used with some other ID shouldn't prevent us from using it should it?

For a description of SRA's data model see here. I would prefer if the SRA prefix would be able to resolve all SRA objects, not only "runs" (otherwise we might have to introduce e.g. an SRA_SRX prefix at a later point).

What seems to work is using https://www.ncbi.nlm.nih.gov/sra/ as the replacement string, which would also address (a), but it resolves to the main NCBI portal, not the trace archive directly. If this is not an issue for any of your current applications, this would IMO be the best solution.

For part a) @bussec can you elaborate on your concern - I don't quite follow...

Apologies, it should have been: "the query component is only allowed in the reference, not the prefix value". But re-reading the W3C TR again, it is certainly not prohibited to do this, so we might be fine. My main concern is that third-party libraries/APIs might not tolerate the presence of a query component in an identifier string.

bcorrie commented 2 years ago

For part b) I think as long as SRA:SRR11610494 resolves using the CURIE this should be OK. The fact that the URL the prefix resolves to can be used with some other ID shouldn't prevent us from using it should it?

For a description of SRA's data model see here. I would prefer if the SRA prefix would be able to resolve all SRA objects, not only "runs" (otherwise we might have to introduce e.g. an SRA_SRX prefix at a later point).

What seems to work is using https://www.ncbi.nlm.nih.gov/sra/ as the replacement string, which would also address (a), but it resolves to the main NCBI portal, not the trace archive directly. If this is not an issue for any of your current applications, this would IMO be the best solution.

That works for me...

bcorrie commented 2 years ago

For part a) @bussec can you elaborate on your concern - I don't quite follow...

Apologies, it should have been: "the query component is only allowed in the reference, not the prefix value". But re-reading the W3C TR again, it is certainly not prohibited to do this, so we might be fine. My main concern is that third-party libraries/APIs might not tolerate the presence of a query component in an identifier string.

I would have thought that having a prefix involve a query would not be uncommon - we have the same issue with the ENSEMBL CURIE resolution.

    ENSG:
        type: identifier
        default:
            map: ENSG
        map:
            ENSG:
                iri_prefix: "https://www.ensembl.org/Multi/Search/Results?q="

schristley commented 2 years ago

@bcorrie @schristley

For a description of SRA's data model see here. I would prefer if the SRA prefix would be able to resolve all SRA objects, not only "runs" (otherwise we might have to introduce e.g. an SRA_SRX prefix at a later point).

How about we use SRR then? This structure only allows referencing a single SRA run, so it shouldn't be used for an SRA experiment or other SRA objects.

bussec commented 2 years ago

In general, I am open for both potential solutions: Either using SRR as an SRA Read-specific CURIE prefix or using SRA as a Service-specific one. Unfortunately it seems to be difficult to be consistent across services, as ENA does not provide data type specific URLs. So maybe the question is where to we expect these URIs to land?

schristley commented 2 years ago

Not sure if there are data type specific URLs for all SRA objects either. A generic prefix for SRA might be https://www.ncbi.nlm.nih.gov/sra?term=

I do know if doing programmatic access, say to download files, you wouldn't want to parse these HTML pages, instead would use the E-utilities API to get structured data from a search. So I think that implies, the URIs should land at a human-readable page for that identifier. The run-specific URI brings you to SRA trace page for that specific run. With the generic URI, it takes you to the SRA page for the associated experiment. Both seem acceptable to me.

bcorrie commented 2 years ago

If these are truly mirrored at INSDC archives, then we should probably use SRR as the prefix (as that is what we would expect the data object to be) and have two maps like we do for NCBITAXON, no?

    SRR:
        type: curie
        default:
            map: SRA
        map:
            SRA:
                iri_prefix: "https://www.ncbi.nlm.nih.gov/sra/"
            ENA:
                iri_prefix: "https://www.ebi.ac.uk/ena/browser/view/"

So SRR:SRR11610494 resolves to either

Both of which are "correct"???

schristley commented 2 years ago

we should probably use SRR as the prefix (as that is what we would expect the data object to be) and have two maps like we do for NCBITAXON, no?

Using SRR to be specific to the data object sounds good, though I don't like the generic IRI prefix, because it's promiscuous and allows more than SRR identifiers. For example, SRR:PRJNA628125 resolves just fine. That's why I prefer my original IRI prefix (https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=) because it can be used for automatic validation. If the id resolves then you know it's a valid SRR identifier, otherwise not, so stuff like SRR:PRJNA628125 can be rejected. It's a small thing but kinda nice.

@bussec Do you know if there is an ENA equivalent for this IRI prefix for SRA run?

Even though SRA and ENA share data and identifiers, I would be tempted for them to have different CURIE names; that is use ERR for ENA, to avoid being SRA centric.

Lastly, if we don't want the SRR specific IRI prefix, but keep it general, I would use SRA/ENA instead because that general IRI prefix can be used for all SRA/ENA identifiers

bussec commented 2 years ago

Had a more extended look at the ENA documentation:

TL;DR: ENA does not use distinct URLs for the different classes in their data model

While the ENA data model does known study, sample, experiment, runs, .. the distinction between them seems to rely exclusively on the accession numbers. This is in line with ENAs presenations as an integrated repo on the [INSDC website].
The URL that should be used to reference data is https://www.ebi.ac.uk/ena/browser/view/<accession_number>. The only alternative endpoint to /ena/browser/view/ that I could find is /ena/browser/api/xml/, which delivers the same information in XML. No, there is not /ena/browser/api/json/ :-(
Further note that ENA has distinct identifiers for Study and Project for compatibility with some legacy data model (see here for detail, second paragraph below the table).

My interpretation of all of this is that there is no such thing as an IRI in INSDC, there are only accession numbers and URLs to request the respective records. Therefore I see two potential ways forward (leaning towards A):

Proposal A

Generic prefix
Service specific InformationProvider URLs
No IRI expansion, respective fields in CURIEMap could also be NULLed.
Tolerates mismatched IDs
Example: SRA_ENA:SRR11610494

CURIEMap:
    SRA_ENA:
        type: identifier
        default:
            map: SRA
            provider: SRA
        map:
            SRA:
                iri_prefix: ""
            ENA:
                iri_prefix: ""
InformationProvider:
    provider:
        SRA:
            request:
                url: "https://ncbi.nlm.nih.gov/sra/{local_id}?report=fullxml&format=text"
                response: application/xml
        ENA:
            request:
                url: "https://www.ebi.ac.uk/ena/browser/api/xml/{local_id}"
                response: application/xml

Proposal B

Prefix specific for Runs
Service specific mapping and InformationProvider URLs
Mismatches IDs will create invalid "IRIs"
Example: SRR:11610494

CURIEMap:
    SRR:
        type: identifier
        default:
            map: SRA
            provider: SRA
        map:
            SRA:
                iri_prefix: "SRR"
    ERR:
        type: identifier
        default:
            map: ENA
            provider: ENA
        map:
            ENA:
                iri_prefix: "ERR"
InformationProvider:
    provider:
        SRA:
            request:
                url: "https://ncbi.nlm.nih.gov/sra/{iri}?report=fullxml&format=text"
                response: application/xml
        ENA:
            request:
                url: "https://www.ebi.ac.uk/ena/browser/api/xml/{iri}"
                response: application/xml

bussec commented 2 years ago

There are still some merge conflicts, but these are trivial to resolve.

javh commented 2 years ago

Merge conflicts should be resolved. This looks incorrect to me, but I didn't touch it - assuming it will be fixed with the SRA/ENA structure:

    SRA:
        type: identifier
        default:
            map: SRA
        map:
            ENA:
                iri_prefix: "https://trace.ncbi.nlm.nih.gov/Traces/sra/?run="

bussec commented 2 years ago

@javh The SRA mislabeling has been resolved.

bussec commented 2 years ago

Ok, validation problems resolved, back to the merge issues. git sledgehammer anyone?

javh commented 2 years ago

I don't see any merge conflicts right now... Thanks, @bussec.

bussec commented 2 years ago

@javh I see a "This branch cannot be rebased due to conflicts" message and its the same on the command line. If we agree that this can be merged as it is I can work on this tomorrow.

javh commented 2 years ago

@bussec is that in your local environment? GitHub doesn't seem to care. If it's just local, then you can try a git reset --hard on your local branch.

bussec commented 2 years ago

Fresh browser instance, new login:

javh commented 2 years ago

Well that is super confusing. I see this:

Want me to try merging now? Is it ready?

bussec commented 2 years ago

Hmm... it seems to be related to the mode of merging: If you switch from "Squash and merge" to "Rebase and merge" the button should get grayed out...

Want me to try merging now? Is it ready?

Ready from my side.

javh commented 2 years ago

Ah, okay. That's fine then.

airr-community / airr-standards

add sequencing_data_pid and index file #564

Proposal A

Proposal B