Closed schristley closed 2 years ago
@schristley I don't know what the index_file is, I am not familiar with this file, can you explain?
It's a third read spot, e.g. in this SRA record. It's called an index read and some tools may use it when processing the raw data. I don't think it's necessary for more recent versions of CellRanger, for example, but if the user uploads into the SRA record, I figured we might want to annotate it.
I changed
sequencing_data_pid
tosequence_data_ref
for consistency with the object name and the semantics of_ref
in the germline objects.Is this consistent with the intended meaning of
_pid
,_gid
, and_ref
. We are only using_ref
(for external/global persistent identifiers), yes?
I don't know. Do we plan to change the names of study_id
, pub_ids
, etc to be _ref
?
As far as I know, we don't plan to change study_id
and pub_id
. I guess I'm also asking if _pid
/_gid
should be introduced or if that's already covered by _ref
.
As far as I know, we don't plan to change
study_id
andpub_id
. I guess I'm also asking if_pid
/_gid
should be introduced or if that's already covered by_ref
.
I'm not aware of specific decisions on which to use when. Not sure why I used _pid
initially... Should it be sequencing_data_ref
? Or just sequencing_data_id
which matches other fields? Either way, the test data will need to be modified with the right name for the tests to pass.
_ref
or _id
makes sense to me. In the absence of any specific reason not to, I prefer _id
. But, either seems fine given the language of AlleleDescription::allele_description_ref
and AlleleDescription::allele_description_id
in #593.
_ref
or_id
makes sense to me. In the absence of any specific reason not to, I prefer_id
. But, either seems fine given the language ofAlleleDescription::allele_description_ref
andAlleleDescription:: allele_description_id
in #593.
I'm not completely sure I understand _ref
, or why there needs to both _ref
and _id
for AlleleDescription. I too would probably prefer _id
, just to be consistent with existing fields in the core schema, until there were specific rules about naming.
That's a chorus of 2 for _id
. I'll take it. Swapped it to raw_sequence_data_id
.
I have question though. Should this object be SequencingData
instead of RawSequenceData
? It would be more consistent with the SequencingRun
object and I'm skeptical of the word "raw".
I have question though. Should this object be
SequencingData
instead ofRawSequenceData
? It would be more consistent with theSequencingRun
object and I'm skeptical of the word "raw".
That's just a reflection of MiAIRR wording which mentions "raw sequencing data".
All right, I removed the "raw" bit. I'm good to merge this if we're all agreed.
@schristley My thoughts on this:
SRA
CURIE prefix resolves to string containing a query component ?run=
which has two issues:
a. In my interpretation of the W3C CURIE recommendations, the query component is only allowed in the reference, not the prefix
b. The query searches a run along with an SRR (=run) identifier. However,SRA also contains other objects than run, so the SRA prefix could be misleading_id
,_ref
, _gid
, _pid
) consistently and only if the distinction is meaningful. From my perspective an _id
is a catch-all term, that does not imply anything in terms of persistence, global uniqueness or resolvability. In contrast, a _gid
MUST be globally unique and a _pid
MUST be a "real" PID, i.e., persistent, globally unique and resolvable (the various INSDC ID's fulfill these criteria). IIRC we introduced _ref
to indicate that fact that these are external identifiers for which we cannot guarantee persistence, i.e., more like with URLs.So I guess this is the first field with trying to use CURIE for an external identifier instead of for an ontology? Should we punt on the CURIE stuff and just add it to #465 with the other ids?
If we go with _id
, then I think punting the problem to future-us (v2) should be fine (for this particular object at least).
I am not sure we can't get this done in v1.4. I think the main question is:
- The
SRA
CURIE prefix resolves to string containing a query component?run=
which has two issues: a. In my interpretation of the W3C CURIE recommendations, the query component is only allowed in the reference, not the prefix value b. The query searches a run along with an SRR (=run) identifier. However,SRA also contains other objects than run, so the SRA prefix could be misleading
For part b) I think as long as SRA:SRR11610494
resolves using the CURIE this should be OK. The fact that the URL the prefix resolves to can be used with some other ID shouldn't prevent us from using it should it?
For part a) @bussec can you elaborate on your concern - I don't quite follow...
Any objections to merging this mostly as is, but with a flexible definition for sequencing_data_id
, and then bumping the CURIE topic to V2?
@bcorrie @schristley
For part b) I think as long as SRA:SRR11610494 resolves using the CURIE this should be OK. The fact that the URL the prefix resolves to can be used with some other ID shouldn't prevent us from using it should it?
For a description of SRA's data model see here. I would prefer if the SRA
prefix would be able to resolve all SRA objects, not only "runs" (otherwise we might have to introduce e.g. an SRA_SRX
prefix at a later point).
What seems to work is using https://www.ncbi.nlm.nih.gov/sra/
as the replacement string, which would also address (a), but it resolves to the main NCBI portal, not the trace archive directly. If this is not an issue for any of your current applications, this would IMO be the best solution.
For part a) @bussec can you elaborate on your concern - I don't quite follow...
Apologies, it should have been: "the query component is only allowed in the reference, not the prefix value". But re-reading the W3C TR again, it is certainly not prohibited to do this, so we might be fine. My main concern is that third-party libraries/APIs might not tolerate the presence of a query component in an identifier string.
For part b) I think as long as SRA:SRR11610494 resolves using the CURIE this should be OK. The fact that the URL the prefix resolves to can be used with some other ID shouldn't prevent us from using it should it?
For a description of SRA's data model see here. I would prefer if the
SRA
prefix would be able to resolve all SRA objects, not only "runs" (otherwise we might have to introduce e.g. anSRA_SRX
prefix at a later point).What seems to work is using
https://www.ncbi.nlm.nih.gov/sra/
as the replacement string, which would also address (a), but it resolves to the main NCBI portal, not the trace archive directly. If this is not an issue for any of your current applications, this would IMO be the best solution.
That works for me...
For part a) @bussec can you elaborate on your concern - I don't quite follow...
Apologies, it should have been: "the query component is only allowed in the reference, not the prefix value". But re-reading the W3C TR again, it is certainly not prohibited to do this, so we might be fine. My main concern is that third-party libraries/APIs might not tolerate the presence of a query component in an identifier string.
I would have thought that having a prefix involve a query would not be uncommon - we have the same issue with the ENSEMBL CURIE resolution.
ENSG:
type: identifier
default:
map: ENSG
map:
ENSG:
iri_prefix: "https://www.ensembl.org/Multi/Search/Results?q="
@bcorrie @schristley
For a description of SRA's data model see here. I would prefer if the
SRA
prefix would be able to resolve all SRA objects, not only "runs" (otherwise we might have to introduce e.g. anSRA_SRX
prefix at a later point).
How about we use SRR
then? This structure only allows referencing a single SRA run
, so it shouldn't be used for an SRA experiment
or other SRA objects.
In general, I am open for both potential solutions: Either using SRR
as an SRA Read-specific CURIE prefix or using SRA
as a Service-specific one. Unfortunately it seems to be difficult to be consistent across services, as ENA does not provide data type specific URLs. So maybe the question is where to we expect these URIs to land?
Not sure if there are data type specific URLs for all SRA objects either. A generic prefix for SRA might be https://www.ncbi.nlm.nih.gov/sra?term=
I do know if doing programmatic access, say to download files, you wouldn't want to parse these HTML pages, instead would use the E-utilities API to get structured data from a search. So I think that implies, the URIs should land at a human-readable page for that identifier. The run-specific URI brings you to SRA trace page for that specific run. With the generic URI, it takes you to the SRA page for the associated experiment. Both seem acceptable to me.
If these are truly mirrored at INSDC archives, then we should probably use SRR as the prefix (as that is what we would expect the data object to be) and have two maps like we do for NCBITAXON, no?
SRR:
type: curie
default:
map: SRA
map:
SRA:
iri_prefix: "https://www.ncbi.nlm.nih.gov/sra/"
ENA:
iri_prefix: "https://www.ebi.ac.uk/ena/browser/view/"
So SRR:SRR11610494 resolves to either
Both of which are "correct"???
we should probably use SRR as the prefix (as that is what we would expect the data object to be) and have two maps like we do for NCBITAXON, no?
Using SRR to be specific to the data object sounds good, though I don't like the generic IRI prefix, because it's promiscuous and allows more than SRR identifiers. For example, SRR:PRJNA628125
resolves just fine. That's why I prefer my original IRI prefix (https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=
) because it can be used for automatic validation. If the id resolves then you know it's a valid SRR identifier, otherwise not, so stuff like SRR:PRJNA628125
can be rejected. It's a small thing but kinda nice.
@bussec Do you know if there is an ENA equivalent for this IRI prefix for SRA run?
Even though SRA and ENA share data and identifiers, I would be tempted for them to have different CURIE names; that is use ERR for ENA, to avoid being SRA centric.
Lastly, if we don't want the SRR specific IRI prefix, but keep it general, I would use SRA/ENA instead because that general IRI prefix can be used for all SRA/ENA identifiers
Had a more extended look at the ENA documentation:
TL;DR: ENA does not use distinct URLs for the different classes in their data model
https://www.ebi.ac.uk/ena/browser/view/<accession_number>
. The only alternative endpoint to /ena/browser/view/
that I could find is /ena/browser/api/xml/
, which delivers the same information in XML. No, there is not /ena/browser/api/json/
:-(Study
and Project
for compatibility with some legacy data model (see here for detail, second paragraph below the table).My interpretation of all of this is that there is no such thing as an IRI in INSDC, there are only accession numbers and URLs to request the respective records. Therefore I see two potential ways forward (leaning towards A):
InformationProvider
URLsCURIEMap
could also be NULLed.SRA_ENA:SRR11610494
CURIEMap:
SRA_ENA:
type: identifier
default:
map: SRA
provider: SRA
map:
SRA:
iri_prefix: ""
ENA:
iri_prefix: ""
InformationProvider:
provider:
SRA:
request:
url: "https://ncbi.nlm.nih.gov/sra/{local_id}?report=fullxml&format=text"
response: application/xml
ENA:
request:
url: "https://www.ebi.ac.uk/ena/browser/api/xml/{local_id}"
response: application/xml
InformationProvider
URLsSRR:11610494
CURIEMap:
SRR:
type: identifier
default:
map: SRA
provider: SRA
map:
SRA:
iri_prefix: "SRR"
ERR:
type: identifier
default:
map: ENA
provider: ENA
map:
ENA:
iri_prefix: "ERR"
InformationProvider:
provider:
SRA:
request:
url: "https://ncbi.nlm.nih.gov/sra/{iri}?report=fullxml&format=text"
response: application/xml
ENA:
request:
url: "https://www.ebi.ac.uk/ena/browser/api/xml/{iri}"
response: application/xml
There are still some merge conflicts, but these are trivial to resolve.
Merge conflicts should be resolved. This looks incorrect to me, but I didn't touch it - assuming it will be fixed with the SRA/ENA structure:
SRA:
type: identifier
default:
map: SRA
map:
ENA:
iri_prefix: "https://trace.ncbi.nlm.nih.gov/Traces/sra/?run="
@javh The SRA mislabeling has been resolved.
Ok, validation problems resolved, back to the merge issues. git sledgehammer
anyone?
I don't see any merge conflicts right now... Thanks, @bussec.
@javh I see a "This branch cannot be rebased due to conflicts" message and its the same on the command line. If we agree that this can be merged as it is I can work on this tomorrow.
@bussec is that in your local environment? GitHub doesn't seem to care. If it's just local, then you can try a git reset --hard
on your local branch.
Fresh browser instance, new login:
Well that is super confusing. I see this:
Want me to try merging now? Is it ready?
Hmm... it seems to be related to the mode of merging: If you switch from "Squash and merge" to "Rebase and merge" the button should get grayed out...
Want me to try merging now? Is it ready?
Ready from my side.
Ah, okay. That's fine then.
closes #435
Also added CURIE as an x-airr format.