Open cmungall opened 3 years ago
The pattern is:
^SAM[NED](\w)?\d+$
The N/E/D indicates which of the 3 partner databases registered the sample.
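As a rough illustration of how that prefix letter can be read off an accession, here is a minimal Python sketch. The regex is the one quoted above. The function name `registering_partner` and the lookup table are hypothetical names made up for this example, and the letter-to-database mapping (N = NCBI, E = EBI, D = DDBJ) is my reading of the INSDC accession convention rather than anything taken from the registry itself.

```python
import re

# Pattern quoted in the issue for INSDC BioSample accessions.
BIOSAMPLE_PATTERN = re.compile(r"^SAM[NED](\w)?\d+$")

# Assumed mapping from the fourth character to the registering partner.
PARTNER_BY_LETTER = {"N": "NCBI", "E": "EBI", "D": "DDBJ"}


def registering_partner(accession):
    """Return the partner database that registered the sample, or None
    if the accession does not match the BioSample pattern."""
    if not BIOSAMPLE_PATTERN.match(accession):
        return None
    # The character right after the "SAM" prefix is the N/E/D letter.
    return PARTNER_BY_LETTER[accession[3]]


if __name__ == "__main__":
    for acc in ("SAMN12345678", "SAMEA104228123", "SAMD00000001", "SRR123456"):
        print(acc, registering_partner(acc))
```

Note that `(\w)?` also matches a digit, so the pattern is somewhat looser than "optional letter, then digits" might suggest.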
This applies to other INSDC types too, see https://ena-docs.readthedocs.io/en/latest/submit/general-guide/accessions.html
For contrast, for INSDC SRA/ENA we have
The former seems to be truly the INSDC record (although it uses the NCBI/DDBJ centric "SRA", and lists NCBI as homepage)
The latter is just EBI ENA, although of course the IDs are interchangeable.
Poor DDBJ gets no mentions:
And there are no entries for bioproject/biosample/sequence reads/experiments for poor NCBI:
I think this partly reflects identifiers.org's Euro-centric view...
I think there needs to be clear and consistent treatment of INSDC identifiers across the registry
Either lump into one for each type OR split, and make 3 consistent entries that are linked together somehow
To confuse things further, there is
http://bioregistry.io/registry/mgnify.samp
but mgnify doesn't mint its own IDs. It uses secondary INSDC sample IDs.
There's a metadata field in the bioregistry to mark stuff like this (provides). Another example is ctd.gene, which comes from identifiers.org but is actually just NCBI Gene identifiers.
I would like to bring this up at the workshop; it's a priority for the National Microbiome Data Collaborative
I'm open to whatever solution works best. I think the technical infrastructure is all there so I think we can keep time at the workshop to be more philosophical
I think this record http://bioregistry.io/registry/biosample
is a bit confusing w.r.t whether it is for EBI biosample IDs or INSDC biosample IDs
Of course, for samples, EBI/DDBJ/NCBI all use the same sample IDs. The records are largely the same, but I have cases where the metadata is different.
I think the current entry is implicitly about the EBI biosample rendering:
The BioSample Database stores information about biological samples used in molecular experiments, such as sequencing, gene expression or proteomics. It includes reference samples, such as cell lines, which are repeatedly used in experiments. Accession numbers for the reference samples will be exchanged with a similar database at NCBI, and DDBJ (Japan). Record access may be affected due to different release cycles and inter-institutional synchronisation.
If this record is really about the EBI one then I think this needs to be made much more explicit, and we should add new entries for NCBI biosample and DDBJ biosample
I don't think this would be a good decision though. I think it would be better to abstract this and to make it explicitly about INSDC biosample, and to treat EBI/DDBJ/NCBI as alternate resolvers for the same ID
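To make the "alternate resolvers" idea concrete, here is a small Python sketch of one abstract INSDC BioSample entry with three resolver URL templates attached. It is only an illustration, not how the bioregistry models providers. The names `INSDC_BIOSAMPLE_RESOLVERS` and `resolve_all` are hypothetical, and the URL templates are my assumptions about each partner's public sample pages and should be checked before use.

```python
# One abstract prefix, several resolvers for the same local identifier.
# All three URL templates are assumptions, not values from the registry.
INSDC_BIOSAMPLE_RESOLVERS = {
    "ncbi": "https://www.ncbi.nlm.nih.gov/biosample/{id}",
    "ebi": "https://www.ebi.ac.uk/biosamples/samples/{id}",
    "ddbj": "https://ddbj.nig.ac.jp/resource/biosample/{id}",
}


def resolve_all(accession):
    """Return every partner's URL for the same BioSample accession."""
    return {
        partner: template.format(id=accession)
        for partner, template in INSDC_BIOSAMPLE_RESOLVERS.items()
    }


if __name__ == "__main__":
    for partner, url in resolve_all("SAMN00000001").items():
        print(partner, url)
```

The point of this shape is that the identifier is registered once, and the choice of EBI, DDBJ, or NCBI becomes a rendering preference rather than a separate registry entry.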