clarification on biosample: is this an INSDC ID or an EBI biosample ID

cmungall commented 3 years ago

I think this record http://bioregistry.io/registry/biosample

is a bit confusing w.r.t. whether it is for EBI biosample IDs or INSDC biosample IDs.

Of course, for samples, EBI/DDBJ/NCBI all use the same sample IDs. The records are largely the same, but I have cases where the metadata differs.

I think the current entry is implicitly about the EBI biosample rendering:

The BioSample Database stores information about biological samples used in molecular experiments, such as sequencing, gene expression or proteomics. It includes reference samples, such as cell lines, which are repeatedly used in experiments. Accession numbers for the reference samples will be exchanged with a similar database at NCBI, and DDBJ (Japan). Record access may be affected due to different release cycles and inter-institutional synchronisation.

If this record is really about the EBI one, then I think this needs to be made much more explicit, and we should add new entries for NCBI biosample and DDBJ biosample.

I don't think this would be a good decision though. I think it would be better to abstract this and to make it explicitly about INSDC biosample, and to treat EBI/DDBJ/NCBI as alternate resolvers for the same ID
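To illustrate the "alternate resolvers for the same ID" idea, here is a minimal sketch in Python. The URL templates and the names `RESOLVERS` and `resolver_urls` are assumptions for illustration only; they are not taken from the registry and should be checked against each partner's site.

```python
# Hypothetical URL templates for the three INSDC partners (assumed, not
# taken from the Bioregistry record). "{}" is replaced by the accession.
RESOLVERS = {
    "ncbi": "https://www.ncbi.nlm.nih.gov/biosample/{}",
    "ebi": "https://www.ebi.ac.uk/biosamples/samples/{}",
    "ddbj": "https://ddbj.nig.ac.jp/resource/biosample/{}",
}


def resolver_urls(accession):
    """Return one candidate URL per partner for the same sample accession."""
    return {name: template.format(accession) for name, template in RESOLVERS.items()}
```

Under this model there would be a single INSDC biosample entry, and each partner is just another provider of the same identifier space.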

cmungall commented 3 years ago

The pattern is:

^SAM[NED](\w)?\d+$

The N/E/D indicates which of the 3 partner databases registered the sample.

This applies to other INSDC types too, see https://ena-docs.readthedocs.io/en/latest/submit/general-guide/accessions.html
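As a rough sketch of how the pattern above could be used to tell which partner registered a sample, assuming Python's `re` module. The only change to the pattern is a capture group around `[NED]`; the names `BIOSAMPLE_RE`, `REGISTERED_BY`, and `registering_database` are made up for this example.

```python
import re
from typing import Optional

# Same pattern as quoted above, with a capture group added around [NED]
# so the registering partner can be read off the match.
BIOSAMPLE_RE = re.compile(r"^SAM([NED])(\w)?\d+$")

# Letter following "SAM" -> INSDC partner that registered the sample.
REGISTERED_BY = {"N": "NCBI", "E": "EBI", "D": "DDBJ"}


def registering_database(accession: str) -> Optional[str]:
    """Return the registering partner, or None if the accession does not match."""
    match = BIOSAMPLE_RE.match(accession)
    if match is None:
        return None
    return REGISTERED_BY[match.group(1)]
```

The prefix only says who registered the sample; it does not say which partner's copy of the record is being viewed, which is the source of the ambiguity in this issue.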

cmungall commented 3 years ago

For contrast, for INSDC SRA/ENA we have

The former seems to be truly the INSDC record (although it uses the NCBI/DDBJ-centric name "SRA" and lists NCBI as the homepage).

The latter is just EBI ENA. Although of course the IDs are interchangeable

Poor DDBJ gets no mentions:

And there are no entries for bioproject/biosample/sequence reads/experiments for poor NCBI:

I think this partly reflects identifiers.org's Euro-centric view...

I think there needs to be clear and consistent treatment of INSDC identifiers across the registry

Either lump into one for each type OR split, and make 3 consistent entries that are linked together somehow

cmungall commented 3 years ago

To confuse things further, there is

http://bioregistry.io/registry/mgnify.samp

but mgnify doesn't mint its own IDs. It uses secondary INSDC sample IDs.

cthoyt commented 2 years ago

> To confuse things further, there is
>
> http://bioregistry.io/registry/mgnify.samp
>
> but mgnify doesn't mint its own IDs. It uses secondary INSDC sample IDs.

There's a metadata field in the bioregistry to mark stuff like this (provides). Another example is ctd.gene, which comes from identifiers.org but is actually just NCBI Gene identifiers.
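A toy sketch of how a "provides" relation can be followed to find the prefix that actually owns the identifiers. This uses a plain dictionary, not the Bioregistry API; the prefix name `ncbigene` and the `mgnify.samp` mapping are assumptions made for illustration.

```python
# Toy data only: a hypothetical slice of a registry, not real Bioregistry content.
TOY_REGISTRY = {
    "ncbigene": {},
    "biosample": {},
    # From the comment above: ctd.gene is just NCBI Gene identifiers.
    "ctd.gene": {"provides": "ncbigene"},
    # Assumed for illustration: mgnify.samp reuses INSDC sample IDs.
    "mgnify.samp": {"provides": "biosample"},
}


def canonical_prefix(prefix, registry=TOY_REGISTRY):
    """Return the prefix whose identifiers are reused, or the prefix itself."""
    return registry[prefix].get("provides", prefix)
```

The same mechanism could, in principle, link partner-specific entries to a single INSDC entry if the registry ends up splitting them.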

cmungall commented 2 years ago

I would like to bring this up at the workshop; it's a priority for the National Microbiome Data Collaborative.

cthoyt commented 2 years ago

I'm open to whatever solution works best. I think the technical infrastructure is all there, so we can keep the time at the workshop for the more philosophical questions.

biopragmatics / bioregistry

clarification on biosample: is this an INSDC ID or an EBI biosample ID #108