edkerk closed this issue 3 years ago
This goes a lot further than my intention with #13. Is there no stable genome identifier at all?
Both kbase and Patric offer stable genome IDs. Other platforms may also do so. I think a stable id, if truly stable and truly accessible, would work fine.
(Email reply to Moritz E. Beber's comment of Thursday, August 13, 2020 on MetabolicAtlas/standard-GEM issue #17, "Include genome (+ annotation)".)
Following the pointers above from @cshenry (thank you), I have found that only PATRIC and GenBank have publicly available genome identifiers; one can only get so far in KBase without having to log in. Personally, I do not consider any of these to follow FAIR principles, but having a genome ID in the template is an improvement nevertheless.
What about assemblies at NCBI? For example, https://www.ncbi.nlm.nih.gov/assembly/GCF_000007565.2/ lists:

- GenBank assembly accession: GCA_000007565.2 (latest)
- RefSeq assembly accession: GCF_000007565.2 (latest)

I would hope that those are stable?
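To make the structure of these accessions concrete, here is a minimal Python sketch that splits one into its parts. The regular expression (a `GCA_`/`GCF_` prefix, nine digits, and an optional `.version` suffix) is an assumption based on the two accessions listed above, not an official specification, and `parse_assembly_accession` is a hypothetical helper name.

```python
import re

# Assumed pattern, inferred from the accessions quoted in this thread:
# "GCA_" or "GCF_", nine digits, then an optional ".<version>" suffix.
ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{9})(?:\.(\d+))?$")


def parse_assembly_accession(accession):
    """Split an assembly accession into source, number, and version."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"not a recognised assembly accession: {accession!r}")
    prefix, number, version = match.groups()
    return {
        # GCA_ is the GenBank assembly, GCF_ the RefSeq assembly.
        "source": "GenBank" if prefix == "GCA" else "RefSeq",
        "number": number,
        # None means the accession was given without a version suffix.
        "version": int(version) if version is not None else None,
    }


print(parse_assembly_accession("GCF_000007565.2"))
```

If the template stores the accession with its version suffix, the reference points at one specific assembly version rather than whatever is "latest".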
They do exist at identifiers.org, so that's a plus: https://registry.identifiers.org/registry/insdc.gca and https://registry.identifiers.org/registry/refseq.
Good idea about RefSeq! Here is the comparison to GenBank:
The GenBank archival sequence database includes publicly available DNA sequences submitted from individual laboratories and large-scale sequencing projects. GenBank is part of the International Nucleotide Sequence Database Collaboration (INSDC) along with the European Nucleotide Archive and the DNA Data Bank of Japan (DDBJ). Submitted sequence data is exchanged daily between the three collaborators to achieve comprehensive worldwide coverage. As an archival database, GenBank can be very redundant for some loci. GenBank sequence records are owned by the original submitter and cannot be altered by a third party. RefSeq sequences are not part of the INSDC but are derived from INSDC sequences to provide non-redundant curated data representing our current knowledge of known genes. Some records include sequence information gathered from more than one INSDC record. Records may include sequence, descriptive information, publications, or feature annotation that is not available from any single INSDC record. RefSeq records are owned by NCBI and therefore can be updated as needed to maintain current annotation or to incorporate additional information. Also see the appendix provided in the NCBI Handbook, GenBank chapter. Another distinction is that transcripts and proteins annotated on RefSeq genomic records are instantiated as separate records; in contrast, GenBank only instantiates the proteins annotated on genomic sequence records.
~Moreover, RefSeq has an identifiers.org profile which makes the compact identifier both human readable and useful (e.g. `refseq:NP_012345`).~ Never mind, that is only for protein IDs.
race condition over there with the edits :)
`insdc.gca:GCF_000007565.2` seems to work nicely
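For anyone wanting to turn such a compact identifier into a link, a small Python sketch follows. It assumes the usual identifiers.org form `https://identifiers.org/<prefix>:<accession>`; the function name `compact_identifier_url` is hypothetical, and the sketch only builds the URL without checking that it resolves.

```python
def compact_identifier_url(compact_id, resolver="https://identifiers.org"):
    """Build a resolver URL from a compact identifier like 'prefix:accession'."""
    # Split on the first colon only, so the accession is kept intact.
    prefix, sep, accession = compact_id.partition(":")
    if not sep or not prefix or not accession:
        raise ValueError(f"expected 'prefix:accession', got {compact_id!r}")
    return f"{resolver}/{prefix}:{accession}"


print(compact_identifier_url("insdc.gca:GCF_000007565.2"))
```

Storing the compact identifier in the template and deriving the URL on demand keeps the stored value independent of any one resolver.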
I believe this issue is resolved in the linked PRs - please reopen this issue if needed.
As raised by @cshenry:
Seems like a valid point. Not convinced about the compressed copy, I'm always happier to avoid binary files in git.