Include genome (+ annotation)

edkerk commented 3 years ago

As raised by @cshenry:

One thing I would consider to be of utmost importance in such a site is to properly represent the genomes linked to the models. Ideally, I would prefer to see the site maintain its own internal compressed copies of GFF and FASTA files for genomes associated with any models stored there. People routinely use genome IDs… but these IDs go away or genes get recalled and it makes things difficult. I would argue a model is nearly useless without its associated genome, and finding the exact correct genome that should be mapped to a particular published model is one of my greatest pain points in trying to use these models in my own research. You could store protein sequences in the model, which would help, but without the genome, you’re still losing some provenance on where the protein came from.

Seems like a valid point. I'm not convinced about the compressed copy, though; I'm always happier to avoid binary files in git.

Midnighter commented 3 years ago

This goes a lot further than my intention with #13. Is there no stable genome identifier at all?

cshenry commented 3 years ago

Both KBase and PATRIC offer stable genome IDs. Other platforms may also do so. I think a stable ID, if truly stable and truly accessible, would work fine.

mihai-sysbio commented 3 years ago

Following the pointers above from @cshenry (thank you), I have found that only PATRIC and GenBank have publicly available genome identifiers; one can only get so far in KBase without having to log in. Personally, I do not consider any of these to follow FAIR principles, but having a genome ID in the template is an improvement nevertheless.

Midnighter commented 3 years ago

What about assemblies at NCBI? For example, https://www.ncbi.nlm.nih.gov/assembly/GCF_000007565.2/ lists:

GenBank assembly accession:
    GCA_000007565.2 (latest)
RefSeq assembly accession:
    GCF_000007565.2 (latest)

I would hope that those are stable?

They do exist at identifiers.org, so that's a plus: https://registry.identifiers.org/registry/insdc.gca and https://registry.identifiers.org/registry/refseq.
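
To make the structure of these accessions concrete, here is a minimal Python sketch that splits an assembly accession into its parts. The regular expression is an assumption inferred from the two accessions listed above (a GCA or GCF prefix, nine digits, and a version suffix), not an official NCBI specification, and the names `AssemblyAccession` and `parse_assembly_accession` are hypothetical.

```python
import re
from typing import NamedTuple

# Assumed pattern, inferred from GCA_000007565.2 / GCF_000007565.2 above:
# prefix, underscore, nine digits, dot, integer version.
_ACCESSION = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")


class AssemblyAccession(NamedTuple):
    source: str   # "GenBank" for GCA, "RefSeq" for GCF
    number: str   # nine-digit assembly number, kept as text to preserve zeros
    version: int  # version suffix after the dot


def parse_assembly_accession(accession: str) -> AssemblyAccession:
    """Split an assembly accession into source, number, and version."""
    match = _ACCESSION.match(accession.strip())
    if match is None:
        raise ValueError(f"not a recognised assembly accession: {accession!r}")
    prefix, number, version = match.groups()
    source = "GenBank" if prefix == "GCA" else "RefSeq"
    return AssemblyAccession(source, number, int(version))
```

Recording the version explicitly matters for the use case in this thread: a model that names only the unversioned assembly would not pin down which revision of the genome it was built against.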

mihai-sysbio commented 3 years ago

Good idea about RefSeq! Here is the comparison to GenBank:

The GenBank archival sequence database includes publicly available DNA sequences submitted from individual laboratories and large-scale sequencing projects. GenBank is part of the International Nucleotide Sequence Database Collaboration (INSDC) along with the European Nucleotide Archive and the DNA Data Bank of Japan (DDBJ). Submitted sequence data is exchanged daily between the three collaborators to achieve comprehensive worldwide coverage. As an archival database, GenBank can be very redundant for some loci. GenBank sequence records are owned by the original submitter and cannot be altered by a third party. RefSeq sequences are not part of the INSDC but are derived from INSDC sequences to provide non-redundant curated data representing our current knowledge of known genes. Some records include sequence information gathered from more than one INSDC record. Records may include sequence, descriptive information, publications, or feature annotation that is not available from any single INSDC record. RefSeq records are owned by NCBI and therefore can be updated as needed to maintain current annotation or to incorporate additional information. Also see the appendix provided in the NCBI Handbook, GenBank chapter. Another distinction is that transcripts and proteins annotated on RefSeq genomic records are instantiated as separate records; in contrast, GenBank only instantiates the proteins annotated on genomic sequence records.

~Moreover, RefSeq has an identifiers.org profile which makes the compact identifier both human readable and useful (e.g. refseq:NP_012345).~ Never mind, that is only for protein IDs.

mihai-sysbio commented 3 years ago

race condition over there with the edits :) insdc.gca:GCF_000007565.2 seems to work nicely
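
A minimal sketch of how such a compact identifier could be assembled in Python. The `insdc.gca` prefix comes from the comment above; the `https://identifiers.org/<prefix>:<accession>` URL form is assumed to be the resolver convention, and the function names are hypothetical. Nothing here contacts the resolver, so it does not check that a given accession actually resolves.

```python
IDENTIFIERS_ORG = "https://identifiers.org"  # assumed resolver base URL


def compact_identifier(accession: str, prefix: str = "insdc.gca") -> str:
    """Join a registry prefix and an accession into 'prefix:accession'."""
    return f"{prefix}:{accession}"


def resolver_url(accession: str, prefix: str = "insdc.gca") -> str:
    """Build the (assumed) identifiers.org URL for a compact identifier."""
    return f"{IDENTIFIERS_ORG}/{compact_identifier(accession, prefix)}"


if __name__ == "__main__":
    print(compact_identifier("GCF_000007565.2"))
    print(resolver_url("GCF_000007565.2"))
```

Storing the compact identifier rather than a full URL keeps the template entry short while still letting tools derive a link when needed.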

mihai-sysbio commented 3 years ago

I believe this issue is resolved in the linked PRs - please reopen this issue if needed.

MetabolicAtlas / standard-GEM

Include genome (+ annotation) #17