biocodellc / geome-db


Replace default phylum controlled vocabulary with NCBI Phyla #58

Open ericcrandall opened 3 years ago

ericcrandall commented 3 years ago

As we are pushing metadata to NCBI now, it would make sense for the default controlled vocabulary to contain the phyla from the NCBI taxonomy. The list can be obtained with the R taxize command below and is attached.

library(taxize)
ncbi_phyla <- downstream(sci_id = "cellular organisms", db = "ncbi", downto = "phylum", intermediate = FALSE)

NCBI_Phyla.csv

ericcrandall commented 3 years ago

Or if not replace, at least add these to it.

ericcrandall commented 3 years ago

Or, thinking further, we won't be pushing taxonomy back to NCBI at all, but we should still include their phyla in our default controlled vocabulary.

jdeck88 commented 3 years ago

Attached is a comparison of phyla between GEOME and NCBI. There is not as much agreement between the two as I would like to see! Several options:

  1. Add all the NCBI phyla to the GEOME list and we end up with a list that is about 40 names longer.
  2. Only use the NCBI phyla and force all future uploads into GEOME to adopt the new taxonomy (will not change existing data unless a user tries to reload).
  3. Try to reconcile the two taxonomies in some way.
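
For a rough sense of what options 1 and 2 would do to the vocabulary, here is a minimal sketch in base R. The two vectors are made-up placeholders for illustration, not the actual GEOME or NCBI lists; in practice they would be read from the GEOME vocabulary and from the attached NCBI_Phyla.csv.

```r
# Hypothetical placeholder lists for illustration only;
# these are NOT the real GEOME or NCBI phylum vocabularies.
geome_phyla <- c("Chordata", "Mollusca", "Arthropoda", "Sipuncula")
ncbi_phyla  <- c("Chordata", "Mollusca", "Arthropoda", "Tardigrada")

# Names the two lists agree on
shared <- intersect(geome_phyla, ncbi_phyla)

# Names that option 1 would add to the GEOME list
ncbi_only <- setdiff(ncbi_phyla, geome_phyla)

# Names that option 2 would no longer accept for new uploads
geome_only <- setdiff(geome_phyla, ncbi_phyla)

# Option 1: combined vocabulary (GEOME order first, then new NCBI names)
option1 <- union(geome_phyla, ncbi_phyla)

# Option 2: NCBI phyla only
option2 <- ncbi_phyla

# How many names option 1 adds to the current list
n_added <- length(option1) - length(geome_phyla)
```

The `geome_only` set is probably the one to review first under option 3, since those are the names that would need to be mapped to an NCBI equivalent.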

NCBI_Phyla.csv

jdeck88 commented 3 years ago

Chris probably knows more than I do, but it seems like Catalog of Life is trying to reconcile ITIS and GBIF and may be the best authority. What did Biocode use as the source of phyla originally?

But if we are pushing data to NCBI, then we really should include their taxonomy. For example, in the datathon, since we were adding metadata retrospectively to SRA projects, we queried the NCBI taxonomy. So I would favor option 1. In theory, the phyla could be reconciled later, right?

Eric

ericcrandall commented 3 years ago

Yikes, next time I'll reply on GitHub