implement a less hacky way of specifying eukaryotes / other lineages not in database

ctb commented 4 years ago

since we're currently using the GTDB bacterial/archaeal genome database for decontamination, the automated classification approach doesn't correctly classify eukaryotes. so when cleaning eukaryotic bins from e.g. tara-delmont, we are using a hack (#20) where we specify eukaryotic lineages in a provided-lineages file - see conf/tara-delmont-provided-lineages.csv in the fix_provided_lin branch for an example. each line of that file looks like this:

TARA_PSW_MAG_00136.fa,d__Eukaryota,p__None,c__None,o__None,f__None,g__None,s__None

(you could perfectly well put in real lineages here, but there's no need; in this case, there weren't genus-level lineages for many of them.)
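for reference, here is a minimal sketch of how a file in this format could be read. it is not charcoal's actual loader: the function name `parse_provided_lineages` is made up for illustration, and it assumes only what the example line above shows (a genome filename followed by seven rank columns, each with a `x__` prefix).

```python
import csv
import io


def parse_provided_lineages(fp):
    """Map genome filename -> list of (rank_prefix, name) pairs.

    Hypothetical helper, not part of charcoal; assumes each row is
    'filename,d__...,p__...,c__...,o__...,f__...,g__...,s__...'.
    """
    lineages = {}
    for row in csv.reader(fp):
        if not row:
            continue  # skip empty lines
        genome, *ranks = row
        # split 'd__Eukaryota' into ('d__', 'Eukaryota')
        lineages[genome] = [(rank[:3], rank[3:]) for rank in ranks]
    return lineages


example = (
    "TARA_PSW_MAG_00136.fa,d__Eukaryota,p__None,c__None,"
    "o__None,f__None,g__None,s__None\n"
)
lineages = parse_provided_lineages(io.StringIO(example))
```

with the example row, only the first pair carries real information; the remaining six are the `None` placeholders discussed below.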

this is not great, because you are basically adding nonsense lineages and it doesn't make sense except in terms of internal implementation details.

I'd like to find an alternative / better approach.

we don't want to hardcode eukaryotes, because there's no particular reason not to allow finer grained euk decontamination in the future.

one option would be to have a special keyword like 'ignore', OR to allow NA in each column as a special value. (I don't like blanks because they are hard for non-programmers to verify.)
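a rough sketch of what the keyword / NA option might look like in code. everything here is an assumption for discussion: the names `PLACEHOLDERS` and `trim_lineage` are hypothetical, and the behavior (cut the lineage at the first placeholder, reject blanks outright) is just one possible reading of the idea.

```python
# Hypothetical placeholder values; neither is implemented in charcoal.
PLACEHOLDERS = {"NA", "ignore"}


def trim_lineage(ranks):
    """Keep ranks up to, but not including, the first placeholder.

    Accepts both bare placeholders ('NA') and prefixed ones ('p__NA').
    Blank entries raise an error so they cannot slip through unnoticed.
    """
    kept = []
    for rank in ranks:
        # drop an optional 'x__' prefix before checking the name
        name = rank.split("__", 1)[-1]
        if name == "":
            raise ValueError(f"blank lineage entry: {rank!r}")
        if name in PLACEHOLDERS:
            break
        kept.append(rank)
    return kept


euk = trim_lineage(["d__Eukaryota", "NA", "NA", "NA", "NA", "NA", "NA"])
```

here `euk` ends up holding just the domain-level entry, so nothing downstream would need to know about fake `p__None`-style ranks.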

we could also provide our own little language / YAML config file where you could tell charcoal to do certain things to certain inputs, but ... that adds a lot of complexity.

I suppose another alternative might be to more explicitly separate out the classification and decontamination steps. we could provide a k-mer based classification step and/or a GTDB-Tk step that produced a provided-lineages table; decontamination would then always ask for provided lineages in some format, and would assume, if none was provided, that it should just remove anything that matched. that is,

charcoal classify would produce a lineages table that could be used as a provided-lineages file.

charcoal decontam or charcoal filter would then do the actual work of decontam.
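to make the split concrete, here is a sketch of how the two subcommands could be wired up with argparse. these subcommands are only proposed in this issue and do not exist; the option names (`--output`, `--lineages`) and the default filename are placeholders I made up.

```python
import argparse


def build_parser():
    """Hypothetical CLI layout for the proposed classify/decontam split."""
    parser = argparse.ArgumentParser(prog="charcoal")
    sub = parser.add_subparsers(dest="command", required=True)

    # step 1: produce a lineages table usable as a provided-lineages file
    classify = sub.add_parser("classify", help="write a lineages table")
    classify.add_argument("genomes", nargs="+")
    classify.add_argument("-o", "--output", default="lineages.csv")

    # step 2: decontaminate, optionally guided by a lineages table
    decontam = sub.add_parser("decontam", help="remove contaminant contigs")
    decontam.add_argument("genomes", nargs="+")
    # None would mean "no lineage given: remove anything that matched"
    decontam.add_argument("--lineages", default=None)
    return parser


args = build_parser().parse_args(
    ["decontam", "TARA_PSW_MAG_00136.fa", "--lineages", "lineages.csv"]
)
```

the point of this layout is that the output of `classify` and a hand-written provided-lineages file would be interchangeable inputs to `decontam`.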

ref #20

taylorreiter commented 4 years ago

Is there a good database for euks that we could run alongside GTDB? I know we have the RNA databases, but we haven't vetted them for contamination (https://osf.io/qk5th/).

ctb commented 4 years ago

ooooooh @bluegenes any thoughts on databases?

bluegenes commented 4 years ago

OrthoDB might be worth exploring?

OrthoDB orthologous genes (protein sequences) may not be useful, but the first step of building OrthoDB is the selection of representative genomes: we could find these to build a db.

It's not clear whether they do any contamination checking or filtering during selection.

from v9 paper:

"Protein-coding gene translations were retrieved for vertebrates and plants from Ensembl (13), for arthropods from AgripestBase, AphidBase (14), BeetleBase (15), DiamondBackMoth-DB (16), FlyBase (17), Hymenoptera Genome Database (18), NCBI (19), SilkDB (20), VectorBase (21), wFleaBase (22), as well as the i5K pilot project (23) and several other genome consortia. Gene sets for the additional metazoan species were retrieved from the Joint Genome Institute (24). The fungal and viral gene sets were sourced from UniProt (25). We retrieved bacterial and archaeal genomes from Ensembl Bacteria (26), and selected 3663 bacteria and 345 archaea for orthology analysis that have the most complete annotations, as estimated by the proxy of having the most of complete universal single-copy genes (27,28), and that best sample the genetic diversity to ensure the maximum number of clades are represented and to reduce oversampling of certain clades. In the case of strains of the same species the gene set with the highest number of unique genes was kept for orthology analysis."

v10 paper does not describe any additional info on genome selection - just describes expansion of lineages to: "1271 eukaryotes, 5609 bacteria, 404 archaea and 6488 viruses"

the software for finding orthologs is described in the v8 paper, but the source link is broken

edit: link to species file: https://v101.orthodb.org/download/odb10v1_species.tab.gz

ctb commented 4 years ago

chatted with @bluegenes about this today - very promising idea:

"the first step of building OrthoDB is the selection of representative genomes: we could find these to build a db."

we may have to run charcoal on them first tho 😂
