How do I decontaminate not only prokaryotes, but unicellular eukaryotes?

SchwarzEM commented 3 years ago

This is related to issue #189 raised earlier.

My student Vladislav and I are using charcoal to decontaminate draft genome assemblies of nematodes. charcoal is wondrously efficient at ridding such assemblies of prokaryotic scaffolds/contigs, if the *.conf file provided to it is in this format:

nematode_draft.fa,d__Eukaryota

However, this retains all 'eukaryota' in the draft assembly, including unicellular eukaryotes such as fungi and oomycetes that it would actually be desirable to decontaminate. An older approach to decontamination involving sourmash (https://github.com/sourmash-bio/sourmash/issues/940) was able to flag such unicellular eukaryotic contaminants, so it would be desirable if charcoal also supported such decontamination.

Is there any straightforward way to make this happen? E.g., by using some taxon in the *.conf file other than d__Eukaryota?

ctb commented 3 years ago

yep! you can do one of two things --

first, you can add as many more taxonomic levels as you like, e.g. nematode_draft.fa,d__Eukaryota,phylum_a,class_a,order_a,family_a,genus_a,species_a. If you ask for filtering at (e.g.) class, then the Right Thing will happen. You will need to provide class-level taxonomic ranks for all of the organisms in the reference db as well, of course.

second, you can just make up a new superkingdom for your nematode_draft, and it will flag anything that doesn't match that superkingdom as contaminants. But you have to make sure that there are no legitimate sequences in the reference database that belong to the d__Eukaryota superkingdom, because they would then be flagged as problematic.

(I know this isn't super clear and it's not easy to convey the mental model, so let me try this alternate approach: everything is based on string matching across taxonomic levels, so if you can get the strings/levels right, you have a lot of flexibility at your fingertips.)
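
To make the string-matching model concrete, here is a minimal Python sketch. It is not charcoal's actual code: the rank order in RANKS, the helper names parse_lineage and is_contaminant, and the reference lineages are all assumptions made for illustration. It compares a genome's provided lineage with a reference match's lineage, level by level, down to the rank chosen for filtering.

```python
# Illustrative sketch only; this is NOT charcoal's implementation.
# RANKS, parse_lineage and is_contaminant are hypothetical names.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def parse_lineage(line):
    """Split 'genome.fa,name1,name2,...' into (genome, tuple_of_names)."""
    genome, *names = [field.strip() for field in line.strip().split(",")]
    return genome, tuple(names)


def is_contaminant(provided, match, rank):
    """True if `match` differs from `provided` at any level down to `rank`.

    Names are compared as plain strings, so 'd__Eukaryota' and 'Eukaryota'
    count as different taxa.
    """
    depth = RANKS.index(rank) + 1
    if len(provided) < depth:
        raise ValueError(f"provided lineage has no name at rank {rank!r}")
    return provided[:depth] != match[:depth]


_, nematode = parse_lineage("nematode_draft.fa,d__Eukaryota,Metazoa_a")
fungus = ("d__Eukaryota", "Fungi_a")              # made-up reference lineage
bacterium = ("d__Bacteria", "p__Proteobacteria")  # made-up reference lineage

print(is_contaminant(nematode, bacterium, "superkingdom"))  # True
print(is_contaminant(nematode, fungus, "superkingdom"))     # False
print(is_contaminant(nematode, fungus, "phylum"))           # True
```

In this sketch the fungal match only gets flagged once the comparison goes one level below the superkingdom, which is why the reference lineages need names at that level too.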

SchwarzEM commented 3 years ago

The basic idea of "add as many more taxonomic levels as you like" sounds both reasonable and clear. For instance, writing:

nematode_draft.fa,d__Eukaryota,Metazoa_a

might be perfect for what we need here.

I do have two questions, though.

1. Why is it that the very first phylogenetic term is prefixed "d__", whereas the subsequent terms are apparently all supposed to be suffixed with "_a"?

2. Other than by guesswork, is there some way to get a correct controlled vocabulary with additional and more granular taxa that I can add? I can add ",Metazoa_a" through sheer guesswork, and that may well work just fine; but is there any source for a directed acyclic graph of appropriate phylogenetic terms which I can look up?

SchwarzEM commented 3 years ago

Using NCBI's taxonomy browser:

https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html

I've tried running the following *.csv file:

nematode_draft.fa,d__Eukaryota,Opisthokonta_a,Metazoa_a

I don't know whether charcoal actually interprets this particular file as intended, but it at least did not flatly refuse to run when given this file...

ctb commented 3 years ago

1. Why is it that the very first phylogenetic term is prefixed "d__", whereas the subsequent terms are apparently all supposed to be suffixed with "_a"?

Oh, that's arbitrary. Exact string matching is used. I put the d__ in there just because GTDB already uses d__Bacteria and d__Archaea.

_a is entirely arbitrary. You can use real taxonomic levels as you wish, but you can also make them up :)

2. Other than by guesswork, is there some way to get a correct controlled vocabulary with additional and more granular taxa that I can add? I can add ",Metazoa_a" through sheer guesswork, and that may well work just fine; but is there any source for a directed acyclic graph of appropriate phylogenetic terms which I can look up?

NCBI should work fine! No _a etc needed.

best, --titus

SchwarzEM commented 3 years ago

I've just done a run using a *.csv like this, with no "d__" or "_a" legacy text whatsoever, and with phylogenetic terms taken straight from the NCBI taxonomy database (https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html):

nematode_draft.fa,Eukaryota,Opisthokonta,Metazoa

It seems to have worked. So, if that really is what you have in mind, it might be good to add this option to the documentation for charcoal...
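
Before a long run, it may help to sanity-check the lineage file. The sketch below assumes the layout used in the examples in this thread (genome file name first, then taxon names from broadest to narrowest, comma-separated). check_lineage_rows is a hypothetical helper, not part of charcoal, and it only checks the shape of each row, not whether the names exist in the NCBI taxonomy.

```python
import csv
import io


def check_lineage_rows(text):
    """Return [(genome, lineage_tuple), ...]; raise ValueError on malformed rows.

    Hypothetical helper: assumes 'genome,taxon1,taxon2,...' rows as in this thread.
    """
    rows = []
    for lineno, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue  # csv.reader yields [] for blank lines; skip them
        genome, *lineage = [field.strip() for field in row]
        if not genome or not lineage or any(not name for name in lineage):
            raise ValueError(
                f"line {lineno}: expected a genome name plus non-empty taxon names, got {row!r}"
            )
        rows.append((genome, tuple(lineage)))
    return rows


example = "nematode_draft.fa,Eukaryota,Opisthokonta,Metazoa\n"
for genome, lineage in check_lineage_rows(example):
    print(genome, "->", " > ".join(lineage))
```

An empty field (for instance two commas in a row) would shift every later name to a different level, so the sketch treats it as an error rather than guessing.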

ctb commented 3 years ago

excellent! let's leave this open for now, as it is currently the best documentation available for this aspect of things :)

SchwarzEM commented 3 years ago

"Peace through superior Github ticketage"

dib-lab / charcoal, issue #197