Open SchwarzEM opened 3 years ago
yep! you can do one of two things --
first, add as many more taxonomic levels as you like, e.g.
nematode_draft.fa,d__Eukaryota,phylum_a,class_a,order_a,family_a,genus_a,species_a
If you ask for filtering at (e.g.) class, then the Right Thing will happen. You will need to provide class-level taxonomic ranks for all of the organisms in the reference db as well, of course.
second, you can just make up a new superkingdom for your nematode_draft, and it will flag anything that doesn't match that superkingdom as contaminants. But you have to make sure that there are no legitimate sequences in the reference database that belong to the d__Eukaryota superkingdom, because they would then be flagged as problematic.
(I know this isn't super clear and it's not easy to convey the mental model, so let me try this alternate approach: everything is based on string matching across taxonomic levels, so if you can get the strings/levels right, you have a lot of flexibility at your fingertips.)
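To make the string-matching mental model concrete, here is a minimal Python sketch. It is not charcoal's actual code: the function names `parse_lineage_row` and `is_flagged`, the `depth` parameter, and the reference lineages are all made up for illustration. It only shows the idea that two lineages are compared level by level, by exact string equality, down to the rank you filter at.

```python
def parse_lineage_row(row):
    """Split a 'genome,level1,level2,...' row into (genome, lineage tuple)."""
    fields = [field.strip() for field in row.strip().split(",")]
    return fields[0], tuple(fields[1:])


def is_flagged(genome_lineage, match_lineage, depth):
    """Return True if the two lineages differ anywhere in the first `depth` levels.

    Comparison is exact string equality, level by level. Both lineages must
    provide at least `depth` levels, mirroring the requirement that the
    reference database carries the rank being filtered on.
    """
    if len(genome_lineage) < depth or len(match_lineage) < depth:
        raise ValueError("both lineages need at least %d levels" % depth)
    return genome_lineage[:depth] != match_lineage[:depth]


genome, lineage = parse_lineage_row("nematode_draft.fa,d__Eukaryota,phylum_a,class_a")

# Hypothetical reference lineages, invented for illustration.
fungus = ("d__Eukaryota", "phylum_b", "class_b")
bacterium = ("d__Bacteria", "phylum_c", "class_c")

print(is_flagged(lineage, bacterium, 1))  # differs at the first level
print(is_flagged(lineage, fungus, 1))     # same first level, so not flagged
print(is_flagged(lineage, fungus, 2))     # differs at the second level
```

Under this model the names themselves carry no meaning. Only whether the strings agree at each level matters, which is why made-up levels behave the same way as real ones.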
The basic idea of "add as many more taxonomic levels as you like" sounds both reasonable and clear. For instance, writing:
nematode_draft.fa,d__Eukaryota,Metazoa_a
might be perfect for what we need here.
I do have two questions, though.
Why is it that the very first phylogenetic term is prefixed "d__", whereas the subsequent terms are apparently all supposed to be suffixed with "_a"?
Other than by guesswork, is there some way to get a correct controlled vocabulary with additional and more granular taxa that I can add? I can add ",Metazoa_a" through sheer guesswork, and that may well work just fine; but is there any source for a directed acyclic graph of appropriate phylogenetic terms which I can look up?
Using NCBI's taxonomy browser:
https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html
I've tried running the following *.csv file:
nematode_draft.fa,d__Eukaryota,Opisthokonta_a,Metazoa_a
I don't know whether this particular file actually gets interpreted correctly by charcoal, but charcoal at least did not flatly refuse to run when given this file...
1. Why is it that the very first phylogenetic term is prefixed "d__", whereas the subsequent terms are apparently all supposed to be suffixed with "_a"?
Oh, that's arbitrary. Exact string matching is used. I put the d__ in there just because GTDB already uses d__Bacteria and d__Archaea.
_a is entirely arbitrary. You can use real taxonomic levels as you wish, but you can also make them up :)
2. Other than by guesswork, is there some way to get a correct controlled vocabulary with additional and more granular taxa that I can add? I can add ",Metazoa_a" through sheer guesswork, and that may well work just fine; but is there any source for a directed acyclic graph of appropriate phylogenetic terms which I can look up?
NCBI should work fine! No _a etc needed.
best, --titus
I've just done a run using a *.csv like this, with no "d__" or "_a" legacy text whatsoever, and with phylogenetic terms taken straight from the NCBI taxonomy database (https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html):
nematode_draft.fa,Eukaryota,Opisthokonta,Metazoa
It seems to have worked. So, if that really is what you have in mind, it might be good to add this option to the documentation for charcoal...
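For anyone building such a file for several genomes, the row format used above (genome filename first, then taxon names from broadest to narrowest) could be generated with something like the following sketch. The helper name `lineage_row` is hypothetical and is not part of charcoal. The sketch only reproduces the row shown in this comment and makes no claim about how charcoal parses it.

```python
import csv
import io


def lineage_row(genome_filename, taxon_names):
    """Return one CSV line: the genome filename followed by its lineage names."""
    names = [name.strip() for name in taxon_names]
    if not all(names):
        raise ValueError("every taxonomic level needs a non-empty name")
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="").writerow([genome_filename] + names)
    return buffer.getvalue()


# Names copied by hand from the NCBI taxonomy browser, broadest first.
print(lineage_row("nematode_draft.fa", ["Eukaryota", "Opisthokonta", "Metazoa"]))
```

Since matching is by exact string, the same spelling has to be used for the reference database lineages as well.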
excellent! let's leave this open for now, as it is currently the best documentation available for this aspect of things :)
"Peace through superior Github ticketage"
This is related to issue #189 raised earlier.
My student Vladislav and I are using charcoal to decontaminate draft genome assemblies of nematodes. charcoal is wondrously efficient at ridding such assemblies of prokaryotic scaffolds/contigs, if the *.conf file provided to it is in this format:
nematode_draft.fa,d__Eukaryota
However, this retains all 'eukaryota' in the draft assembly, including unicellular eukaryotes such as fungi and oomycetes that it would actually be desirable to decontaminate. An older approach to decontamination involving sourmash (https://github.com/sourmash-bio/sourmash/issues/940) was able to flag such unicellular eukaryotic contaminants, so it would be desirable if charcoal also supported such decontamination.
Is there any straightforward way to make this happen? E.g., by using some taxon in the *.conf file other than d__Eukaryota?