Closed thecorz closed 1 year ago
I was just about to post this issue. I am having the same problem.
I noticed that the Kraken2 classification might cause the problem. Looking into the .kreport produced by Kraken, it seems like if the taxonomy rank is incorrect in the next step, the taxa.txt gets the wrong values and taxonkit can't do its thing.
For example:
in the .kreport I got:
0.2110 1 1 S 2069320 Capsicum annuum amalgavirus 1
since it is reported as "S" for species in the next step of the workflow I get an error code about "amalgavirus"
but if its reported as subspecies by "S1" like:
0.2110 1 1 S1 166122 Human endogenous retrovirus K113
it runs fine.
So I guess in the case of "S" classification it assumes only a string with two words so to say and flags the rest as separate taxa which can then not be found by taxonkit.
Did you find any workaround so far for this issue?
I just had a look at this again. I tried to find where the species name is parsed in Bracken source code, to see if I they split it somehow. https://github.com/jenniferlu717/Bracken/blob/88b77381684e4c293f472a9ad2a7f100d6fefba4/src/est_abundance.py#L133 But from what I can see, they parse it whole from a .tsv file, and I can't see that the species name gets split anywhere.
My next step is to clone this repo, and then fix the database record that is causing the problem. I'll run the pipeline with nextflow with the fixed database.
First of all thanks for looking into it but I don't think the problem has to do with the bracken code itself. I believe it has to do with the pipeline code.
I haven't had the chance to dig deeper into it but I think this code here has something to do with it: https://github.com/epi2me-labs/wf-metagenomics/blob/e5dcad0be7d262721b2ca796aff1b8afb6a5d0e2/subworkflows/kraken_pipeline.nf#L140
The problem starts at the taxa.txt where there should only be taxids and not names which is the case in our runs.
This is the taxa.txt and lineage.txt from my output where the pipeline stops: taxa.txt
10677
10678
1985738
SfIV
virus
retrovirus
10847
lineages.txt
10677 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Muvirus;Muvirus mu superkingdom;clade;kingdom;phylum;class;genus;species
10678 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Punavirus;Punavirus P1 superkingdom;clade;kingdom;phylum;class;genus;species
1985738 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Autographiviridae;Studiervirinae;Teseptimavirus;Teseptimavirus T7 superkingdom;clade;kingdom;phylum;class;family;subfamily;genus;species
SfIV
virus
retrovirus
10847 Viruses;Monodnaviria;Sangervirae;Phixviricota;Malgrandaviricetes;Petitvirales;Microviridae;Bullavirinae;Sinsheimervirus;Sinsheimervirus phiX174 superkingdom;clade;kingdom;phylum;class;order;family;subfamily;genus;species
and these are those entries from the .bracken_report.txt
name taxonomy_id taxonomy_lvl kraken_assigned_reads added_reads new_est_reads fraction_total_reads
Muvirus mu 10677 S 21 0 21 0.35593
Punavirus P1 10678 S 23 0 23 0.38983
Teseptimavirus T7 1985738 S 2 0 2 0.03390
Shigella phage SfIV 1407493 S 2 0 2 0.03390
BeAn 58058 virus 67082 S 9 0 9 0.15254
Human endogenous retrovirus K 45617 S 1 0 1 0.01695
Sinsheimervirus phiX174 10847 S 1 0 1 0.01695
I mean it's obvious that if in the name column has more than 2 words it's not filtered right by this awk pipe.
Any thoughts on this?
I believe this issue is the same or related to https://github.com/epi2me-labs/wf-metagenomics/issues/16
We have fixed the awk commands in our development branch to use explicitely tab delimitation. We are aiming to release an update on Wednesday.
Wow yeah, that fixed it. Thanks a lot!
yes, @eparisis you are right, that makes much more sense @cjw85 thanks a lot for the quick reply!
We have now updated this in the latest release.
Hi, Thank you for using the workflow. Could you confirm if this issue has been solved? We'll close this ticket on the assumption things are now resolved.
Closing due to lack of response.
What happened?
We've just hit the error below. After some diagnosing we've noticed that aggregate_lineages_bracken.py is hitting a line in
lineages.txt
that contains only one string:Lu
. This produces an error when the functionupdate_or_create_count()
(see below) tries to unpack it because it's expecting three tab separated fields instead of one string https://github.com/epi2me-labs/wf-metagenomics/blob/master/bin/aggregate_lineages.py#L47We also noticed that lineages.txt is created by taxonkit using the database
ncbi_16s_18s_28s_ITS_kraken2.tar.gz
. So we believe that the culprit is a special character somewhere in thetaxo.t2k
file of that database. Specifically, we think in our case the problem is an organism called " 'Massilia aquatica' Lu et al. 2020 ". We think the single quotation marks are the issue.Operating System
ubuntu 20.04
Workflow Execution
EPI2ME Labs desktop application
Workflow Execution - EPI2ME Labs Versions
No response
Workflow Execution - Execution Profile
No response
Workflow Version
v2.0.1
Relevant log output