epi2me-labs / wf-metagenomics

Metagenomic classification of long-read sequencing data
Other
58 stars 23 forks source link

[Bug]: Error: unknown lineage Lu #19

Closed thecorz closed 1 year ago

thecorz commented 1 year ago

What happened?

We've just hit the error below. After some diagnosing we've noticed that aggregate_lineages_bracken.py is hitting a line in lineages.txt that contains only one string: Lu. This produces an error when the function update_or_create_count() (see below) tries to unpack it because it's expecting three tab separated fields instead of one string https://github.com/epi2me-labs/wf-metagenomics/blob/master/bin/aggregate_lineages.py#L47

We also noticed that lineages.txt is created by taxonkit using the database ncbi_16s_18s_28s_ITS_kraken2.tar.gz. So we believe that the culprit is a special character somewhere in the taxo.t2k file of that database. Specifically, we think in our case the problem is an organism called " 'Massilia aquatica' Lu et al. 2020 ". We think the single quotation marks are the issue.

Operating System

ubuntu 20.04

Workflow Execution

EPI2ME Labs desktop application

Workflow Execution - EPI2ME Labs Versions

No response

Workflow Execution - Execution Profile

No response

Workflow Version

v2.0.1

Relevant log output

Error executing process > 'kraken_pipeline:bracken (1)'

Caused by:
  Process `kraken_pipeline:bracken (1)` terminated with an error exit status (1)

Command executed:

  run_bracken.py         "database_dir"         "reports.1/8_CB541_substrate_Qiagen_40.kreport.txt"         "1000"         "S"         "8_CB541_substrate_Qiagen_40.bracken_report.txt"
  mv "reports.1/8_CB541_substrate_Qiagen_40.kreport_bracken_species.txt" .
  awk '{ print $3,$7}' "8_CB541_substrate_Qiagen_40.bracken_report.txt" |  awk 'NR!=1 {print}' > taxacounts.txt
  awk '{print $3}' "8_CB541_substrate_Qiagen_40.bracken_report.txt" |  awk 'NR!=1 {print}' > taxa.txt
  taxonkit         --data-dir taxonomy_dir         lineage -R taxa.txt  > lineages.txt
  aggregate_lineages_bracken.py -i "lineages.txt" -b "taxacounts.txt" -p "8_CB541_substrate_Qiagen_40.kraken2"
  file1=`cat *.json`
  echo "{"'"8_CB541_substrate_Qiagen_40"'": "$file1"}" >> "8_CB541_substrate_Qiagen_40.1.json"
  cp "8_CB541_substrate_Qiagen_40.1.json" "reports.1/8_CB541_substrate_Qiagen_40.json"

Command exit status:
  1

Command output:
  b' >> Checking for Valid Options...\n >> Running Bracken \n      >> python src/est_abundance.py -i reports.1/8_CB541_substrate_Qiagen_40.kreport.txt -o 8_CB541_substrate_Qiagen_40.bracken_report.txt -k database_dir/database1000mers.kmer_distrib -l S -t 0\nPROGRAM START TIME: 11-11-2022 18:39:50\nBRACKEN SUMMARY (Kraken report: reports.1/8_CB541_substrate_Qiagen_40.kreport.txt)\n    >>> Threshold: 0 \n    >>> Number of species in sample: 203 \n\t  >> Number of species with reads > threshold: 203 \n\t  >> Number of species with reads < threshold: 0 \n    >>> Total reads in sample: 4308\n\t  >> Total reads kept at species level (reads > threshold): 751\n\t  >> Total reads discarded (species reads < threshold): 0\n\t  >> Reads distributed: 174\n\t  >> Reads not distributed (eg. no species above threshold): 328\n\t  >> Unclassified reads: 3056\nBRACKEN OUTPUT PRODUCED: 8_CB541_substrate_Qiagen_40.bracken_report.txt\nPROGRAM END TIME: 11-11-2022 18:39:50\n  Bracken complete.\n'Error: unknown lineage Lu       

Command error:
  b'>> Checking report file: reports.1/8_CB541_substrate_Qiagen_40.kreport.txt\n'

Work dir:
  /home/concertbio/epi2melabs-data/nextflow/instances/2022-11-11-18-38_wf-metagenomics_kFhzGMeFDsQWUdWWiJrs2U/work/f0/beb156eb6a8d2c4ad046e6ced954b5

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
eparisis commented 1 year ago

I was just about to post this issue. I am having the same problem.

I noticed that the Kraken2 classification might cause the problem. Looking into the .kreport produced by Kraken, it seems like if the taxonomy rank is incorrect in the next step, the taxa.txt gets the wrong values and taxonkit can't do its thing.

For example: in the .kreport I got: 0.2110 1 1 S 2069320 Capsicum annuum amalgavirus 1 since it is reported as "S" for species in the next step of the workflow I get an error code about "amalgavirus"

but if its reported as subspecies by "S1" like: 0.2110 1 1 S1 166122 Human endogenous retrovirus K113 it runs fine.

So I guess in the case of "S" classification it assumes only a string with two words so to say and flags the rest as separate taxa which can then not be found by taxonkit.

Did you find any workaround so far for this issue?

thecorz commented 1 year ago

I just had a look at this again. I tried to find where the species name is parsed in Bracken source code, to see if I they split it somehow. https://github.com/jenniferlu717/Bracken/blob/88b77381684e4c293f472a9ad2a7f100d6fefba4/src/est_abundance.py#L133 But from what I can see, they parse it whole from a .tsv file, and I can't see that the species name gets split anywhere.

thecorz commented 1 year ago

My next step is to clone this repo, and then fix the database record that is causing the problem. I'll run the pipeline with nextflow with the fixed database.

eparisis commented 1 year ago

First of all thanks for looking into it but I don't think the problem has to do with the bracken code itself. I believe it has to do with the pipeline code.

I haven't had the chance to dig deeper into it but I think this code here has something to do with it: https://github.com/epi2me-labs/wf-metagenomics/blob/e5dcad0be7d262721b2ca796aff1b8afb6a5d0e2/subworkflows/kraken_pipeline.nf#L140

The problem starts at the taxa.txt where there should only be taxids and not names which is the case in our runs.

This is the taxa.txt and lineage.txt from my output where the pipeline stops: taxa.txt

10677
10678
1985738
SfIV
virus
retrovirus
10847

lineages.txt

10677   Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Muvirus;Muvirus mu  superkingdom;clade;kingdom;phylum;class;genus;species
10678   Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Punavirus;Punavirus P1  superkingdom;clade;kingdom;phylum;class;genus;species
1985738 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Autographiviridae;Studiervirinae;Teseptimavirus;Teseptimavirus T7   superkingdom;clade;kingdom;phylum;class;family;subfamily;genus;species
SfIV        
virus       
retrovirus      
10847   Viruses;Monodnaviria;Sangervirae;Phixviricota;Malgrandaviricetes;Petitvirales;Microviridae;Bullavirinae;Sinsheimervirus;Sinsheimervirus phiX174 superkingdom;clade;kingdom;phylum;class;order;family;subfamily;genus;species

and these are those entries from the .bracken_report.txt


name    taxonomy_id taxonomy_lvl    kraken_assigned_reads   added_reads new_est_reads   fraction_total_reads
Muvirus mu  10677   S   21  0   21  0.35593
Punavirus P1    10678   S   23  0   23  0.38983
Teseptimavirus T7   1985738 S   2   0   2   0.03390
Shigella phage SfIV 1407493 S   2   0   2   0.03390
BeAn 58058 virus    67082   S   9   0   9   0.15254
Human endogenous retrovirus K   45617   S   1   0   1   0.01695
Sinsheimervirus phiX174 10847   S   1   0   1   0.01695

I mean it's obvious that if in the name column has more than 2 words it's not filtered right by this awk pipe.

Any thoughts on this?

cjw85 commented 1 year ago

I believe this issue is the same or related to https://github.com/epi2me-labs/wf-metagenomics/issues/16

We have fixed the awk commands in our development branch to use explicitely tab delimitation. We are aiming to release an update on Wednesday.

eparisis commented 1 year ago

Wow yeah, that fixed it. Thanks a lot!

thecorz commented 1 year ago

yes, @eparisis you are right, that makes much more sense @cjw85 thanks a lot for the quick reply!

sarahjeeeze commented 1 year ago

We have now updated this in the latest release.

nggvs commented 1 year ago

Hi, Thank you for using the workflow. Could you confirm if this issue has been solved? We'll close this ticket on the assumption things are now resolved.

nggvs commented 1 year ago

Closing due to lack of response.