epi2me-labs / wf-metagenomics

Metagenomic classification of long-read sequencing data
Other
55 stars 23 forks source link

[Bug]: error while aggregating kraken report with bracken #24

Closed biofranckus closed 1 year ago

biofranckus commented 1 year ago

What happened?

Hello,

I am encountering a bug with the latest versions of the workflow when aggregating kraken reports with bracken. Something is wrong with the parsing of the kraken report or on the lineage, I can't identify precisely the source of the problem. Maybe one of you can help me.

I specify that I don't have any problem with version v1.1.4

Operating System

ubuntu 20.04

Workflow Execution

Command line

Workflow Execution - EPI2ME Labs Versions

No response

Workflow Execution - Execution Profile

Docker

Workflow Version

v.2.0.3

Relevant log output

nextflow run epi2me-labs/wf-metagenomics --max_len 2000 --min_len 1000 --threads 16 --bracken_level "G" -r v2.0.3 --fastq fastq_pass/ --out_dir output_last_v2.0.3

Error executing process > 'kraken_pipeline:bracken (24)'

Caused by:
  Process `kraken_pipeline:bracken (24)` terminated with an error exit status (1)

Command executed:

  run_bracken.py         "database_dir"         "reports.24/barcode02.kreport.txt"         "1000"         "G"         "barcode02.bracken_report.txt"
  mv "reports.24/barcode02.kreport_bracken_species.txt" . || echo "no bracken report"

  awk -F '(     ' -v OFS='      ' '{ print $2,$6 }' "barcode02.bracken_report.txt"         | awk -F '   ' -v OFS='      ' 'NR!=1 {print}'         | tee taxacounts.txt         | awk -F '' -v OFS='"/home/' '{ print $1 }' > taxa.txt
  taxonkit         --data-dir taxonomy_dir         lineage -R taxa.txt  > lineages.txt

  aggregate_lineages_bracken.py         -i "lineages.txt" -b "taxacounts.txt"         -u "reports.24/barcode02.kreport.txt"         -p "barcode02.kraken2"

  file1=`cat *.json`
  echo "{"'"barcode02"'": "$file1"}" >> "barcode02.24.json"
  cp "barcode02.24.json" "reports.24/barcode02.json"

Command exit status:
  1

Command output:
  b' >> Checking for Valid Options...\n >> Running Bracken \n      >> python src/est_abundance.py -i reports.24/barcode02.kreport.txt -o barcode02.bracken_report.txt -k database_dir/database1000mers.kmer_distrib -l G -t 0\nPROGRAM START TIME: 11-22-2022 18:32:43\nBRACKEN SUMMARY (Kraken report: reports.24/barcode02.kreport.txt)\n    >>> Threshold: 0 \n    >>> Number of genuses in sample: 453 \n\t  >> Number of genuses with reads > threshold: 453 \n\t  >> Number of genuses with reads < threshold: 0 \n    >>> Total reads in sample: 22330\n\t  >> Total reads kept at genuses level (reads > threshold): 17909\n\t  >> Total reads discarded (genuses reads < threshold): 0\n\t  >> Reads distributed: 1407\n\t  >> Reads not distributed (eg. no genuses above threshold): 3014\n\t  >> Unclassified reads: 0\nBRACKEN OUTPUT PRODUCED: barcode02.bracken_report.txt\nPROGRAM END TIME: 11-22-2022 18:32:43\n  Bracken complete.\n'no bracken report

Command error:
  b'>> Checking report file: reports.24/barcode02.kreport.txt\n'mv: cannot stat 'reports.24/barcode02.kreport_bracken_species.txt': No such file or directory
  18:32:50.221 [WARN] taxid 2787116 was deleted
  Traceback (most recent call last):
    File "/home/administrateur/.nextflow/assets/epi2me-labs/wf-metagenomics/bin/aggregate_lineages_bracken.py", line 41, in update_or_create_count
      tax_id, lineage, ranks = entry.rstrip().split('\t')
  ValueError: not enough values to unpack (expected 3, got 1)

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "/home/administrateur/.nextflow/assets/epi2me-labs/wf-metagenomics/bin/aggregate_lineages_bracken.py", line 165, in <module>
      execute()
    File "/home/administrateur/.nextflow/assets/epi2me-labs/wf-metagenomics/bin/aggregate_lineages_bracken.py", line 156, in execute
      main(
    File "/home/administrateur/.nextflow/assets/epi2me-labs/wf-metagenomics/bin/aggregate_lineages_bracken.py", line 98, in main
      entries = update_or_create_count(line, entries, bracken_counts)
    File "/home/administrateur/.nextflow/assets/epi2me-labs/wf-metagenomics/bin/aggregate_lineages_bracken.py", line 46, in update_or_create_count
      sys.stderr('Error: unknown lineage {}'.format(entry))
  TypeError: '_io.TextIOWrapper' object is not callable

Work dir:
  /home/administrateur/dev/test_wf-metagenomics/work/f7/f306c88c184fafff74910a1f00553a

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

WARN: To render the execution DAG in the required format it is required to install Graphviz -- See http://www.graphviz.org for more info.

## The first lines of the kraken report look fine
head barcode02.kreport.txt
0.0000  0       0       U       0       unclassified
100.0000        22330   0       R       1       root
100.0000        22330   0       R1      131567    cellular organisms
100.0000        22330   34      D       2           Bacteria
66.5741 14866   60      D1      1783272       Terrabacteria group
63.9006 14269   219     P       1239            Firmicutes
58.0430 12961   16      C       186801            Clostridia
57.7564 12897   597     O       186802              Eubacteriales
30.8106 6880    1095    F       186803                Lachnospiraceae
5.7456  1283    110     G       572511                  Blautia
sarahjeeeze commented 1 year ago

Hi, thanks for reporting this error. I am just trying to recreate the error. In the directory work/f7/f306c88c184fafff74910a1f00553a there should be a lineages.txtfile would you be able to share it or check if there are any rows that are just one number? If so I think it was meant to report an unknown lineage here as in its found an entry in the kraken database that for some reason isn't reflected in the tax id database, so it can't find the lineage. I will improve the error message it gives here.

biofranckus commented 1 year ago

Thanks @sarahjeeeze for your reply. Indeed, you're right. The taxid 2787116 that was supposed to be deleted according to the warning message is in the lineages.txt file without associated taxonomy. It could resolve this bug if the deleted taxid was also removed in the lineages.txt

2767893 cellular organisms;Bacteria;Terrabacteria group;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lentilactobacillus no rank;superkingdom;clade;phylum;class;order;family;genus 2787116
2794998 cellular organisms;Bacteria;Proteobacteria;delta/epsilon subdivisions;Deltaproteobacteria;Desulfovibrionales;Desulfovibrionaceae;Maridesulfovibrio no rank;superkingdom;phylum;subphylum;class;order;family;genus

sarahjeeeze commented 1 year ago

Thank you, I will improve the error handling here but also try to update our default database to ensure as many entries as possible are reflected in the tax id db that we are pulling from ncbi. Will let you know once done.

biofranckus commented 1 year ago

Thanks again for considering the problem. Maybe the temporary and faster way to solve this error is to filter out all the lines that didn't have 3 columns in the lineages.txt file to avoid bug during the aggregating process.