Sydney-Informatics-Hub / ONT-bacpac-nf

Bacterial profiling workflow for ONT data, written in Nextflow.
GNU General Public License v3.0

`create_phylogenytree_related_files.py` KeyError #55

Closed fredjaya closed 1 month ago

fredjaya commented 1 month ago

Describe the bug

The create_phylogeny_tree_related_files process fails with a KeyError when the Python script looks up a sample ID that is missing from a dictionary.

To Reproduce

Steps to reproduce the behavior:

  1. Running on https://github.com/Sydney-Informatics-Hub/ONT-bacpac-nf/commit/d1269fde4b4858b6803663281aa0572083f4b719
  2. Running with samplesheet on barcodes: 01, 03, 10, 12, 13, 14, 15
  3. Running with -profile gadi (no high-accuracy)

The .command.sh run:

create_phylogenytree_related_files.py \
  refseq_summary.txt \
  barcode03.k2report barcode01.k2report barcode13.k2report barcode10.k2report barcode12.k2report barcode15.k2report barcode14.k2report \
  barcode03_bakta barcode10_bakta barcode01_bakta

Error produced:

Traceback (most recent call last):
  File "/scratch/er01/fj9712/ONT-bacpac-nf_wt/issue-21/bin/create_phylogenytree_related_files.py", line 169, in <module>
    main()
  File "/scratch/er01/fj9712/ONT-bacpac-nf_wt/issue-21/bin/create_phylogenytree_related_files.py", line 153, in main
    present_species = sampleID_species_dic[present_sampleid]
KeyError: 'barcode15.k2report'

Expected behavior

Environment:

Additional context

The barcode_species_table_mqc.txt produced by the same Nextflow process is missing barcodes 14 and 15. Need to check whether this is related.

sampleID        Species
barcode03       Vibrio_campbellii
barcode01       Vibrio_harveyi
barcode13       Tenacibaculum_mesophilum
barcode10       Tenacibaculum_mesophilum
barcode12       Tenacibaculum_mesophilum

This may be an edge case for barcode15 that went undiscovered because it was failing in upstream processes (#21).

fredjaya commented 1 month ago

The error is in how the arguments are parsed in main().

The arguments passed to the Python script are the kraken and bakta files for all samples. In this case, 4/7 samples are missing bakta outputs:

all_args = ['barcode03.k2report', 'barcode01.k2report', 'barcode13.k2report', 'barcode10.k2report', 'barcode12.k2report', 'barcode15.k2report', 'barcode14.k2report', 'barcode03_bakta', 'barcode10_bakta', 'barcode01_bakta']

Parsing of the k2 reports and bakta results assumes that all inputs exist, or at least that the number of bakta and k2 files is exactly equal:

    kraken2_reports = all_args[:half_way]
    bakta_results = all_args[half_way:]
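
To illustrate why this breaks, here is a minimal sketch using the argument list from this run. It assumes half_way is computed as len(all_args) // 2, which is not shown above, so treat it as a guess at the script's behaviour. Under that assumption, barcode15.k2report and barcode14.k2report land in the bakta list, which would be consistent with both the KeyError and the two barcodes missing from barcode_species_table_mqc.txt.

```python
# Arguments as reported in this issue (everything after refseq_summary.txt).
all_args = [
    "barcode03.k2report", "barcode01.k2report", "barcode13.k2report",
    "barcode10.k2report", "barcode12.k2report", "barcode15.k2report",
    "barcode14.k2report",
    "barcode03_bakta", "barcode10_bakta", "barcode01_bakta",
]

# Assumption: the script derives the split point from the total argument count.
half_way = len(all_args) // 2

# The positional split from main(), reproduced as quoted above.
kraken2_reports = all_args[:half_way]
bakta_results = all_args[half_way:]

# Kraken2 reports that were wrongly treated as bakta results.
misplaced = [arg for arg in bakta_results if arg.endswith(".k2report")]

print("kraken2_reports:", kraken2_reports)
print("bakta_results:  ", bakta_results)
print("misplaced:      ", misplaced)
```

With 7 k2 reports and 3 bakta directories the split point is 5, so only the first five reports are treated as Kraken2 inputs.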

Solutions:

  1. Check why bakta files are missing
  2. Parse arguments by name, e.g. with re, instead of by their sys.argv position
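
A rough sketch of solution 2, assuming the inputs keep the naming seen in this run (`<sample>.k2report` and `<sample>_bakta`). The function name split_inputs and the patterns are hypothetical and not taken from the existing script. It groups arguments by sample ID rather than by position, so unequal numbers of k2 and bakta inputs no longer shift the split.

```python
import os
import re

# Assumed naming conventions, based on the file names in this issue.
K2_PATTERN = re.compile(r"^(?P<sample>.+)\.k2report$")
BAKTA_PATTERN = re.compile(r"^(?P<sample>.+)_bakta$")


def split_inputs(args):
    """Group inputs by sample ID using their names, not their position."""
    kraken2_reports = {}  # sample ID -> k2 report path
    bakta_results = {}    # sample ID -> bakta directory path
    unrecognised = []     # anything matching neither pattern
    for arg in args:
        name = os.path.basename(arg.rstrip("/"))
        k2_match = K2_PATTERN.match(name)
        bakta_match = BAKTA_PATTERN.match(name)
        if k2_match:
            kraken2_reports[k2_match.group("sample")] = arg
        elif bakta_match:
            bakta_results[bakta_match.group("sample")] = arg
        else:
            unrecognised.append(arg)
    return kraken2_reports, bakta_results, unrecognised


# Usage with the arguments from this issue.
all_args = [
    "barcode03.k2report", "barcode01.k2report", "barcode13.k2report",
    "barcode10.k2report", "barcode12.k2report", "barcode15.k2report",
    "barcode14.k2report",
    "barcode03_bakta", "barcode10_bakta", "barcode01_bakta",
]
k2, bakta, unknown = split_inputs(all_args)

# Samples with both inputs, and samples whose bakta output is missing.
complete = sorted(set(k2) & set(bakta))
missing_bakta = sorted(set(k2) - set(bakta))
print("complete:     ", complete)
print("missing bakta:", missing_bakta)
```

Keying both dictionaries by sample ID also makes it straightforward to report which samples lack bakta output, which may help with solution 1.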