gbouras13 / pypolca

Standalone Python re-implementation of the POLCA polisher from MaSuRCA
MIT License

ValueError in utils/report.py #9

Closed oschwengers closed 10 months ago

oschwengers commented 11 months ago

Hi @gbouras13, I just stumbled over this error while polishing 4 genomes. For 1 genome I get a Python ValueError. Here is the entire log:

2023-11-20 16:39:48.957 | INFO     | pypolca.utils.validation:instantiate_dirs:23 - Checking the output directory output
2023-11-20 16:39:48.973 | INFO     | pypolca.utils.util:begin_pypolca:84 - pypolca: Standalone Python implementation of the POLCA polisher from MaSuRCA
2023-11-20 16:39:48.974 | INFO     | pypolca.utils.util:begin_pypolca:87 - You are using pypolca version 0.2.0
2023-11-20 16:39:48.974 | INFO     | pypolca.utils.util:begin_pypolca:88 - Repository homepage is https://github.com/gbouras13/pypolca.
2023-11-20 16:39:48.974 | INFO     | pypolca.utils.util:begin_pypolca:89 - Written by George Bouras: george.bouras@adelaide.edu.au adapting the original POLCA code by Aleksey Zimin.
2023-11-20 16:39:48.974 | INFO     | pypolca.utils.util:begin_pypolca:93 - Listing input parameters.
2023-11-20 16:39:48.974 | INFO     | pypolca.utils.util:begin_pypolca:95 - Parameter: --assembly assembly.fna.
2023-11-20 16:39:48.974 | INFO     | pypolca.utils.util:begin_pypolca:95 - Parameter: --reads1 R1.fastq.gz.
2023-11-20 16:39:48.974 | INFO     | pypolca.utils.util:begin_pypolca:95 - Parameter: --reads2 R2.fastq.gz.
2023-11-20 16:39:48.975 | INFO     | pypolca.utils.util:begin_pypolca:95 - Parameter: --output output.
2023-11-20 16:39:48.975 | INFO     | pypolca.utils.util:begin_pypolca:95 - Parameter: --threads 2.
2023-11-20 16:39:48.975 | INFO     | pypolca.utils.util:begin_pypolca:95 - Parameter: --force False.
2023-11-20 16:39:48.975 | INFO     | pypolca.utils.util:begin_pypolca:95 - Parameter: --memory_limit 2G.
2023-11-20 16:39:48.975 | INFO     | pypolca.utils.util:begin_pypolca:95 - Parameter: --no_polish False.
2023-11-20 16:39:48.975 | INFO     | pypolca.utils.util:begin_pypolca:95 - Parameter: --prefix polca.
2023-11-20 16:39:49.293 | INFO     | pypolca.utils.validation:check_dependencies:112 - Samtools v1.18 found.
2023-11-20 16:39:49.333 | INFO     | pypolca.utils.validation:check_dependencies:123 - freebayes v1.3.6 found.
2023-11-20 16:39:49.377 | INFO     | pypolca.utils.validation:check_dependencies:140 - bwa v0.7.17-r1188 found.
2023-11-20 16:39:49.378 | INFO     | pypolca.utils.validation:validate_fasta:46 - Checking that the input file assembly.fna is in FASTA format.
2023-11-20 16:39:49.630 | INFO     | pypolca.utils.validation:validate_fasta:51 - assembly.fna file checked.
2023-11-20 16:39:49.666 | INFO     | pypolca.utils.validation:validate_fastq:73 - FASTQ R1.fastq.gz checked
2023-11-20 16:39:49.700 | INFO     | pypolca.utils.validation:validate_fastq:73 - FASTQ R2.fastq.gz checked
2023-11-20 16:39:49.700 | INFO     | pypolca:run:194 - Checking memory limit of 2G.
2023-11-20 16:39:49.700 | INFO     | pypolca:run:199 - Creating BWA index
2023-11-20 16:39:49.706 | INFO     | pypolca.utils.external_tools:run:50 - Started running bwa index output/temp/assembly.fasta ...
2023-11-20 16:39:50.894 | INFO     | pypolca.utils.external_tools:run:52 - Done running bwa index output/temp/assembly.fasta
2023-11-20 16:39:50.894 | INFO     | pypolca:run:205 - Aligning reads with BWA
2023-11-20 16:39:50.896 | INFO     | pypolca.utils.external_tools:run_to_stdout:59 - Started running bwa mem -SP -t 2 output/temp/assembly.fasta R1.fastq.gz R2.fastq.gz ...
2023-11-20 16:41:51.521 | INFO     | pypolca.utils.external_tools:run_to_stdout:61 - Done running bwa mem -SP -t 2 output/temp/assembly.fasta R1.fastq.gz R2.fastq.gz
2023-11-20 16:41:51.525 | INFO     | pypolca:run:212 - Sorting and indexing alignment file
2023-11-20 16:41:51.527 | INFO     | pypolca.utils.external_tools:run:50 - Started running samtools view -h -@ 2 -b output/temp/temp_bwa.sam -o output/temp/temp_bwa.bam ...
2023-11-20 16:42:07.591 | INFO     | pypolca.utils.external_tools:run:52 - Done running samtools view -h -@ 2 -b output/temp/temp_bwa.sam -o output/temp/temp_bwa.bam
2023-11-20 16:42:07.594 | INFO     | pypolca.utils.external_tools:run:50 - Started running samtools sort -m 2G -@ 2 output/temp/temp_bwa.bam -o output/temp/temp_bwa_sorted.bam ...
2023-11-20 16:42:22.742 | INFO     | pypolca.utils.external_tools:run:52 - Done running samtools sort -m 2G -@ 2 output/temp/temp_bwa.bam -o output/temp/temp_bwa_sorted.bam
2023-11-20 16:42:22.744 | INFO     | pypolca.utils.external_tools:run:50 - Started running samtools index -@ output/temp/temp_bwa_sorted.bam ...
2023-11-20 16:42:22.747 | INFO     | pypolca.utils.external_tools:run:52 - Done running samtools index -@ output/temp/temp_bwa_sorted.bam
2023-11-20 16:42:22.747 | INFO     | pypolca:run:219 - Calling variants.
2023-11-20 16:42:22.748 | INFO     | pypolca.utils.external_tools:run:50 - Started running samtools faidx output/temp/assembly.fasta ...
2023-11-20 16:42:22.764 | INFO     | pypolca.utils.external_tools:run:52 - Done running samtools faidx output/temp/assembly.fasta
2023-11-20 16:42:22.765 | INFO     | pypolca.utils.external_tools:run:50 - Started running freebayes -m 0 --min-coverage 3 -R 0 -p 1 -F 0.2 -E 0 -b output/temp/temp_bwa_sorted.bam -f output/temp/assembly.fasta -v output/polca.vcf ...
2023-11-20 16:43:40.341 | INFO     | pypolca.utils.external_tools:run:52 - Done running freebayes -m 0 --min-coverage 3 -R 0 -p 1 -F 0.2 -E 0 -b output/temp/temp_bwa_sorted.bam -f output/temp/assembly.fasta -v output/polca.vcf
2023-11-20 16:43:40.675 | INFO     | pypolca.utils.fix_consensus_from_vcf:fix_consensus_from_vcf:120 - POLCA has found variants. Fixing
Traceback (most recent call last):
  File "/vol/cb/projects/baktflow/conda/polish-short-pypolca/bin/pypolca", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/vol/cb/projects/baktflow/no_backup/conda/polish-short-pypolca/lib/python3.12/site-packages/pypolca/__init__.py", line 261, in main
    main_cli()
  File "/vol/cb/projects/baktflow/no_backup/conda/polish-short-pypolca/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vol/cb/projects/baktflow/no_backup/conda/polish-short-pypolca/lib/python3.12/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/vol/cb/projects/baktflow/no_backup/conda/polish-short-pypolca/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vol/cb/projects/baktflow/no_backup/conda/polish-short-pypolca/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vol/cb/projects/baktflow/no_backup/conda/polish-short-pypolca/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vol/cb/projects/baktflow/no_backup/conda/polish-short-pypolca/lib/python3.12/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vol/cb/projects/baktflow/no_backup/conda/polish-short-pypolca/lib/python3.12/site-packages/pypolca/__init__.py", line 234, in run
    create_report(vcf, assembly_temp, report_file)
  File "/vol/cb/projects/baktflow/no_backup/conda/polish-short-pypolca/lib/python3.12/site-packages/pypolca/utils/report.py", line 38, in create_report
    if int(parts[3]) == 0 and int(parts[5]) > 1:
                              ^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: '11,19'

I haven't gone down the VCF rabbit hole, but it looks like https://github.com/gbouras13/pypolca/blob/b1d77e8e255e1cc5e2b4945bc57b34f4090a0927/src/pypolca/utils/report.py#L38C17-L38C61 is either picking up the wrong ':'-separated field, or it is currently not expected that this field can hold a list of values?

Unfortunately, I cannot take a deeper look into this myself right now, but at least, I wanted to let you know... Thanks and best regards

gbouras13 commented 11 months ago

Hi @oschwengers ,

Very interesting - have you tried POLCA (and does it work for that)?

I've tested pypolca on hundreds of samples (well at least via Hybracter) and never come across this.

I'm happy to have a look at this but I will need the reads and assembly - george.bouras@adelaide.edu.au would be best.

George

oschwengers commented 11 months ago

Hi, no, I haven't tried the original POLCA yet. Weirdly enough, a 2nd run finished w/o any error at all. As I cannot exclude any unrelated side effects, I'll close this for now. If I face this sort of error in a reproducible way, I'll re-open this issue.

Sorry for the rash issue. Oliver

npbhavya commented 10 months ago

Running into an error

    return ctx.invoke(self.callback, **ctx.params)
  File "/home/nala0006/miniconda3/envs/hybracter/lib/python3.12/site-packages/hybracter/workflow/conda/ac7d25bd198f954837b82286b3741471_/lib/python3.10/site-packages/click/$
    return __callback(*args, **kwargs)
  File "/home/nala0006/miniconda3/envs/hybracter/lib/python3.12/site-packages/hybracter/workflow/conda/ac7d25bd198f954837b82286b3741471/lib/python3.10/site-packages/click/$
    return f(get_current_context(), *args, **kwargs)
  File "/home/nala0006/miniconda3/envs/hybracter/lib/python3.12/site-packages/hybracter/workflow/conda/ac7d25bd198f954837b82286b3741471/lib/python3.10/site-packages/pypolc$
    create_report(vcf, assembly_temp, report_file)
  File "/home/nala0006/miniconda3/envs/hybracter/lib/python3.12/site-packages/hybracter/workflow/conda/ac7d25bd198f954837b82286b3741471/lib/python3.10/site-packages/pypolc$
    if int(parts[3]) == 0 and int(parts[5]) > 1:
ValueError: invalid literal for int() with base 10: '80,44'
================================================================================

gbouras13 commented 10 months ago

Hi @oschwengers @npbhavya ,

This was definitely a bug in Pypolca after all.

What happened is that, when writing the final report, if a line in the VCF had multiple alleles, then parts[5] would be a non-integer value like 80,44 and would break line 38 in report.py:

if int(parts[3]) == 0 and int(parts[5]) > 1:

I've added exception handling and it should be good to go now - please upgrade to v0.2.1.
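For anyone hitting this later, here is a minimal sketch of the kind of guard described above. It is not the actual v0.2.1 patch. It assumes `parts` comes from splitting the colon-separated sample column of a VCF record, with a reference count at index 3 and alternate counts at index 5, as the traceback suggests. The function names, the choice to take the maximum alternate count, and the sample strings in the usage note are hypothetical.

```python
def parse_counts(field: str) -> list[int]:
    """Parse a count value that is either one integer or a comma-separated list."""
    return [int(value) for value in field.split(",")]


def has_alt_support_only(sample_field: str) -> bool:
    """Return True if no reads support the reference and some alternate
    allele is supported by more than one read.

    Assumption: sample_field is the colon-separated sample column, with the
    reference count at index 3 and the alternate count(s) at index 5.
    """
    parts = sample_field.split(":")
    try:
        ref_count = int(parts[3])
        alt_counts = parse_counts(parts[5])
    except (IndexError, ValueError):
        # Missing or non-numeric values (for example "."): do not count the site.
        return False
    # Assumption: for multi-allelic sites, the best-supported alternate decides.
    return ref_count == 0 and max(alt_counts) > 1
```

With this, a made-up multi-allelic sample field such as `1:30:0,11,19:0:0:11,19:400,700:0,0,0` is handled without raising, because `11,19` is parsed as two counts rather than passed to `int()` directly.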

George

oschwengers commented 10 months ago

Thanks @gbouras13 for taking care of this and for the quick fix! I'm very much looking forward to testing PyPOLCA further.