cbg-ethz / shorah

Repo for the software suite ShoRAH (Short Reads Assembly into Haplotypes)
GNU General Public License v3.0
41 stars 14 forks source link

Error running Amplicon with diversity option #83

Open XU-Nuo opened 3 years ago

XU-Nuo commented 3 years ago

Shorah Version:

1.99.2

Command to reproduce:

$ shorah amplicon  -b my_sorted_bam.bam -f amplicon.fasta -d

I have trimmed my read length to equal length prior to this step.

Expected Behavior:

The analysis outputs the entropy.pdf and entropy.csv besides everything else when running without the -d option.

Actual Behavior:

The entropy.pdf and entropy.csv are generated, but the software terminates with error:

[mpileup] 1 samples in 1 input files
/opt/conda/envs/python3/lib/python3.9/site-packages/shorah/amplicon.py:255: UserWarning: mpileup column not fully parsed
  warnings.warn('mpileup column not fully parsed')
Traceback (most recent call last):
  File "/opt/conda/envs/python3/bin/shorah", line 14, in <module>
    main()
  File "/opt/conda/envs/python3/lib/python3.9/site-packages/shorah/cli.py", line 210, in main
    args.func(args)
  File "/opt/conda/envs/python3/lib/python3.9/site-packages/shorah/cli.py", line 89, in amplicon_run
    amplicon.main(args)
  File "/opt/conda/envs/python3/lib/python3.9/site-packages/shorah/amplicon.py", line 387, in main
    h = list(open('coverage.txt'))[0]
IndexError: list index out of range

Output from shorah.log

INFO 2021/05/24 13:36:25 cli.py: main() 204:    shorah amplicon  -b my_sorted_bam.bam -f amplicon.fasta -d
INFO 2021/05/24 13:36:25 cli.py: main() 205:    shorah version:1.99.2

INFO 2021/05/24 13:36:25 amplicon.py: main() 352:   shorah amplicon  -b my_sorted_bam.bam -f amplicon.fasta -d
DEBUG 2021/05/24 13:36:25 amplicon.py: run_child() 90:  samtools view my_sorted_bam.bam | cut -f 10 > rl.txt
DEBUG 2021/05/24 13:36:25 amplicon.py: run_child() 99:  Child samtools returned 0
INFO 2021/05/24 13:36:25 amplicon.py: highest_entropy() 229:    n_reads: 500
INFO 2021/05/24 13:36:25 amplicon.py: highest_entropy() 235:    trimmed_mean: 388
DEBUG 2021/05/24 13:36:25 amplicon.py: run_child() 90:  samtools mpileup  -f amplicon.fasta -d 10000 my_sorted_bam.bam > sample.mpu
DEBUG 2021/05/24 13:36:25 amplicon.py: run_child() 99:  Child samtools returned 0
INFO 2021/05/24 13:36:25 amplicon.py: highest_entropy() 272:    start: 1, stop: 388
INFO 2021/05/24 13:36:25 amplicon.py: highest_entropy() 286:    highest entropy found at position 22
.
.
.
INFO 2021/05/24 13:36:25 amplicon.py: main() 372:   analysing region from -1 to 387

I have traced back in the code and it turns out that the region index return by function highest_entropy() in src/shorah/amplicon.pydoes not match my expectation. When all my reads have the same length, it will skip the steps finding the max entropy (https://github.com/cbg-ethz/shorah/blob/0334e4d509dc93152945a80d5b1a6dbd4af07e61/src/shorah/amplicon.py#L300 to line 304) and returns -1 for the start index of region. While from my understanding towards the code, the finding entropy part should at least be iterated one time. I found that's because the value of rsto is lower that rsta in this case. So I tried to fix this issue by changing line 298 from

rsto = min(stop - trimmed_mean, highest_ent_pos + 1)

to

rsto = min(stop - trimmed_mean + 2, highest_ent_pos + 1)

Also, in line 305, the high_ent_stop should ends where the region ends as the return value to match the the logic invoked the function, so I subtract 1 from high_ent_start + trimmed_mean, from

high_ent_stop = high_ent_start + trimmed_mean

to

high_ent_stop = high_ent_start + trimmed_mean -1

By applying these two changes, I am able to run the amplicon function on my data with equal length with the -d option, as the region is now fixed for this rare margin case. I don't know if I understand this part right and make the correct modification. Would you verify it for me to see if this is a bug?

niemasd commented 3 years ago

I'm running into the same error, also on v1.99.2, but even in shorah amplicon mode with default arguments:

$ shorah amplicon -b SEARCH-40821__D102066b__I02__210723_A00953_0359_BHGY7MDSX2__003.trimmed.sorted.bam -f NC_045512.2.fas
Traceback (most recent call last):
  File "/usr/local/bin/shorah", line 14, in <module>
    main()
  File "/usr/local/lib/python3.6/site-packages/shorah/cli.py", line 210, in main
    args.func(args)
  File "/usr/local/lib/python3.6/site-packages/shorah/cli.py", line 89, in amplicon_run
    amplicon.main(args)
  File "/usr/local/lib/python3.6/site-packages/shorah/amplicon.py", line 387, in main
    h = list(open('coverage.txt'))[0]
IndexError: list index out of range

So I'm not sure that this error is unique to the -d option, but rather, it seems general to shorah amplicon. Here's the contents of my log file:

INFO 2021/08/10 14:19:35 cli.py: main() 204:    /usr/local/bin/shorah amplicon -b ../SEARCH-40821__D102066b__I02__210723_A00953_0359_BHGY7MDSX2__003.trimmed.sorted.bam -f ../NC_045512.2.fas
INFO 2021/08/10 14:19:35 cli.py: main() 205:    shorah version:1.99.2

INFO 2021/08/10 14:19:35 amplicon.py: main() 352:       /usr/local/bin/shorah amplicon -b ../SEARCH-40821__D102066b__I02__210723_A00953_0359_BHGY7MDSX2__003.trimmed.sorted.bam -f ../NC_045512.2.fas
INFO 2021/08/10 14:19:35 amplicon.py: main() 372:       analysing region from 1 to 29903
DEBUG 2021/08/10 14:19:35 amplicon.py: run_child() 90:  /usr/local/bin/b2w -i 0 -w 29903 -m 28407 -x 10000 -c 0 ../SEARCH-40821__D102066b__I02__210723_A00953_0359_BHGY7MDSX2__003.trimmed.sorted.bam ../NC_045512.2.fas
DEBUG 2021/08/10 14:19:40 amplicon.py: run_child() 99:  Child /usr/local/bin/b2w returned 0
DEBUG 2021/08/10 14:19:40 amplicon.py: main() 380:      b2w returned 0

In addition to the log file, shorah amplicon created the following files, all of which are 0 bytes:

The edits proposed by XU-Nuo seem to fix the issue for -d mode on my end as well, but not in the default parameters

DrYak commented 3 years ago

Thank you for reporting. I hope I'll have time to check this next week...

AdmiralenOla commented 2 years ago

Same issue here. I wonder if the problem might be the arguments sent to b2w.

Lines 377-378 in amplicon.py:

    b2w_args = ' -i 0 -w %d -m %d -x %d -c %i%s %s %s %s' % (ref_length, int(
        min_overlap * ref_length), max_coverage, cov_thrd, d, in_bam, in_fasta, region)

I'm not sure I understand what b2w does exactly, but this parameter combination of window length equal to reference length and minimum overlap equal to 95% of that always creates empty output files (coverage.txt and the *reads.fas files) for me.