MikeAxtell / ShortStack

ShortStack: Comprehensive annotation and quantification of small RNA genes
MIT License
88 stars 29 forks source link

"IndexError when selecting alignment in al_pass_2 in ShortStack" #136

Closed zhongxiangSu closed 1 year ago

zhongxiangSu commented 1 year ago

Dear ShortStack developers, I encountered an error while using ShortStack to analyze small RNA data. The error seems to occur during the second round of read alignment (al_pass_2). Specifically, I got the following IndexError when ShortStack tried to randomly select an alignment from the alignments list:

reads processed: 15392282 reads with at least one alignment: 15181845 (98.63%) reads that failed to align: 210437 (1.37%) Reported 266248020 alignments multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/lustre/home/suzhongxiang/Program/miniconda/miniconda3/envs/ShortStack/lib/python3.10/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, **kwds)) File "/lustre/home/suzhongxiang/Program/miniconda/miniconda3/envs/ShortStack/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar return list(itertools.starmap(args[0], args[1])) File "/lustre/home/suzhongxiang/Program/miniconda/miniconda3/envs/ShortStack/bin/ShortStack", line 925, in al_pass_2 lib_counters = al_block_process2(lib_counters, File "/lustre/home/suzhongxiang/Program/miniconda/miniconda3/envs/ShortStack/bin/ShortStack", line 1160, in al_block_process2 chosen = random.choices(alignments, weights=probs)[0] File "/lustre/home/suzhongxiang/Program/miniconda/miniconda3/envs/ShortStack/lib/python3.10/random.py", line 533, in choices total = cum_weights[-1] + 0.0 # convert to float IndexError: list index out of range """

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/lustre/home/suzhongxiang/Program/miniconda/miniconda3/envs/ShortStack/bin/ShortStack", line 3582, in merged_bam, read_count = align(args, fai, trimmed_files) File "/lustre/home/suzhongxiang/Program/miniconda/miniconda3/envs/ShortStack/bin/ShortStack", line 705, in align rs2_results = pool.starmap(al_pass_2, File "/lustre/home/suzhongxiang/Program/miniconda/miniconda3/envs/ShortStack/lib/python3.10/multiprocessing/pool.py", line 375, in starmap return self._map_async(func, iterable, starmapstar, chunksize).get() File "/lustre/home/suzhongxiang/Program/miniconda/miniconda3/envs/ShortStack/lib/python3.10/multiprocessing/pool.py", line 774, in get raise self._value IndexError: list index out of range

This appears to be caused by an empty cum_weights list, leading to incorrect indexing into alignments. I have tried adjusting parameters to get more candidate alignments, reducing thread counts, and updating to the latest ShortStack version, but the problem persists. Could you please help diagnose the root cause? I suspect it may be an internal bug in ShortStack's alignment selection logic. Your guidance on resolving this issue would be greatly appreciated. Please let me know if any additional info could help troubleshoot. Thanks in advance for your assistance! Best regards, zhongxiang

MikeAxtell commented 1 year ago

Thanks,

Can you tell me the version of ShortStack you are using?

Also, can you get in touch with me via email @.**@.>) and share your data so I can attempt to reproduce the error?

Mike

From: Xiang @.> Date: Monday, July 31, 2023 at 8:44 AM To: MikeAxtell/ShortStack @.> Cc: Subscribed @.***> Subject: [MikeAxtell/ShortStack] "IndexError when selecting alignment in al_pass_2 in ShortStack" (Issue #136)

Dear ShortStack developers, I encountered an error while using ShortStack to analyze small RNA data. The error seems to occur during the second round of read alignment (al_pass_2). Specifically, I got the following IndexError when ShortStack tried to randomly select an alignment from the alignments list:

reads processed: 15392282 reads with at least one alignment: 15181845 (98.63%) reads that failed to align: 210437 (1.37%)

Reported 266248020 alignments multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/lustre/home/suzhongxiang/Program/miniconda/miniconda3/envs/ShortStack/lib/python3.10/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, **kwds)) File "/lustre/home/suzhongxiang/Program/miniconda/miniconda3/envs/ShortStack/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar return list(itertools.starmap(args[0], args[1])) File "/lustre/home/suzhongxiang/Program/miniconda/miniconda3/envs/ShortStack/bin/ShortStack", line 925, in al_pass_2 lib_counters = al_block_process2(lib_counters, File "/lustre/home/suzhongxiang/Program/miniconda/miniconda3/envs/ShortStack/bin/ShortStack", line 1160, in al_block_process2 chosen = random.choices(alignments, weights=probs)[0] File "/lustre/home/suzhongxiang/Program/miniconda/miniconda3/envs/ShortStack/lib/python3.10/random.py", line 533, in choices total = cum_weights[-1] + 0.0 # convert to float IndexError: list index out of range """

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/lustre/home/suzhongxiang/Program/miniconda/miniconda3/envs/ShortStack/bin/ShortStack", line 3582, in merged_bam, read_count = align(args, fai, trimmed_files) File "/lustre/home/suzhongxiang/Program/miniconda/miniconda3/envs/ShortStack/bin/ShortStack", line 705, in align rs2_results = pool.starmap(al_pass_2, File "/lustre/home/suzhongxiang/Program/miniconda/miniconda3/envs/ShortStack/lib/python3.10/multiprocessing/pool.py", line 375, in starmap return self._map_async(func, iterable, starmapstar, chunksize).get() File "/lustre/home/suzhongxiang/Program/miniconda/miniconda3/envs/ShortStack/lib/python3.10/multiprocessing/pool.py", line 774, in get raise self._value IndexError: list index out of range

This appears to be caused by an empty cum_weights list, leading to incorrect indexing into alignments. I have tried adjusting parameters to get more candidate alignments, reducing thread counts, and updating to the latest ShortStack version, but the problem persists. Could you please help diagnose the root cause? I suspect it may be an internal bug in ShortStack's alignment selection logic. Your guidance on resolving this issue would be greatly appreciated. Please let me know if any additional info could help troubleshoot. Thanks in advance for your assistance! Best regards, zhongxiang

— Reply to this email directly, view it on GitHubhttps://github.com/MikeAxtell/ShortStack/issues/136, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABUJPCKZBZWKJIEYI3JGISLXS6SBHANCNFSM6AAAAAA26GPDOA. You are receiving this because you are subscribed to this thread.Message ID: @.***>

MikeAxtell commented 1 year ago

I have reproduced the error and I am starting to work on a fix.

MikeAxtell commented 1 year ago

I have identified the problem.

TL;DR: Quick work-around is to re-format your input FASTA reads like so:

cat GeJt3_G5.clean.fa | sed 's/^>@/>/' > GeJt3_G5.clean.noAT.fa

Your input reads are in the FASTA format. All of your FASTA headers begin with >@. For example: >@A01426:575:HVMF2DRX2:2:2101:3875:1000 1:N:0:CGAACTTA. When such reads are aligned and then written in the SAM format, each SAM alignment line begins with the @ character, which is the first character of your FASTA-encoded read names. In the SAM format spec, lines that begin with @HD, @SQ, @RG, @PG, or @CO are header lines, not alignment lines. ShortStack 4.0.2 has some lazy code that finds SAM headers lines by finding those start with @. Consequently, ShortStack thinks every line in the SAM file is a header line .. that is the source of the error.

Now, it is true that the FASTA spec does not forbid @ to be the first character of a sequence name. And technically, SAM headers are defined as starting with @HD, @SQ, @RG, @PG, or @CO, not just plain @ by itself. But, it is risky to have your reads named with all of them starting with the @ character.

Anyway, with the quick clean-up of your FASTA data using the one-liner above, you will be able to use the current ShortStack. I will work on a commit that cleans up the lazy code a bit in the meantime.

MikeAxtell commented 1 year ago

An update: I can make changes in the ShortStack code, but there are other issues that prevent the use of FASTA files where the header lines begin with >@. For instance, samtools will not convert a SAM file where the read names begin with @ into the BAM format.

So, this isn't quite a bug in ShortStack, it is mal-formed input data. Just avoid using FASTA files for input reads where the header lines begin with >@. You can just re-format your input data so the lines start with > instead of >@ and all will be fine.

zhongxiangSu commented 1 year ago

Okay! I understand now, thank you very much for your insightful answers.