MikeAxtell / ShortStack

ShortStack: Comprehensive annotation and quantification of small RNA genes
MIT License

IndexError: list index out of range #130

Closed meerveld96 closed 1 year ago

meerveld96 commented 1 year ago

Hi,

I ran ShortStack (4.0.1) with this command:

./ShortStack/ShortStack --genomefile ../G_pallida_Rookmaker_RH89-039-16_potato_genomes_combined.fasta --readfile Seresta_Rook_1_adapter_removed.fastq --knownRNAs ../forward_1/predicted_smallrnas.fasta --threads 100 --dn_mirna --dicermax 26 --outdir Seresta_Rook > Seresta_Rook.log

But I got this error; see the file below:

error.txt

Thanks for helping me out.

Best regards, Stefan

MikeAxtell commented 1 year ago

Can you:

  1. Run the test / example run detailed in the README, exactly as described. Does it work on your system?
  2. Send the full stdout/stderr from your failed run (what you posted was just a snippet). You can redirect stdout and stderr to a file, or run on a non-TTY, to not have to print the progress bar characters.
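Capturing both streams in one file can also be done from a small wrapper script. The following is a minimal sketch, assuming Python 3; `run_logged` is a hypothetical helper name and is not part of ShortStack. It does the same job as the shell redirection `> run.log 2>&1`, and the demonstration uses the Python interpreter itself as a harmless stand-in for the real command.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_logged(cmd, log_path):
    """Run cmd, writing stdout and stderr to one log file; return the exit code."""
    with Path(log_path).open("w") as log:
        # stderr=subprocess.STDOUT sends both streams to the same file.
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


# Stand-in command that writes one line to each stream.
demo_cmd = [sys.executable, "-c", "import sys; print('out'); print('err', file=sys.stderr)"]
demo_log = Path(tempfile.mkdtemp()) / "run.log"
exit_code = run_logged(demo_cmd, demo_log)
```

For a real run, the ShortStack command would be passed as a list of arguments in place of `demo_cmd`.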

meerveld96 commented 1 year ago

  1. I was able to run the test / example run without any errors; please correct me if I'm wrong: alignment_details.txt
  2. Sorry, I forgot the log file from ShortStack itself: ShortStack.log

I was able to run ShortStack on another sample: Complete_ShortStack.log

Please let me know if you need additional information.

MikeAxtell commented 1 year ago

Thanks, it's a bit of a puzzle. Your problematic run aborted at a step where it is parsing predicted RNA secondary structures. The specific failure is that it received an empty line from an RNAfold call where it should have retrieved a structure.
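To illustrate how an empty line can surface as this kind of error, here is a hedged sketch of parsing one RNAfold structure line. It is not ShortStack's actual code; `parse_structure_line` is a hypothetical function, and the only assumption about RNAfold is its usual output layout of a dot-bracket structure followed by the energy in parentheses.

```python
import re

# RNAfold prints the sequence on one line and "structure (energy)" on the next,
# for example: "((((....)))) ( -1.20)".
_STRUCTURE_LINE = re.compile(r"^([().]+)\s+\(\s*(-?\d+(?:\.\d+)?)\)\s*$")


def parse_structure_line(line):
    """Return (dot_bracket, mfe) parsed from one RNAfold structure line."""
    stripped = line.strip()
    if not stripped:
        # Naive parsing such as "".split()[0] raises
        # "IndexError: list index out of range" on an empty line.
        raise ValueError("RNAfold returned an empty line instead of a structure")
    match = _STRUCTURE_LINE.match(stripped)
    if match is None:
        raise ValueError(f"Unrecognized RNAfold structure line: {stripped!r}")
    return match.group(1), float(match.group(2))
```

Checking for the empty line first turns a bare IndexError into a message that names the failing step.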

I noticed that you are using an extreme number of --threads. It's just a guess, but there might be some issues with communication across nodes (assuming you are using more than one node on a cluster if you are grabbing 100 threads!).

Can you try to restrict to a single node, and a more reasonable number of threads (say 10 or so?).
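A wrapper script could enforce such a limit before launching the run. This is only a sketch: `capped_threads` is a hypothetical helper, and the default ceiling of 10 simply mirrors the number suggested above.

```python
import os


def capped_threads(requested, ceiling=10):
    """Limit a thread request to the local CPU count and a chosen ceiling."""
    local_cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(requested, local_cpus, ceiling))
```

The returned value would then be passed to --threads in place of the raw request.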

Another workaround is to skip MIRNA identification (omit the --knownRNAs option and do not set the --dn_mirna option), although that is not great if you actually want ShortStack to annotate MIRNA loci for you.

If that fails, I will ask you to share your genome and fastq so I can try to reproduce the error on my end.

meerveld96 commented 1 year ago

Thanks for your suggestions. I used 100 threads because of the computation time; even with 100 threads the run took a full working day to finish. I can first lower the number of threads (to 5). Omitting the --knownRNAs option is not ideal in our situation.

MikeAxtell commented 1 year ago

ShortStack's read alignment phase is the most time- and CPU-intensive, because of the treatment of multi-mapping reads. In your case, your initial run should have completed read alignment successfully. You can retrieve the .bam file (and its index) that was made in the failed run, and use it as input to the --bamfile option when testing. That will save some time.
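A rerun that reuses the existing alignment could be assembled as sketched below. The option names are the ones used in this thread; the function names, the file paths in any call, and the index suffixes checked (`.bai` or `.csi` next to the BAM) are assumptions, so check the README for your version before relying on them.

```python
from pathlib import Path


def has_index(bam):
    """Return True if a .bai or .csi index sits next to the BAM file."""
    return any(Path(str(bam) + suffix).exists() for suffix in (".bai", ".csi"))


def rerun_command(bam, genome, known_rnas, outdir, threads=10):
    """Build a ShortStack argument list that starts from an existing BAM file."""
    return [
        "ShortStack",
        "--genomefile", str(genome),
        "--bamfile", str(bam),          # replaces --readfile, so alignment is skipped
        "--knownRNAs", str(known_rnas),
        "--dn_mirna",
        "--threads", str(threads),
        "--outdir", str(outdir),
    ]
```

The resulting list can be handed to `subprocess.run` once `has_index` confirms that the index was copied along with the BAM.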

From: meerveld96 @.> Date: Tuesday, May 2, 2023 at 2:49 AM To: MikeAxtell/ShortStack @.> Cc: Axtell, Michael @.>, Comment @.> Subject: Re: [MikeAxtell/ShortStack] IndexError: list index out of range (Issue #130)


meerveld96 commented 1 year ago

Ah, thanks for the suggestion. For me, the "Candidates examined" step takes the longest.

# reads processed: 9468
# reads with at least one alignment: 3649 (38.54%)
# reads that failed to align: 5819 (61.46%)
Reported 371793 alignments
[bam_sort_core] merging from 0 files and 5 in-memory blocks...
Candidates examined:  12%|██████▎                                                | 42775/371793 [4:00:33<138:31:45,  1.52s/it]

MikeAxtell commented 1 year ago

Is your reference genome highly fragmented -- in 100s or 1000s of contigs/scaffolds? Slowness at this step could be due, in part, to a poorly assembled genome.
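A quick way to answer the fragmentation question is to count the header lines in the genome FASTA. A minimal sketch, assuming an uncompressed FASTA file; `count_fasta_records` is a hypothetical helper and the demo file is made up.

```python
import tempfile
from pathlib import Path


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting its header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Tiny made-up FASTA in a temporary directory, only to exercise the function.
demo_fasta = Path(tempfile.mkdtemp()) / "demo.fasta"
demo_fasta.write_text(">scaffold_1\nACGT\nACGT\n>scaffold_2\nGGCC\n>scaffold_3\nTTAA\n")
n_records = count_fasta_records(demo_fasta)
```

Pointing the function at the real genome file gives the number of contigs or scaffolds directly.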

Also, what is the source of your "knownRNAs"? Some of them must be highly repetitive. I see you have 3649 of them aligned, but in total there are ~372 thousand hits. The "knownRNAs" are meant to be known microRNA sequences only. Part of the slowness is that you are searching many highly repetitive hits for microRNA-like characteristics. Consider trimming your "knownRNAs" file to include just mature miRNA sequences known from your species or a closely related species.
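The repetitiveness can be put into a single number by dividing the reported alignments by the sequences that aligned at least once. The sketch below parses the summary lines posted earlier in the thread; `hits_per_aligned_read` is a hypothetical helper name.

```python
import re

# Summary lines copied from the log posted in this thread.
log_text = """\
# reads processed: 9468
# reads with at least one alignment: 3649 (38.54%)
# reads that failed to align: 5819 (61.46%)
Reported 371793 alignments
"""


def hits_per_aligned_read(text):
    """Return the mean number of reported alignments per aligned sequence."""
    aligned = int(re.search(r"reads with at least one alignment: (\d+)", text).group(1))
    reported = int(re.search(r"Reported (\d+) alignments", text).group(1))
    return reported / aligned


ratio = hits_per_aligned_read(log_text)
print(f"{ratio:.1f} genomic hits per aligned sequence")
```

For these numbers the average is roughly 102 genomic hits per aligned sequence, far more than would be expected if the file held only mature miRNA sequences.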


meerveld96 commented 1 year ago

The reference genome is distributed over 3078 scaffolds. These consist of two genomes, one is a relatively good genome (54 scaffolds) and the other is a fragmented genome (the rest).

I first pooled all samples and did an initial ShortStack run to find small RNAs in general (they are not well known for these specific genomes). I then gave the result (all sRNAs, because we are not only interested in miRNAs) as input to the --knownRNAs parameter for the second, per-sample ShortStack run, which then failed with this IndexError.

It is of course possible that these sequences are indeed very repetitive. Do you have any advice on how to deal with this? I understand that I should give only miRNAs to the knownRNAs parameter, but what about the rest of the small RNAs we are interested in? I also want to predict them as accurately as possible.

MikeAxtell commented 1 year ago

I suggest using only known microRNAs from other closely related species as input to the 'knownRNAs' option. You can also enable the 'dn_mirna' switch to turn on de novo microRNA searches.

For "small RNAs in general", ShortStack finds all clusters in the genome where the sRNA abundance exceeds the mincov threshold. These will all be reported. Most expressed small RNAs, especially in plants, are siRNAs, not microRNAs. ShortStack will report the most abundant single RNA from each of these loci (in the Results.txt file).
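The idea of calling clusters wherever coverage reaches a threshold can be illustrated with a toy example. This is not ShortStack's algorithm, which does more than the sketch shows; `clusters_above` is a hypothetical function, the coverage values are made up, and treating the threshold as inclusive is an assumption.

```python
def clusters_above(coverage, mincov):
    """Return half-open (start, end) intervals where coverage >= mincov."""
    clusters = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth >= mincov and start is None:
            start = pos                      # a new cluster opens here
        elif depth < mincov and start is not None:
            clusters.append((start, pos))    # the open cluster closes here
            start = None
    if start is not None:
        clusters.append((start, len(coverage)))
    return clusters


# Made-up per-position coverage along a short stretch of genome.
example = clusters_above([0, 3, 5, 5, 1, 0, 4, 4], mincov=3)
```

Every interval found this way would be reported as a locus, whether or not it also meets the MIRNA criteria.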

The 'knownRNAs' option specifies known mature microRNA matches in the reference genome where ShortStack will look "hard" to check for the MIRNA criteria.

I wish I had given the 'knownRNAs' option a different name, like 'known_micrornas', to make this more clear.


meerveld96 commented 1 year ago

Thanks, now I understand it better. I will try feeding ShortStack either known microRNAs from closely related species or the ones ShortStack generated the first time, when I pooled all the samples together; these are located in mir.fasta.

I do indeed work with plant material that has been infected with a pathogen.

Yes, I agree that renaming the --knownRNAs parameter would make it clearer.

MikeAxtell commented 1 year ago

As of release 4.0.2 the option has been renamed to --known_miRNAs and the documentation improved.
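Scripts that have to work on both sides of the rename can pick the option name from the installed version. A minimal sketch, assuming plain numeric version strings such as "4.0.1"; `known_mirna_option` is a hypothetical helper, and the sketch says nothing about releases older than 4.0, where the option may not exist at all.

```python
def known_mirna_option(version):
    """Return the option name for known miRNAs in a given 4.x release."""
    parts = tuple(int(piece) for piece in version.split("."))
    # The thread states the rename took effect in release 4.0.2.
    return "--known_miRNAs" if parts >= (4, 0, 2) else "--knownRNAs"
```

The returned string can be dropped into the argument list wherever the old option name was hard-coded.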