TF-Chan-Lab / miRDeep-P2_pipeline

GNU General Public License v3.0
5 stars 1 forks source link

error while running the miRDP2 pipeline bash script #5

Open ramandeep2903 opened 9 months ago

ramandeep2903 commented 9 months ago

Hi, I have been receiving the following error while executing miRDP2-v1.1.4_pipeline.bash even after formatting the input fasta files:

Use of uninitialized value $id in hash element at bin/scripts/convert_bowtie_to_blast.pl line 89. Use of uninitialized value $id in hash element at bin/scripts/convert_bowtie_to_blast.pl line 89. Use of uninitialized value $id in hash element at bin/scripts/filter_alignments.pl line 195. Use of uninitialized value $id in hash element at bin/scripts/convert_bowtie_to_blast.pl line 89. Use of uninitialized value $id in hash element at bin/scripts/convert_bowtie_to_blast.pl line 89. Use of uninitialized value $id in hash element at bin/scripts/mod-miRDP.pl line 1222. Use of uninitialized value $id in hash element at bin/scripts/mod-miRDP.pl line 1223. Use of uninitialized value $id in hash element at bin/scripts/mod-miRDP.pl line 1224. Use of uninitialized value $id in hash element at bin/scripts/mod-miRDP.pl line 1225. Use of uninitialized value $subject_old in hash element at bin/scripts/mod-miRDP.pl line 688. Use of uninitialized value $lng in numeric le (<=) at bin/scripts/mod-miRDP.pl line 694. Use of uninitialized value $subject_old in hash element at bin/scripts/mod-miRDP.pl line 789. Use of uninitialized value $subject_old in hash element at bin/scripts/mod-miRDP.pl line 790. Use of uninitialized value $subject_old in hash element at bin/scripts/mod-miRDP.pl line 791. Use of uninitialized value $query in hash element at bin/scripts/mod-miRDP.pl line 1046. Use of uninitialized value $query in hash element at bin/scripts/mod-miRDP.pl line 1047. Use of uninitialized value $query in hash element at bin/scripts/mod-miRDP.pl line 1036. Use of uninitialized value $end in numeric le (<=) at bin/scripts/mod-miRDP.pl line 1090. Use of uninitialized value $beg in numeric le (<=) at bin/scripts/mod-miRDP.pl line 1090. Use of uninitialized value in numeric le (<=) at bin/scripts/mod-miRDP.pl line 1094.

Here is how my input fasta files looks like:

read18361379_x0 CCAAATGCCTCGTCATCTAATTAGTAACTGTAGGCACCATCAATATGACA read15801861_x0 TGCCCGGAACTGTAGGCACCATCAATCACAACCAACTCAGATCGGAAGAG read4496439_x1 AAAGATGAAAAGGACTTTGAACTGTAGGCACCATCAATACCAGCCACCTG read5177978_x0 GCCGGGAACTGTAGGCACCATCAATAAGGCTCTACTGAGATCGGAAGAGC

alanlamsiu commented 9 months ago

Hi @ramandeep2903,

Could you share the exact command line you are using? If possible, could you also upload a copy of the input file here?

ramandeep2903 commented 9 months ago

Here is the command that I used:

bin/miRDP2-v1.1.4_pipeline.bash -g V2genome/V2genome.fasta -x V2genome/V2genome.fa -q -b fastq_list.txt -o result_lane1

I uploaded the input file on dropbox folder.

alanlamsiu commented 9 months ago

The input file look fine. I think it is an issue that the name of the genome FASTA does not match the index. Please try to re-build the index using the following.

bowtie-build V2genome.fasta V2genome.fasta

The prefix of the genome index should be the same as the genome file name.

ramandeep2903 commented 9 months ago

I did what you said but I am receiving the same error.

alanlamsiu commented 9 months ago

I see. There could be some clues in the intermediate files. Could you also share any intermediates files generated during your run using the same Dropbox folder, including the reference fasta file?

ramandeep2903 commented 9 months ago

I uploaded the reference fasta file and bowtie indexed files to the dropbox. Can we take the further conversation on email?

alanlamsiu commented 9 months ago

I think the files look fine. You may need to use the -f option instead of -q, because your input is in .fa format. If your input is in .fq format, you can you -q.

ramandeep2903 commented 9 months ago

I am receiving the same error even after changing the code as you mentioned. I am also error in script_err file as follows:

Warning: Could not open read file "/gpfs/scratch/ramandeep.kaur/Anne/microRNA/lane1/trimmed/text_files/61S-1_input.fasta " for reading; skipping... Command: bowtie -v 0 -f bin/scripts/index/rfam_index /gpfs/scratch/ramandeep.kaur/Anne/microRNA/lane1/trimmed/text_files/61S-1_input.fasta

Warning: Could not open read file "/gpfs/scratch/ramandeep.kaur/Anne/microRNA/lane1/trimmed/text_files/61S-1_input.fasta " for reading; skipping... Command: bowtie -v 1 -f bin/scripts/index/mature_index /gpfs/scratch/ramandeep.kaur/Anne/microRNA/lane1/trimmed/text_files/61S-1_input.fasta

Warning: Could not find any reads in "result_lane1//-processed.fa"

reads processed: 0

reads with at least one reported alignment: 0 (0.00%)

reads that failed to align: 0 (0.00%)

No alignments Warning: Could not find any reads in "result_lane1//.fa"

reads processed: 0

reads with at least one reported alignment: 0 (0.00%)

reads that failed to align: 0 (0.00%)

No alignments Warning: Empty input file Error: No unambiguous stretches of characters in the input. Aborting... Command: bowtie-build -f result_lane1//_precursors.fa result_lane1//index/_precursors Could not locate a Bowtie index corresponding to basename "result_lane1//index/_precursors" Command: bowtie -a -v 0 -f result_lane1//index/_precursors result_lane1//_filtered.fa

alanlamsiu commented 9 months ago

It seems that your file /gpfs/scratch/ramandeep.kaur/Anne/microRNA/lane1/trimmed/text_files/61S-1_input.fasta is not accessible. I tested on the small file you provided earlier and there was no error. I can give it another test if you could upload another set of files, inlcuding the reference and input data.

ramandeep2903 commented 9 months ago

Sure, I uploaded another set of files with 3 input files including a reference file and indexed files. Please let me know if script works with you.

alanlamsiu commented 9 months ago

I did some tests using the files provided by miRDP2 and you. In the same Dropbox folder, you will see the GSM2094927 folder, which contains the outputs for the test data, suggesting miRDP2 is working properly. In the folders, 61S-1_input, 64S-2_input and 64S-3_input, correspoding to your data, you will see in the script_err file that the alignment rates are extremely low, leading to empty outputs in most of the files. I suspect that the read data and the reference genome are not from the same species. You may need to figure out whether there is any issues in the read data or reference genome.

PessieXiong commented 4 months ago

"Hi @alanlamsiu, Could you help me interpret my results? Is it possible to have so many reads failed?" Setting the index via positional argument will be deprecated in a future release. Please use -x option instead.

reads processed: 18856695

reads with at least one alignment: 1638833 (8.69%)

reads that failed to align: 17217862 (91.31%)

Reported 1638833 alignments

alanlamsiu commented 4 months ago

Hi @PessieXiong,

Could you provide more information for getting the alignment result? What reference you did you use to align the reads? How did you generate the reference? Did you align the reads to the genome and check the mapping rate? It will be helpful if you could provide as much information as possible.

PessieXiong commented 4 months ago

Hi @alanlamsiu, I've executed the following commands: bowtie-build miRDP2_mature.fa miRNA_ref.fa bowtie -v 0 --norc -S miRNA_ref.fa "/home/***/data/srnadata/SRR19568902.fq.gz"| samtools view -Sb - > SRR19568902.bam However, the mapping rate appears to be considerably low:

reads processed: 18856695

reads with at least one reported alignment: 1535784 (8.14%)

reads that failed to align: 17320911 (91.86%)

Reported 1535784 alignments Upon reviewing your instructions in step 3, where you mentioned, "In step 1, a miRNA sequences file, miRDP2_mature.fa, is generated. This file can be used as a reference for mapping," I opted to utilize miRDP2_mature.fa as the reference. Is my understanding correct?

Additionally, I have a question: In Step 2, you outlined that the result is divided into three categories. The provided codes can generate known and variant ones. For instance, in miRDP2_mature.fa, there are 678 records, with miRDP2_known: 43, miRDP2_variant: 160. Does this imply that the number of novel miRNAs equals 475?

Furthermore, could you clarify the differences between the single sample result suffix _filter_P_predictions and the combined result miRDP2_mature.fa, which represents the final predicted result?

alanlamsiu commented 4 months ago

Hi @PessieXiong,

The low mapping rate you've observed here can be related to multiple aspects. miRDeep-P2 is used for predicting miRNAs. Yet, there could be other small RNAs that are not predicted as miRNAs, e.g. they don't show the hairpin structure. There are two things you can check. First, you can compare the length distribution of the mapped and unmapped reads, assuming all reads have been properly trimmed, e.g. with adapters removed. You should be able to see the length distribution of those mapped reads falling into 19 - 25 nt, while those unmapped ones can be beyond this range. You can also mapped those unmapped reads to the genome, assuming the genome sequence is available. If the mappin rate is high, it is likely that miRDeep-P2 didn't predict them as miRNAs.

For your second point, you can regard those that are not known or variant miRNAs as novel.

For your last point, I would like to point out that according to instructions from the miRDeep-P2 manuscript written by the developers, the _filterP prediction file is the final output that is usable for downstream analyses. In this pipeline, we take a step back and use the _predictions, which is a greedy move to obtain a larger list of sequences to work with.