TF-Chan-Lab / miRDeep-P2_pipeline

GNU General Public License v3.0

The predicted miRNAs have no counts during alignment #6

Open lixiaoling123 opened 9 months ago

lixiaoling123 commented 9 months ago

Thank you very much for developing the tool. I have successfully analyzed my data using the miRDeep-P2 pipeline. However, I have encountered some issues and would like to seek your advice:

When I used miRDP2-v1.1.4_pipeline.bash to predict miRNAs for 20 differently processed samples, each sample yielded approximately 250-400 predicted miRNAs according to its filter_P_prediction file (for example, sample1.reads_filter_P_prediction). After parsing the predictions for all 20 samples using parse_miRDP2_prediction.pl, I obtained 2511 miRNA sequences in the miRDP2_mature.fa file. I am quite puzzled about why so many miRNAs were predicted. Additionally, when I ran bam2ref_counts.pl and combine_htseq_counts.pl, the count_table.txt file showed that approximately 1500 miRNAs had counts of 0 in all samples. I am unsure whether these miRNAs with zero expression counts are genuine.

Furthermore, I observed that when I concatenated the sRNA data from all samples using "cat sample1.fa sample2.fa ... > all.sample.fa" and ran miRDP2-v1.1.4_pipeline.bash on the merged data, only about 300 miRNAs were predicted according to all.sample_filter_P_prediction. Is it more appropriate to merge all sequencing data first and then predict miRNAs?

alanlamsiu commented 9 months ago

@lixiaoling123 According to the miRDP2 user manual, the filter_P_prediction file is the final output of predicted miRNAs. The _predictions file used in this pipeline is an intermediate file, produced before some of the later steps (including filtering) that lead to the final output. This pipeline is deliberately greedy: it collects all predicted sequences so that they can be filtered in later steps using counts and annotation. Predicted sequences with a count of 0 in all samples should be filtered out.
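As a rough illustration of the count-based filtering step, the following Python sketch drops every row whose counts are 0 in all samples. It is only a sketch under stated assumptions: it assumes count_table.txt is tab-delimited, has a header row, carries the miRNA ID in the first column, and holds integer counts in the remaining columns. The actual layout written by combine_htseq_counts.pl may differ, and the function name `drop_all_zero_rows` and the example rows are made up for illustration.

```python
import csv
import io


def drop_all_zero_rows(lines):
    """Return (header, kept_rows), keeping rows with a nonzero count in any sample.

    Assumed layout (not confirmed by the pipeline documentation):
    tab-delimited text, one header row, miRNA ID in column 1,
    integer per-sample counts in the remaining columns.
    """
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    kept = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        counts = [int(value) for value in row[1:]]
        if any(count > 0 for count in counts):
            kept.append(row)
    return header, kept


# Made-up example table; these IDs and counts are not real data.
example = (
    "miRNA\tsample1\tsample2\n"
    "miR_a\t0\t0\n"
    "miR_b\t5\t0\n"
    "miR_c\t0\t12\n"
)
header, kept = drop_all_zero_rows(io.StringIO(example))
print([row[0] for row in kept])
```

To apply this to a real table, an open file handle can be passed in place of the `io.StringIO` object, for example `with open("count_table.txt", newline="") as handle: header, kept = drop_all_zero_rows(handle)`.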

Regarding whether to combine read data from multiple samples for miRDP2, I personally would not suggest doing so. One reason is that the way reads pile up at a genomic location can differ considerably between an individual sample and multiple samples combined. My understanding is that miRDP2 considers read alignment when predicting miRNAs, so combining multiple samples may complicate the prediction process.