This is not a question about your scripts but about the original miRDP2 pipeline, but maybe you can help. It is also maybe a stupid question based on my own inability to follow the pipeline scripts.
The pipeline runs the script preprocess_reads-SAM.pl (or preprocess_reads.pl if you input FASTA files). That script applies an RPM filter to reads that align to known miRNAs:
# filter reads by sequence similarity to miRNA/ncRNA & reads rpm
foreach my $id(sort keys %reads_hash)
{
#print all non-ncRNA reads for signature preparaion
print PROCESSED ">$id\n$reads_hash{$id}\n";
#retrieve reads match known miRNA mature seq and exceeded rpm threshold
$id=~/_x(\d+)/;
if($miR_hash{$id} eq "T" || $1/$total_reads*1000000 >= $threshold){print FILTERED ">$id\n$reads_hash{$id}\n";}
}
print TOTALREADS "$total_reads";
This is confusing me. Is this only outputting reads that match to known miRNAs and meet the RPM threshold to FILTERED?` What about all the reads that do not match to known miRNAs?
Following through the pipeline scripts it looks like only the reads output to FILTERED are used to extract the precursor sequences for subsequent prediction (using PROCESSED would make more sense to me). I also do not understand how using non-redundant FASTA files as input would ever work as each known miRNA could then only ever have one perfectly matching read.
Lastly, doesn't this mean that if you use fastq input (which is redundant) that a precursor is created for each output read? Doesn't that create a huge amount of redundancy in the precursor folding as a precursor will be created for all reads that align and there could be many reads mapping to the same genomic position?
I am clearly missing something obvious (lack of perl knowledge most likely) and would really appreciated it you can point out what my mistake in reading this is!
This is not a question about your scripts but about the original miRDP2 pipeline, but maybe you can help. It is also maybe a stupid question based on my own inability to follow the pipeline scripts.
The pipeline runs the script
preprocess_reads-SAM.pl
(orpreprocess_reads.pl
if you input FASTA files). That script applies an RPM filter to reads that align to known miRNAs:This is confusing me. Is this only outputting reads that match to known miRNAs and meet the RPM threshold to
FILTERED
?` What about all the reads that do not match to known miRNAs?Following through the pipeline scripts it looks like only the reads output to
FILTERED
are used to extract the precursor sequences for subsequent prediction (usingPROCESSED
would make more sense to me). I also do not understand how using non-redundant FASTA files as input would ever work as each known miRNA could then only ever have one perfectly matching read.Lastly, doesn't this mean that if you use fastq input (which is redundant) that a precursor is created for each output read? Doesn't that create a huge amount of redundancy in the precursor folding as a precursor will be created for all reads that align and there could be many reads mapping to the same genomic position?
I am clearly missing something obvious (lack of perl knowledge most likely) and would really appreciated it you can point out what my mistake in reading this is!