Reducing redundancy of reads

pgonzale60 commented 7 years ago

Hi Mike,

I've been using ShortStack in a couple of projects and it has been a very useful and easy to use tool in order to assign multimapping reads and to annotate non-coding regions with small RNA-seq data. However, I think it could be improved in matter of resources by reducing redundancy from the input reads. There are tools like tally that summarize the reads. I think ShortStack can make use of that by only altering the way it counts the density, taking into account that a read is actually a representative of x number of reads.

Fore example, I ran tally. It gave me fasta files with identifiers >trn${uniqueID}${numberOfReadsItRepresents}. Then, I substitute ++$$density{$key} in: `else {

A unique mapper

            $read_bucket[0] .= "\tXX:i:1";
            ++$counts{'Uq'};
            ++$sum;
            ## Add to density hash, unless run under --none or --random
            unless(($$options{'mmap'} eq 'n') or
               ($$options{'mmap'} eq 'r')) {
            $bin = int ($left_pos / 50);
            $key = "$chr" . ":" . "$bin";
            ++$$density{$key};

}`

to

my ($number) = $read_bucket[0] =~ /.*trn_\d+_(\d+).*/; $$density{$key} += $number;

Bests, Pablo

MikeAxtell commented 7 years ago

Thanks Pablo. It's great to hear that you've found ShortStack useful and easy to use.

You are absolutely right; collapsing reads would for sure speed mapping, and there are ways, like the one you pointed out, to integrate collapsed reads into the density hash used for placements. Believe me, I've considered this before. But there are several other issues:

I've deliberately designed ShortStack to both accept and produce standard BAM files. People generally expect that BAM files have a one read - one SAM line (per placement) relationship. Yes, we could code into the optional tags a custom optional TAG that indicates frequency of the sequence, but that would break a lot of downstream things that people do with BAM files. Such as using them on a genome browser to show coverage plots.
The internal workings of how ShortStack finds clusters of expression also depends on the one read - one SAM line assumption. Would require some re-working of that code.

This is not to say it couldn't be done; it could, and as I write this I can think of work-arounds for both of the issues above .. for issue 1 I could output a both a 'reduced' BAM file and a 'full' BAM file. For issue 2 I would have to buckle down and re-work all of the relevant code (there's a lot of it, actually). But do-able.

I'll mark this as an enhancement, and when I have time sometime I will play with it.

Thanks again.

MikeAxtell commented 6 years ago

I'm going over old issues that are still open, including this one. After thinking about it I am marking this as a 'wontfix' for now. I really do want to keep the bam files that ShortStack creates and operates on standard.

MikeAxtell / ShortStack

Reducing redundancy of reads #60

A unique mapper