elsasserlab / minute

MINUTE-ChIP data analysis workflow
https://minute.readthedocs.io
MIT License

BigWig computation rules are slow #89

Open marcelm opened 4 years ago

marcelm commented 4 years ago

As discussed in #80.

The scaled_bigwig and unscaled_bigwig rules are responsible for a good chunk of the pipeline's total runtime. It is possible that bamCoverage is quite inefficient for our case, and we may want to look into alternatives. For example, it needs 100 CPU minutes to compute a 69 MB BigWig file from an 850 MB BAM file (samtools view needs 16 seconds to iterate over the same file), which is longer than the actual read mapping takes. I believe this is slower than it needs to be and that there is some potential for improvement.

Especially when more bin sizes are added, this will definitely be a problem.

cnluzon commented 3 years ago

A couple of deepTools issues discuss performance. I am posting them here as a reminder that I actually looked into this, with negative results :_)

1) Here they seem to come to the conclusion that --normalizeUsing RPGC is slowing things down, since the get_scale_factor method takes up a lot of the execution time (they argue).

2) Here they also complain that deepTools used to perform poorly on small BAM files because it subsamples reads in order to estimate the scaling factor. They now provide an --exactScaling parameter that skips the sampling and uses all reads instead. I am not sure whether this would help; it depends on what they consider "big" and how much the estimation hurts performance.

I don't think either of these applies to our case: if they did, the unscaled rules would take significantly (or at least perceptibly) longer than the scaled ones, and according to the values in #80, this is not happening. The aggregated value is higher for the unscaled rules, but I think that is just because there are more of them (+ Input). Individual values don't seem to differ.

Anyway, I looked at the code. RPKM only applies a formula, while for RPGC they go to the BAM file to estimate the fragment length (this uses a mapReduce function that iterates until enough values are found, so BAM files with scarce coverage could be a problem, though that does not seem to be an issue for us). They then use `1.0 / (float(bam_mapped * fragment_length) / args.effectiveGenomeSize)` as the scaling factor.
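For reference, that formula boils down to a few lines (a sketch; `rpgc_scale_factor` is a hypothetical helper name, but the inputs are the same quantities deepTools works with):

```python
def rpgc_scale_factor(bam_mapped: int, fragment_length: int,
                      effective_genome_size: int) -> float:
    """RPGC (1x-coverage) scale factor: total sequenced bases
    divided by the effective genome size, inverted."""
    return 1.0 / (float(bam_mapped * fragment_length) / effective_genome_size)

# Example: 10 M mapped reads with ~200 bp fragments on a 2 Gb effective
# genome cover the genome exactly once, so the factor is 1.0.
print(rpgc_scale_factor(10_000_000, 200, 2_000_000_000))
```

So the only expensive part is the fragment-length estimation; the factor itself is trivial to compute once those three numbers are known.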

At this point in the pipeline execution, we have already calculated all those values, so we could basically feed that composite scaling factor instead of the one we use right now. I don't think this would help a lot for an individual bigWig file, but we do generate a large number of bigWig files, so maybe it would shave off some minutes per file. I'm not sure this is worth pursuing, though: in a mini test on a ~500 MB BAM file, it reduced execution time by about 10 seconds out of a total of 4 minutes (single thread, --binSize 50).
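A sketch of what feeding a precomputed factor could look like, assuming we pass it via bamCoverage's --scaleFactor option instead of requesting RPGC normalization (`bamcoverage_cmd` and the file names are hypothetical):

```python
def bamcoverage_cmd(bam: str, bigwig: str, scale_factor: float,
                    bin_size: int = 50) -> list[str]:
    """Build a bamCoverage call that passes a precomputed composite
    scale factor directly, rather than letting deepTools derive one."""
    return [
        "bamCoverage",
        "--bam", bam,
        "--outFileName", bigwig,
        "--binSize", str(bin_size),
        "--scaleFactor", f"{scale_factor:.6g}",
    ]

print(" ".join(bamcoverage_cmd("sample.bam", "sample.bw", 1.3729)))
```

This would skip the fragment-length estimation pass over the BAM file entirely, which is where the (modest) 10-second saving in the mini test came from.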

--binSize has a large influence on this, so we could also consider how much resolution we actually need in the other bigWig files we generate.
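To put a rough number on that: the amount of work and output scales with the number of bins, i.e. genome size divided by bin size (a back-of-the-envelope sketch, not a measurement):

```python
def n_bins(genome_size: int, bin_size: int) -> int:
    """Number of bins bamCoverage has to fill (ceiling division)."""
    return -(-genome_size // bin_size)

# Human genome, ~3.1 Gb: going from --binSize 50 to 500 cuts the
# bin count (and roughly the output size) by a factor of 10.
print(n_bins(3_100_000_000, 50))   # 62_000_000 bins
print(n_bins(3_100_000_000, 500))  # 6_200_000 bins
```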

marcelm commented 3 years ago

And for completeness, I want to mention https://github.com/deeptools/deepTools/issues/912, where I reported that bamCoverage is slow on BAM files with very few reads.