ccagc / QDNAseq

QDNAseq package for Bioconductor

Parallelize binReadCounts by chromosome when chunkSize is used #44

sambrightman closed 7 years ago

sambrightman commented 7 years ago

This functionality is on by default and uses the default BiocParallel environment for the host machine. Often this means process-based parallelism, with the number of workers equal to the number of cores minus two.

BiocParallel lets client code configure the parallel environment without explicitly passing any parameters into binReadCounts, so the default can be changed as described in the BiocParallel documentation:

library(BiocParallel)
library(QDNAseq)
register(MulticoreParam(workers=8))
readCounts <- binReadCounts(bins, bams, chunkSize=TRUE)
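To check which back end would be picked up implicitly, the BiocParallel registry can be queried directly. This is a minimal sketch using only BiocParallel calls (`bpparam`, `bpnworkers`, `register`, `SerialParam`). Nothing in it is specific to QDNAseq, and the variable name `defaultParam` is just for illustration:

```r
library(BiocParallel)

# The back end at the top of the registry is the one used when no
# BPPARAM argument is passed explicitly.
defaultParam <- bpparam()
print(class(defaultParam))       # back-end class depends on the platform
print(bpnworkers(defaultParam))  # worker count for that back end

# Registering a serial back end switches parallelism off, which can be
# handy for debugging or on shared machines.
register(SerialParam())
print(bpnworkers(bpparam()))
```

Registering `SerialParam()` before calling binReadCounts should therefore be enough to fall back to single-process behaviour, assuming the function relies on the registered default as described above.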

There are further opportunities for parallelism: at the chunk level, the file level and also in the non-chunked version. File-level parallelism is rather easy to arrange outside of QDNAseq. Chunk-level parallelism currently seems unnecessary when already parallelizing by chromosome (especially since chunking by entire chromosomes fits in a few GB of memory).
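File-level parallelism outside QDNAseq could look roughly like the sketch below. It is an illustration of the pattern only: `countOneFile` is a hypothetical stand-in that reports the file size, and the temporary files stand in for BAM files so that the example runs without real data. In practice the function body would call binReadCounts on a single file.

```r
library(BiocParallel)

# Hypothetical stand-in for the real per-file work. A real version might be
# function(bam) binReadCounts(bins, bam, chunkSize=TRUE).
countOneFile <- function(path) file.size(path)

# Placeholder inputs: three small temporary files in place of BAM files.
files <- replicate(3, tempfile(fileext=".bam"))
for (i in seq_along(files)) writeLines(rep("x", i), files[i])

# One task per file; results come back as a list in input order.
perFile <- bplapply(files, countOneFile, BPPARAM=MulticoreParam(workers=2))
names(perFile) <- basename(files)
print(perFile)
```

When this is combined with per-chromosome parallelism inside each call, the number of processes may multiply, so it may be worth limiting one of the two levels.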

sambrightman commented 7 years ago

Fixed the DESCRIPTION file so that R CMD check passes. I didn't know which version to put, so I used the current version I get from Bioconductor.

daoud-sie commented 7 years ago

I have checked the data generated by this code and it all checks out perfectly.

Sorry for the long wait.

sambrightman commented 7 years ago

Great! Any chance of a tag/release with this?