Clinical-Genomics / BALSAMIC

Bioinformatic Analysis pipeLine for SomAtic Mutations In Cancer
https://balsamic.readthedocs.io/
MIT License

Parallelise alignment per lane instead of on concatenated fastqs #1077

Closed mathiasbio closed 1 year ago

mathiasbio commented 1 year ago

Is your feature request related to a problem? Please describe.

Decreasing the turn-around time for these analyses is important. Typically a sequencing run produces 4 sets of fastq-pairs for one sample. Currently alignment is done on fastq-files where these 4 sets have been concatenated.

Depending on how busy the cluster is at the moment of analysis, this means that in the worst-case scenario alignment will take roughly 4x longer than necessary. Fortunately, Sentieon bwa mem is already fast, so in the end this means that aligning a 120X WGS sample takes ~100 minutes instead of ~30 minutes.
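To make the per-lane split concrete, here is a minimal sketch of grouping a sample's fastq files into one R1/R2 pair per lane, so that each pair could be aligned as its own job. It assumes Illumina-style file names (`<sample>_S<n>_L<lane>_R<read>_001.fastq.gz`); the function name and the pattern are hypothetical and not taken from BALSAMIC.

```python
import re
from collections import defaultdict

# Assumption: Illumina-style names such as tumor_S1_L001_R1_001.fastq.gz.
# This pattern is illustrative only, not BALSAMIC's actual naming rule.
FASTQ_PATTERN = re.compile(
    r"^(?P<sample>.+)_S\d+_L(?P<lane>\d{3})_R(?P<read>[12])_001\.fastq\.gz$"
)


def group_fastqs_by_lane(filenames):
    """Return {lane: {"R1": file, "R2": file}} for the given fastq file names."""
    lanes = defaultdict(dict)
    for name in filenames:
        match = FASTQ_PATTERN.match(name)
        if match is None:
            raise ValueError(f"Unexpected fastq file name: {name}")
        lanes[match["lane"]]["R" + match["read"]] = name

    # Every lane needs both mates before it can be aligned as a pair.
    for lane, pair in lanes.items():
        if set(pair) != {"R1", "R2"}:
            raise ValueError(f"Lane {lane} is missing a mate: {pair}")

    return dict(sorted(lanes.items()))


fastqs = [
    f"tumor_S1_L00{lane}_R{read}_001.fastq.gz"
    for lane in range(1, 5)
    for read in (1, 2)
]
pairs_per_lane = group_fastqs_by_lane(fastqs)
```

With 4 lanes this yields 4 independent pairs, which is what would let the scheduler run the alignments side by side instead of on one concatenated pair.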

Perhaps more important for the turn-around time is the associated issue: https://github.com/Clinical-Genomics/BALSAMIC/issues/1053, which suggests that implementing this parallel mapping, where the RG can be assigned to each lane, would remove the need for many of the steps implemented in the mergeBam rules. These steps take a lot of time (~20 hours for a 120X tumor), so skipping them would improve turn-around time considerably.

Finally, it is recommended by Sentieon to align and assign read-groups separately per lane. Personally I don't understand how this information is used by downstream tools but I can imagine it might be useful to collect information on a per lane basis for evaluating a variant-call. So perhaps implementing this will improve variant-calling as well.
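As a rough illustration of what "assign read-groups separately per lane" could look like, the sketch below builds one `@RG` header string per lane in the form that `bwa mem -R` accepts (with literal `\t` separators). The ID/PU scheme, the flowcell name, and the function name are assumptions for the example, not what BALSAMIC or Sentieon prescribe.

```python
def build_read_group(sample, flowcell, lane, platform="ILLUMINA"):
    """Build an @RG string with literal '\\t' separators, as passed to `bwa mem -R`.

    Assumption: ID is "<flowcell>.<lane>" and LB is the sample name.
    This scheme is hypothetical and only meant to show one RG per lane.
    """
    fields = [
        ("ID", f"{flowcell}.{lane}"),           # unique per lane
        ("SM", sample),                         # same sample across all lanes
        ("PL", platform),
        ("LB", sample),
        ("PU", f"{flowcell}.{lane}.{sample}"),  # platform unit
    ]
    return "@RG" + "".join(f"\\t{key}:{value}" for key, value in fields)


# Hypothetical sample and flowcell names, one read group per lane.
read_groups = {
    lane: build_read_group("tumor", "FLOWCELL1", lane)
    for lane in ("001", "002", "003", "004")
}
```

All lanes share the same `SM` value, so the per-lane BAMs still belong to one sample after merging, while each lane keeps a distinct `ID`.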

Describe the solution you'd like

The minimal requirements I've identified to be able to do this are:

Current affected rule if snakemake workflow related

Primarily: sentieon_align_sort, sentieon_dedup, fastp, fastqc

Secondary: downstream rules that might be affected by changes in read-group info. Thinking about somalier, for instance.

Current BALSAMIC version balsamic --version

mathiasbio commented 1 year ago

Unsure of how to prioritise this issue. It seems to depend a bit on how quickly it can be implemented into production. If it takes too long, it might be better to prioritise this issue first: https://github.com/Clinical-Genomics/BALSAMIC/issues/1053 just to speed up the analysis, and then begin working on the long-term fix.

pbiology commented 1 year ago

This feels like it overlaps with #1109. Should we close one of these?

mathiasbio commented 1 year ago

It does! I renamed the feature issue to make it clearer that it did not involve only alignment per lane, as the PR naturally expanded to make a few other changes. But maybe it makes more sense to just keep this issue and remove https://github.com/Clinical-Genomics/BALSAMIC/issues/1109 ?

pbiology commented 1 year ago

I think this is the request and then it is refined in #1109.

I think #1109 should be where we keep track of all PRs and solutions we add to implement the feature (that way we can also try to keep our PRs small and on scope). And in that feature issue we should also keep track of the requests which led to the feature being developed (meaning this issue).

In fact, perhaps the feature issue should include a section about which feature-requests are solved by it.

mathiasbio commented 1 year ago

Closing this as it has been refined in https://github.com/Clinical-Genomics/BALSAMIC/issues/1109