AfshinLab / BLR


Parallelize by chromosome #16

Closed marcelm closed 4 years ago

marcelm commented 4 years ago

Hi, parallelization by chromosome is now almost ready. The plots are missing/not working, but the rest should be fine.

Trimming reads and mapping is unchanged, but the sorted BAM file is then split into "chunks". A chunk consists of at least one full chromosome, but more are added if the total length does not exceed 20 Mbp. This means that the standard chromosomes chr1 to chr22, chrX and chrY are processed as individual chunks, while one extra chunk holds chrM and all the rest (unplaced scaffolds etc.). This way, we don’t create unnecessarily many jobs.

The name of a chunk is the name of its first chromosome. That is, the "chrM" chunk actually contains chrM and all the subsequent ones. It should not be too hard to extend this mechanism to split at centromeres (but then the name needs to include a starting coordinate).

I added a hidden chunk_size parameter to the configuration, which allows us to easily set it to 50_000 during the tests (creating two chunks for the test data).

The chunks are computed from the reference FASTA index file at Snakefile evaluation time. That means that the .fai file cannot be computed as part of the pipeline, and the user has to run samtools faidx manually beforehand. This simplifies the rules a lot and is a price worth paying IMO.
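For illustration, here is a minimal sketch of how the chunks could be computed from the `.fai` file, under one plausible reading of the grouping rule above (the function name and exact boundary behavior are assumptions, not the actual BLR code):

```python
def compute_chunks(fai_path, chunk_size=20_000_000):
    """Greedily group reference sequences into chunks of roughly
    chunk_size bp, reading lengths from a samtools faidx index.

    A chunk is closed as soon as it reaches chunk_size; anything left
    over (chrM, unplaced scaffolds, ...) ends up in one trailing chunk.
    Each chunk is named after its first chromosome.
    """
    chunks = []  # list of (chunk_name, [chromosome names]) pairs
    current, current_size = [], 0
    with open(fai_path) as fai:
        for line in fai:
            name, length = line.split("\t")[:2]  # .fai columns: name, length, ...
            current.append(name)
            current_size += int(length)
            if current_size >= chunk_size:
                chunks.append((current[0], current))
                current, current_size = [], 0
    if current:  # the "chrM and all the rest" chunk
        chunks.append((current[0], current))
    return chunks
```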

The split BAM files are written to chunks/{chunkname}.sorted.bam, and from there the other steps are done as before (tagbam, clusterrmdup, ..., variant calling, BQSR, phasing etc.).

Finally, both the finished BAMs and the phased VCFs are merged into final.bam and final.phased.vcf.

I cannot do any more today. I hope to be able to fix the plot issue next week, but since I am mostly on vacation, perhaps I need some help with that.

I’ve just started a test run on HG002, though, and I am curious to see how that goes.

Things still missing:

marcelm commented 4 years ago

The test run crashed because of bugs in HapCUT2. Weirdly enough, they didn’t get triggered when running all chromosomes at once. I’m re-running with HapCUT2 from master, which apparently contains fixes.

pontushojer commented 4 years ago

I have done two commits.

The first enables plotting by concatenating the TSV files from the chunks. I had to change the MoleculeID values to ensure that they are unique for each chunk.

The next commit fixes an issue where `ln -rs` was used to generate links in the chunks folder. Unfortunately, the -r/--relative option is not available on macOS, so I replaced it with a Python function with the same functionality.
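For reference, a portable replacement for `ln -rs` takes only a few lines of Python; a minimal sketch (the function name is illustrative, not necessarily the one used in the commit):

```python
import os

def symlink_relative(target, link_name):
    """Create link_name pointing to target via a relative path,
    mimicking GNU "ln -rs target link_name" on platforms (such as
    macOS) whose ln lacks the -r/--relative option."""
    link_dir = os.path.dirname(os.path.abspath(link_name))
    rel_target = os.path.relpath(os.path.abspath(target), start=link_dir)
    os.symlink(rel_target, link_name)
```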

pontushojer commented 4 years ago

@marcelm So I reverted parts of my earlier edits, but decided to still create the concatenated TSV, since I find it quite useful. I modified plots.py to accept multiple TSV files with molecule stats, but the Snakemake plot_figures rule still uses the concatenated file.

marcelm commented 4 years ago

Hm, I’m not sure if I’ve done this correctly because it looks almost too good, but it appears that HG002 now finishes in less than 24 hours (compared to 3 days before): rackham-snic2019-3-583-marcel-14608438. And there seems to be some room for improvement still, because CPU utilization is quite low towards the end.

marcelm commented 4 years ago

CPU usage goes down about four hours before the pipeline finishes. During that time, the pipeline runs the final steps (mostly running the phasing part of HapCUT2) for chr1, chr2 and chr3. This is expected as these are the biggest chunks. So now our speed is essentially limited by how long it takes to run all the steps for chr1.

We could reduce wall-clock runtime by a couple more hours by making the chunks smaller (cutting at centromeres as suggested by @FrickTobias), but that should be a separate PR, and it is less urgent at the moment.

I will see to fixing the MD5 sum problem and then mark this PR as ready for review.

marcelm commented 4 years ago

The tests didn’t start because the NBISweden organization has now run out of testing minutes. I’ve run them locally and they pass.

@pontushojer If you run the tests locally on macOS and they fail because of MD5 sum mismatches, please ignore those. I am going to suggest to remove the MD5 sum checks anyway as they are too brittle.

marcelm commented 4 years ago

One regression remains: the final.bam file does not contain any unmapped reads, which in turn means that the `reads.[12].final.fastq.gz` files do not contain them either.

pontushojer commented 4 years ago

That looks awesome! Fantastic that it finishes within one day :) I will run the tests on macOS.

Two possible issues occurred to me that I am still thinking about how to handle. They could be addressed in a separate PR, but I thought I would raise them here.

**1. Merging clusters with shared duplicates**

The clusterrmdup step merges barcodes with shared duplicates. To do this it reads the BAM twice: first to find the duplicates, then to re-tag the reads with merged barcodes. In the parallel pipeline this is done separately for each chunk, which means that reads sharing a barcode might be tagged differently depending on which chunk they are in.

Some problems I could think of related to this are:

One solution would be to read each chunk once and collect data on barcodes with shared duplicates. This data could then be combined across all chunks and used to re-tag reads consistently. This would however mean that we need to wait until data from all chunks has been collected before progressing.
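To make the idea concrete, here is a hedged sketch (hypothetical names, not the actual clusterrmdup code) of how per-chunk results could be combined: each chunk emits pairs of barcodes found to share duplicates, and a union-find structure merges them into globally consistent clusters before the re-tagging pass:

```python
def find(parent, barcode):
    # Follow parent links (with path compression) to the representative barcode.
    while parent.setdefault(barcode, barcode) != barcode:
        parent[barcode] = parent[parent[barcode]]
        barcode = parent[barcode]
    return barcode

def combine_chunk_merges(per_chunk_pairs):
    """per_chunk_pairs: for each chunk, a list of (barcode_a, barcode_b)
    pairs that share duplicates. Returns a mapping barcode -> merged barcode."""
    parent = {}
    for pairs in per_chunk_pairs:
        for a, b in pairs:
            root_a, root_b = find(parent, a), find(parent, b)
            if root_a != root_b:
                parent[root_b] = root_a  # union the two barcode clusters
    return {bc: find(parent, bc) for bc in parent}
```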

**2. Filtering out barcodes with high molecule counts**

Here there is a similar problem. We want to filter out barcode clusters with too many called molecules, as these might have overlapping molecules. To do this, each mapped read is tagged in the buildmolecules step with an MN:i tag giving the number of molecules found for that barcode. Then, in the filterclusters step, barcodes with a molecule count above the threshold of 260 molecules are filtered out. When this is done per chunk, the molecule counts are not aggregated, so the filtering has to be done differently.

One solution would be to scale the threshold relative to the size of each chunk. This would however not filter out barcodes on other chunks that we suspect contain overlapping molecules.

Another solution would be similar to my solution for the first issue: collect data for all chunks, then aggregate it and use it to filter out barcodes. The issue here, again, is that we would need to wait for all chunks to finish before aggregation and filtering can be done.
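A hedged sketch of that aggregation (with hypothetical names): each chunk reports a per-barcode molecule count, the counts are summed across chunks, and barcodes above the threshold are then filtered in every chunk:

```python
from collections import Counter

def barcodes_to_filter(per_chunk_counts, threshold=260):
    """per_chunk_counts: for each chunk, a dict of barcode -> number of
    molecules called in that chunk. Returns the set of barcodes whose
    genome-wide molecule count exceeds the threshold."""
    total = Counter()
    for counts in per_chunk_counts:
        total.update(counts)  # sums counts per barcode across chunks
    return {bc for bc, count in total.items() if count > threshold}
```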

What do you think? Input from @FrickTobias would also be good.

pontushojer commented 4 years ago

I ran the tests on my Mac and I get the MD5 sums from before your change in commit e4f2882. Clearly there is an OS difference; let’s just remove these checks as they are more confusing than useful.

marcelm commented 4 years ago

How is a shared duplicate defined?

pontushojer commented 4 years ago

> How is a shared duplicate defined?

Two positions (where a position is defined as the unique combination of the read pair’s span (start, stop) and direction) at most W bp apart (window, default 100 kbp) that share more than one barcode.
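As a toy illustration of that definition (the data layout is assumed, not BLR’s actual representation):

```python
def is_shared_duplicate(pos_a, barcodes_a, pos_b, barcodes_b, window=100_000):
    """pos_* = (start, stop, direction) of a read pair; barcodes_* = set of
    barcodes observed at that position. Distance is measured between start
    coordinates here for simplicity."""
    if abs(pos_a[0] - pos_b[0]) > window:
        return False
    return len(barcodes_a & barcodes_b) > 1
```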

FrickTobias commented 4 years ago

I think this is fine for now, in the interest of time.

However, Pontus is correct that both clusterrmdup and filtermolecules (or possibly buildmolecules, the one that sets the number-of-molecules tag) require results from all chromosomes. I propose we accept this as is and create a new issue for it. Only BLR will be affected, as it deals with artifacts specific to shaking the emulsion.

Since we now handle multiple platforms, it might also be good to add conditions for skipping filtermolecules and clusterrmdup when running a non-BLR method.

The way forward for BLR is probably to rewrite these two steps. One idea would be to split up their functions so that the first step reads a BAM file and writes results to an output file (rather than saving to a dict), and a separate function compiles a summarized result file, after which per-chromosome processing can proceed. This would mean some waiting for the smaller chromosomes, but I think it should not increase the total start-to-finish time much.

TL;DR: Looks good; merge now and create issues for the above.

pontushojer commented 4 years ago

I will merge this PR as it seems to work quite well and most of what we wanted is integrated. This also allows me and @FrickTobias to continue with the new issues FrickTobias/BLR#214 and FrickTobias/BLR#215 raised in the discussion here.