Synchronise merges - Githubissues

pontushojer commented 4 years ago

Fix for https://github.com/FrickTobias/BLR/issues/214

Split clusterrmdup.py into two files:

find_clusterdups.py to find cluster duplicates
merge_clusterdups.py to correct barcodes identified as duplicates.

I implemented graph-based handling of connected barcodes (i.e. clusters) using the class BarcodeGraph in place of the previously used merge_dict. This makes it simpler to merge different graphs for synchronisation and is, at least to me, easier to follow.
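To illustrate the idea, here is a minimal sketch of what a graph of connected barcodes could look like. It is not the BarcodeGraph from this PR: the method names (add_connected_barcodes, merge, components) and the internal dict-of-sets layout are assumptions made for the example.

```python
from collections import defaultdict


class BarcodeGraph:
    """Illustrative sketch only; not the class implemented in this PR.

    Barcodes are nodes, and an edge joins two barcodes flagged as duplicates.
    Each connected component is one cluster of barcodes to merge.
    """

    def __init__(self):
        # barcode -> set of directly connected barcodes (hypothetical layout)
        self.graph = defaultdict(set)

    def add_connected_barcodes(self, barcodes):
        """Connect every barcode in `barcodes` to all the others."""
        barcodes = set(barcodes)
        for barcode in barcodes:
            self.graph[barcode] |= barcodes - {barcode}

    def merge(self, other):
        """Add all nodes and edges from another graph, e.g. another chunk."""
        for barcode, neighbours in other.graph.items():
            self.graph[barcode] |= neighbours

    def components(self):
        """Yield each connected component as a set of barcodes."""
        seen = set()
        for start in list(self.graph):
            if start in seen:
                continue
            component, stack = set(), [start]
            while stack:
                node = stack.pop()
                if node in component:
                    continue
                component.add(node)
                stack.extend(self.graph[node] - component)
            seen |= component
            yield component
```

With this layout, synchronising two chunks is a union of their edges: if one chunk links A with B and another links B with C, the merged graph has a single component containing A, B and C.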

Updated workflow shown below.

rule find_clusterdups identifies cluster (barcode) duplicates in each chunk and outputs a pickle file (a dict) containing the duplicates. Each key in the dict is a barcode and the value is a set of connected barcodes.
rule get_barcode_merges takes in the pickle files from all chunks and merges the dicts to synchronise between chunks. It outputs a human-readable CSV, final.barcode-merges.csv, in the same format as previously output by clusterrmdup.py, i.e. <current_barcode>,<new_barcode> for each barcode to update.
rule merge_clusterdups takes in the CSV file and corrects each barcode based on it.
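The synchronisation in the three rules above can be sketched roughly as follows. This is a hedged illustration, not the code in this PR: the function names are hypothetical, and choosing the lexicographically smallest barcode of each cluster as the new barcode is an assumption made only so that the example is deterministic.

```python
import csv
import io
import pickle


def load_chunk(path):
    """Load one per-chunk pickle: a dict of barcode -> set of connected barcodes."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def merge_chunk_dicts(chunk_dicts):
    """Union the per-chunk dicts, keeping every connection symmetric."""
    merged = {}
    for chunk in chunk_dicts:
        for barcode, connected in chunk.items():
            merged.setdefault(barcode, set()).update(connected)
            for other in connected:
                merged.setdefault(other, set()).add(barcode)
    return merged


def barcode_merges(merged):
    """Map each barcode to update onto the barcode chosen for its cluster."""
    merges, seen = {}, set()
    for start in sorted(merged):
        if start in seen:
            continue
        # Depth-first walk to collect the connected component of `start`.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(merged[node] - component)
        seen |= component
        new_barcode = min(component)  # assumption: smallest barcode wins
        for barcode in component:
            if barcode != new_barcode:
                merges[barcode] = new_barcode
    return merges


def write_merges(merges, handle):
    """Write one <current_barcode>,<new_barcode> row per barcode to update."""
    writer = csv.writer(handle, lineterminator="\n")
    for current, new in sorted(merges.items()):
        writer.writerow([current, new])


# Two chunks that share the barcode AAC (toy data, not from a real run).
chunk1 = {"AAA": {"AAC"}, "AAC": {"AAA"}}
chunk2 = {"AAC": {"AAG"}, "AAG": {"AAC"}}
merges = barcode_merges(merge_chunk_dicts([chunk1, chunk2]))
out = io.StringIO()
write_merges(merges, out)
print(out.getvalue(), end="")
```

Neither chunk alone connects AAA to AAG. Only after the dicts are merged do all three barcodes fall into one cluster, which is why the merges have to be synchronised across chunks before any barcode is corrected.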

Previously	This PR

pontushojer commented 4 years ago

I am currently running tests on UPPMAX

pontushojer commented 4 years ago

Picking up on this after some time.

I have run this twice on UPPMAX. There is some time loss due to the new bottleneck introduced here, but it is not substantial. Below is the output of one run starting from the mapped reads (it failed later on, but this was unrelated to the changes introduced here). The find_clusterdups step starts around 40 min into the run and ends at about 200 min, where the get_barcode_merges step is. During this time we see a gradual reduction in the number of active cores, as expected.

I compared runtimes for chr1. The old clusterrmdup step took about 2 h. With this PR the find_clusterdups step takes about 1 h 40 min for chr1, followed by 15 min for the merge_clusterdups step. So about the same time.

One surprising thing I found was that chrY took about 3 h to complete the same steps. This was also the case for the old clusterrmdup step. I am unsure what accounts for this long runtime, since the read count and coverage are much lower for this chromosome. I will set up a new issue for continued investigation. It is, however, not relevant for this PR.

AfshinLab / BLR

Synchronise merges #30