Closed: pontushojer closed this 4 years ago
I am currently running tests on UPPMAX.
Picking up on this after some time.
I have run this twice on UPPMAX. There is some time loss due to the new bottleneck introduced here, but it is not substantial. Below is the output of one run starting from the mapped reads (it failed later on, but this was unrelated to the changes introduced here). The `find_clusterdups` step starts around 40 min into the run and ends at about 200 min, where the `get_barcode_merges` step is. During this time we see a gradual reduction in the number of active cores, as expected.
I compared runtimes for chr1. The old `clusterrmdup` step took about 2 h. With this PR, the `find_clusterdups` step takes about 1 h 40 min for chr1, followed by 15 min for the `merge_clusterdups` step. So about the same time overall.
One surprising thing I found was that chrY took about 3 h to complete the same steps; this was also the case for the old `clusterrmdup` step. I am unsure what accounts for this long runtime, since read count and coverage are much lower for this chromosome. I will set up a new issue for continued investigation. It is, however, not relevant for this PR.
Fix for https://github.com/FrickTobias/BLR/issues/214
Split clusterrmdup.py into two files:
I implemented graph-based handling of connected barcodes (i.e. clusters) using the class BarcodeGraph in place of the previously used `merge_dict`. This enables simpler merging of different graphs for synchronisation and is, at least to me, easier to follow.
Updated workflow shown below.
- `find_clusterdups` identifies cluster (barcode) duplicates in each chunk and outputs a pickle file (a `dict`) containing the duplicates. Each key in the dict is a barcode and the value is a set of connected barcodes.
- `get_barcode_merges` takes in the pickle files from all chunks and merges the dicts to synchronise between chunks. It outputs a human-readable CSV, `final.barcode-merges.csv`, in the same format as previously output by `clusterrmdup.py`, i.e. `<current_barcode>,<new_barcode>` for each barcode to update.
- `merge_clusterdups` corrects each barcode based on the CSV file.
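To make the synchronisation step more concrete, here is a minimal sketch of how per-chunk dicts of connected barcodes could be merged and turned into `<current_barcode>,<new_barcode>` rows. It is not the actual `BarcodeGraph` class from this PR. All function names are hypothetical, and keeping the lexicographically smallest barcode in each cluster is an assumption made only for illustration.

```python
import csv
import io
from collections import defaultdict


def merge_chunk_dicts(chunk_dicts):
    """Combine per-chunk {barcode: set(connected barcodes)} dicts into one graph."""
    graph = defaultdict(set)
    for chunk in chunk_dicts:
        for barcode, connected in chunk.items():
            graph[barcode].update(connected)
            # Store edges in both directions so traversal reaches the whole cluster.
            for other in connected:
                graph[other].add(barcode)
    return graph


def connected_components(graph):
    """Yield each set of barcodes that are directly or indirectly connected."""
    seen = set()
    for start in list(graph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        yield component


def barcode_merges(graph):
    """Map every barcode to replace onto the barcode kept for its cluster.

    Assumption (hypothetical): the smallest barcode in each cluster is kept.
    """
    merges = {}
    for component in connected_components(graph):
        canonical = min(component)
        for barcode in component:
            if barcode != canonical:
                merges[barcode] = canonical
    return merges


def write_merges_csv(merges, handle):
    """Write one 'current_barcode,new_barcode' row per barcode to update."""
    writer = csv.writer(handle, lineterminator="\n")
    for current, new in sorted(merges.items()):
        writer.writerow([current, new])


# Two chunks that each see part of the same cluster (made-up barcodes).
chunk_a = {"AAAA": {"CCCC"}}
chunk_b = {"CCCC": {"GGGG"}, "TTTT": {"TTTA"}}

merges = barcode_merges(merge_chunk_dicts([chunk_a, chunk_b]))

buffer = io.StringIO()
write_merges_csv(merges, buffer)
csv_text = buffer.getvalue()

# Correcting barcodes is then a lookup that leaves unlisted barcodes unchanged.
corrected = [merges.get(bc, bc) for bc in ["GGGG", "TTTT", "ACGT"]]
```

In this sketch `AAAA`, `CCCC` and `GGGG` end up in one cluster even though no single chunk links all three, which is the reason the per-chunk results need to be synchronised before barcodes are corrected.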