FrickTobias / BLR

MIT License

Rewrite `clusterrmdup` to synchronise information over chunks. #214

Closed pontushojer closed 4 years ago

pontushojer commented 4 years ago

With the introduction of https://github.com/NBISweden/BLR/pull/16 for parallel processing of chunks, read information is no longer synchronised across chunks. For the `clusterrmdup` step this means that clusters merged in one chunk can be left intact in others. This may not be a big issue for some analysis steps, but it would definitely impact the ability to detect interchromosomal variants, e.g. translocations.

I brought this up as part of https://github.com/NBISweden/BLR/pull/16 (see comments there) and also discussed it with @FrickTobias. Our idea is to parse the BAM once to get a list of clusters (barcodes to merge). This list is then synchronised across all chunks before updating the barcode information (BX tag).
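
To make the idea concrete, here is a rough sketch of the synchronisation step. It is not the BLR implementation, only an illustration under some assumptions: each chunk is assumed to report pairs of barcodes that should be merged, reads are stood in for by plain dicts with a `BX` key, and the function names `build_merge_map` and `update_barcodes` are made up for this example. The pairs from all chunks are combined with a union-find so that every chunk ends up using the same canonical barcode.

```python
def build_merge_map(pairs_per_chunk):
    """Combine per-chunk barcode pairs into one global old -> canonical map.

    pairs_per_chunk: iterable of lists of (barcode_a, barcode_b) tuples,
    one list per chunk (hypothetical input format).
    """
    parent = {}

    def find(barcode):
        # Register unseen barcodes as their own root, then walk to the root.
        parent.setdefault(barcode, barcode)
        while parent[barcode] != barcode:
            parent[barcode] = parent[parent[barcode]]  # path halving
            barcode = parent[barcode]
        return barcode

    for pairs in pairs_per_chunk:
        for a, b in pairs:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Arbitrary but deterministic choice: keep the
                # lexicographically smallest barcode as the canonical one.
                keep, drop = sorted((root_a, root_b))
                parent[drop] = keep

    return {barcode: find(barcode) for barcode in list(parent)}


def update_barcodes(reads, merge_map):
    """Rewrite the BX value of each read using the shared merge map."""
    for read in reads:
        read["BX"] = merge_map.get(read["BX"], read["BX"])
    return reads


# Toy example: chunk 1 merges AAA/CCC, chunk 2 merges CCC/GGG.
merge_map = build_merge_map([
    [("AAA", "CCC")],
    [("CCC", "GGG")],
    [("TTT", "TTA")],
])
reads = update_barcodes([{"BX": "GGG"}, {"BX": "ACG"}], merge_map)
```

Because the map is built from all chunks before any BX tag is touched, a merge found in one chunk (AAA/CCC) and a merge found in another (CCC/GGG) end up in the same cluster everywhere. Barcodes that never appear in a merge pair are left unchanged.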