FrickTobias / BLR

Rewrite `buildmolecules` and `filterclusters` to synchronise information over chunks. #215

Closed (pontushojer closed this 4 years ago)

pontushojer commented 4 years ago

Since https://github.com/NBISweden/BLR/pull/16 introduced parallel processing of chunks, read information is no longer synchronised across chunks. For `buildmolecules`, this means the MN tag (which tracks the number of molecules connected to a barcode) is no longer correct, because it is set independently within each chunk. As a result, the filtering in `filterclusters`, which relies on that tag, cannot be performed correctly.

I brought this up as part of https://github.com/NBISweden/BLR/pull/16 (see the comments there) and also discussed it with @FrickTobias. My idea is to scrap the MN tag assigned in `buildmolecules` and instead use the file `final.molecule_stats.tsv` to find which barcodes to filter out. This file is already merged across all chunks and would need only some simple processing to identify barcodes with too many molecules, which `filterclusters` could then filter out.
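
A rough sketch of what that processing could look like, not the actual implementation. It assumes `final.molecule_stats.tsv` has a header row and one row per molecule with a barcode column; the column name `Barcode`, the function name `barcodes_to_filter`, and the `max_molecules` parameter are all hypothetical and would need to match the real file and CLI.

```python
import csv
from collections import Counter


def barcodes_to_filter(stats_path, max_molecules, barcode_column="Barcode"):
    """Return the set of barcodes that have more than `max_molecules` molecules.

    Assumption: the TSV has a header and one row per molecule, with the
    barcode stored in `barcode_column` (hypothetical column name).
    """
    molecules_per_barcode = Counter()
    with open(stats_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            # Each row is one molecule, so counting rows counts molecules.
            molecules_per_barcode[row[barcode_column]] += 1

    return {
        barcode
        for barcode, count in molecules_per_barcode.items()
        if count > max_molecules
    }
```

`filterclusters` could then build this set once from the merged file and strip the barcode from any read whose barcode is in it, independent of how the reads were split into chunks.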