With the introduction of https://github.com/NBISweden/BLR/pull/16 for parallel processing of chunks, read information is no longer synchronised across chunks. For `buildmolecules` this means that the MN tag (which keeps track of the number of molecules connected to a barcode) is no longer correct, since it is set independently within each chunk. As a result, the filtering in `filterclusters` cannot be performed correctly.
I brought this up as part of https://github.com/NBISweden/BLR/pull/16 (see comments there) and also discussed it with @FrickTobias. My idea would be to scrap the MN tag assigned in `buildmolecules` and instead use the file `final.molecule_stats.tsv` to find which barcodes to filter out. This file is already merged across all chunks and would only need some simple processing before it can be used in `filterclusters` to filter out barcodes with too many molecules.
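
A minimal sketch of how the "simple processing" could look, not a finished implementation. It assumes that `final.molecule_stats.tsv` has a header row, one row per molecule, and a column holding the barcode; the column name `Barcode`, the function name `barcodes_to_filter` and the `max_molecules` parameter are placeholders chosen for illustration and would need to be matched to the real file and to the threshold used in `filterclusters`.

```python
import csv
from collections import Counter
from io import StringIO


def barcodes_to_filter(tsv_file, max_molecules, barcode_column="Barcode"):
    """Return the set of barcodes that have more than max_molecules molecules.

    Assumption: the TSV has a header row and one row per molecule, so counting
    rows per barcode gives the number of molecules for that barcode.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    molecules_per_barcode = Counter(row[barcode_column] for row in reader)
    return {
        barcode
        for barcode, count in molecules_per_barcode.items()
        if count > max_molecules
    }


# Made-up example data standing in for final.molecule_stats.tsv.
example = StringIO(
    "MoleculeID\tBarcode\tReads\n"
    "1\tAAAT\t10\n"
    "2\tAAAT\t12\n"
    "3\tAAAT\t8\n"
    "4\tCCGT\t20\n"
    "5\tGGTA\t7\n"
    "6\tGGTA\t9\n"
)
print(sorted(barcodes_to_filter(example, max_molecules=2)))  # ['AAAT']
```

`filterclusters` could then drop reads whose barcode is in the returned set instead of relying on the per-chunk MN tag. Because the counts come from the merged file, they are the same regardless of how the reads were split into chunks.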