Closed pontushojer closed 4 years ago
I have started a test run to see what kind of slowdown this introduces.
These are the results of the test run. The run was on 20 cores and took about 20 hours. I ran on the same dataset as @marcelm in https://github.com/NBISweden/BLR/pull/16, but I noticed that I had run using reference variants instead of calling them. So the time for calling variants should be added to this, but it is relatively quick. We have also noticed that phasing called variants takes much longer than using the reference ones, so I will do a quick test with calling variants to get a more accurate comparison.
So far it seems good though.
I reran the final steps following the mapping step with the latest commit and the result is shown below.
The second spike to the full 20 cores is right at the `filterclusters` step, which demonstrates the bottleneck created there. The runtime is not too affected, however.
How does it look now @FrickTobias? I moved the rules as you suggested and commented on the other things.
One last thing to resolve.
The last thing has been solved, so I will go ahead and merge this as soon as the tests are done.
Fix for issue https://github.com/FrickTobias/BLR/issues/215.
`buildmolecules.py` no longer tags reads with the MN tag; instead the `final.molecule_stats.tsv` is used for filtration. This introduces a slowdown in the processing, as all chunks need to finish the `buildmolecules` step to continue (see DAG below). The TSV is used to generate a list (TXT file) of barcodes that have too many molecules. This list is then used for all chunks to filter out barcodes in the `filterclusters` step.

Changes included:
- `buildmolecules.py`: Skip tagging reads with molecule number (MN tag) as these are no longer correct. Don't include "NrMolecules" in TSV data.
- rule `concat_molecule_stats`: TSV file should not include index.
- rule `get_barcodes_to_filter`: Count number of molecules per barcode using concatenated TSV to generate a list of barcodes that are above the config defined "max_molecules_per_bc" threshold.
- `filterclusters.py`: Filter BAM using list of barcodes from rule `get_barcodes_to_filter`.
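
For anyone reviewing, here is a rough sketch of the counting and filtering logic described above. It is not the code in this PR, just a minimal illustration under some assumptions: the concatenated TSV has a header row and one row per molecule, and the barcode column is called `Barcode` (that name, the other example columns, and the helper names `barcodes_to_filter` and `keep_read` are hypothetical). Only the "max_molecules_per_bc" threshold comes from the description.

```python
import csv
import io
from collections import Counter


def barcodes_to_filter(tsv_lines, max_molecules_per_bc, barcode_column="Barcode"):
    """Return the sorted barcodes that have more molecules than the threshold.

    Assumption: one row per molecule and a header naming the barcode column.
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    # Each row is one molecule, so counting rows per barcode counts molecules.
    molecules_per_barcode = Counter(row[barcode_column] for row in reader)
    return sorted(
        barcode
        for barcode, count in molecules_per_barcode.items()
        if count > max_molecules_per_bc
    )


def keep_read(read_barcode, filtered_barcodes):
    """Hypothetical per-read check: keep reads whose barcode is not listed."""
    return read_barcode not in filtered_barcodes


# Tiny made-up example standing in for the concatenated molecule stats TSV.
example_tsv = io.StringIO(
    "MoleculeID\tBarcode\tReads\n"
    "0\tAAAC\t12\n"
    "1\tAAAC\t7\n"
    "2\tAAAC\t9\n"
    "3\tGGTT\t15\n"
)
too_many = barcodes_to_filter(example_tsv, max_molecules_per_bc=2)
print(too_many)  # ['AAAC']
```

Because the list is built from the concatenated TSV, it can only be produced once every chunk has finished `buildmolecules`, which is where the bottleneck mentioned above comes from. Each chunk can then apply the same list independently.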