Closed pontushojer closed 4 years ago
Thanks for your comments anyway, @marcelm! If there are no other objections I will merge this.
BTW, I also did a test run on the same dataset as @marcelm and this is the result. The gain here is a bit smaller; the max use is around 20 GB.
I also want to point out that, given how the code is written, the memory load from each chromosome is more or less stacked on top of the previous ones. Splitting the run over chromosomes would therefore most likely distribute the load according to the read count in each chromosome. For chromosome 1 in this dataset, that would translate to a max load below 2 GB, which is enough even when running on a single core on uppmax (about 3.6 GB per core).
No objections here :-), looking forward to trying this out when I’m done with the `parallel` branch.
By the way, thanks for the pointer to psrecord, looks really useful!
> By the way, thanks for the pointer to psrecord, looks really useful!
Yes, I found it while looking into this issue! It was really easy to use and did everything I wanted.
@marcelm noted that `buildmolecules.py` used a considerable amount of memory. This PR tries to minimise the script's memory usage. I also factored out parts of the `build_molecules` function and made other fixes and style changes.

The main improvements are:

- The `cache_dict` object (now called `molecules_cache`) stored all `Molecule` instances and was only cleared for each chromosome. I now implemented an `OrderedDict` to keep track of the stored molecules and to report and remove any that are outside the current window.
- `bc_to_mol_dict` stored a set of full `Molecule` instances for each barcode. I changed this so that a list of dicts containing only the required information is kept.

This seems to have helped somewhat, as you can see from the figures below (run on chr22). The peak memory use is now down by about 55%.

I have done several test runs on chr22 to confirm that the output is the same. The only difference is the order of the columns in `molecule_stats.tsv`, where the column "NrMolecules" has moved to the end.

Figure 1: Profile for `master`, generated using psrecord.

Figure 2: Profile for `buildmol-memfix`, generated using psrecord.
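
For readers who want a feel for the two changes described above, here is a minimal sketch of the general idea, not the actual code in `buildmolecules.py`. The class name `MoleculeCache`, its methods, the record fields and the window size are all hypothetical. It assumes reads arrive sorted by position, keeps open molecules in an `OrderedDict` ordered by when they were last extended, and stores finished molecules as small dicts instead of full objects.

```python
from collections import OrderedDict


class MoleculeCache:
    """Hypothetical stand-in for a windowed molecule cache (barcode -> open molecule)."""

    def __init__(self, window):
        self.window = window
        self._open = OrderedDict()  # barcode -> {"start", "stop", "reads"}
        self.finished = []          # compact dict records, not full objects

    def add_read(self, barcode, start, stop):
        mol = self._open.get(barcode)
        if mol is not None and start - mol["stop"] <= self.window:
            # Read is close enough: extend the open molecule.
            mol["stop"] = max(mol["stop"], stop)
            mol["reads"] += 1
        else:
            if mol is not None:
                # Same barcode but too far away: close the old molecule first.
                self._report(barcode)
            self._open[barcode] = {"start": start, "stop": stop, "reads": 1}
        # Most recently touched molecules live at the end of the OrderedDict.
        self._open.move_to_end(barcode)
        self._evict(start)

    def _evict(self, position):
        # Pop from the front (least recently touched) until the front entry
        # is still within the window. A stale entry hidden behind a fresh one
        # is picked up on a later call or by flush().
        while self._open:
            barcode, mol = next(iter(self._open.items()))
            if position - mol["stop"] <= self.window:
                break
            self._report(barcode)

    def _report(self, barcode):
        mol = self._open.pop(barcode)
        self.finished.append({"barcode": barcode, **mol})

    def flush(self):
        # Call at the end of a chromosome to close everything still open.
        for barcode in list(self._open):
            self._report(barcode)


# Made-up reads as (barcode, start, stop), sorted by start position.
reads = [("A", 0, 50), ("B", 10, 60), ("A", 80, 130), ("B", 500, 550), ("A", 1000, 1050)]
cache = MoleculeCache(window=100)
for barcode, start, stop in reads:
    cache.add_read(barcode, start, stop)
open_before_flush = len(cache._open)
cache.flush()
```

With this layout the number of molecules held in memory depends on how many are open within the current window rather than on everything seen so far on the chromosome, which is the effect the PR description attributes to the `OrderedDict` change.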