EuracBiomedicalResearch / FamAgg

This is the development version of the FamAgg Bioconductor package.
https://EuracBiomedicalResearch.github.io/FamAgg
MIT License

Reduce memory footprint #22

Closed the-x-at closed 3 years ago

the-x-at commented 3 years ago

A kinship sum analysis on 10,000 individuals with about 1,200 cases and 100,000 simulation steps results in an extremely long runtime of more than a day. During this run, the application uses 40 GB of memory. The same calculation on the same machine with only 400 cases finishes in 4 hours (specs: 64 GB memory and 8 quad-core AMD Opteron 8356 processors). Possible causes are costly sparse-matrix column access operations and a list that keeps growing as the intermediate result of each permutation step is stored inside an lapply loop.

jorainer commented 3 years ago

I guess we would be faster without the sparse matrix, but then the memory demand would be even larger. What would be your suggestion? We could try to reduce the amount of data stored in each permutation to lower the memory use, but I don't see a way to increase performance. Another question is whether we could parallelize this, e.g. by splitting the workload across 10 CPUs, each carrying out 10,000 simulations.
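Something like the following rough sketch, using BiocParallel's bplapply; `run_chunk()` is only a placeholder for one block of the kinship sum permutations, not the actual FamAgg code:

```r
## Rough sketch: split nsim permutations across several workers and combine
## the simulated statistics afterwards (run_chunk() is a placeholder, not
## the actual FamAgg simulation code).
library(BiocParallel)

nsim    <- 100000
workers <- 10
chunk_sizes <- rep(nsim / workers, workers)   # 10 chunks of 10,000 simulations

run_chunk <- function(n) {
  ## would perform n permutations of the affected status and return the
  ## simulated kinship sums; replaced here by a dummy draw
  rnorm(n)
}

res  <- bplapply(chunk_sizes, run_chunk,
                 BPPARAM = MulticoreParam(workers = workers))
sims <- unlist(res)   # combined null distribution from all workers
```

Each worker would only hold its own chunk while running, but the combined result is of course still the full set of simulations in the end.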

the-x-at commented 3 years ago

The kinship matrix is almost constant; it only depends on the size of the pedigree. The bigger issue comes from the number of simulations, N, and the number of affected members, A. Below are eleven runs with increasing N, carried out on the architecture mentioned in my initial post.

| N | Time (s) | Max. mem (KB) |
| ---: | ---: | ---: |
| 1,000 | 61 | 642,848 |
| 2,000 | 92 | 564,812 |
| 4,000 | 132 | 619,124 |
| 8,000 | 236 | 731,824 |
| 16,000 | 437 | 900,644 |
| 32,000 | 799 | 1,368,572 |
| 64,000 | 1,400 | 2,558,352 |
| 128,000 | 3,088 | 5,054,656 |
| 256,000 | 5,794 | 8,911,296 |
| 512,000 | 12,445 | 15,579,764 |
| 1,024,000 | 24,803 | 34,090,076 |

At the beginning we see some fluctuations, but then runtime grows linearly with N, and, more importantly, so does the maximum memory used. The latter is attributed to the lapply loop that runs the simulation and stores each simulation step's result in a list. The memory consumption of a single list entry is proportional to the number of affected pedigree members, A, so the overall memory consumption is proportional to A * N.

Some numbers: if we want to achieve a P value precision of 10^-6, we need N = 10^6 simulation steps. This corresponds to the last line in the table and requires 34 GB of memory. Since we have A = 600 cases here, which is 5% of the overall phenotyped population of this example, we may well encounter traits with A = 1000 cases, which would result in a memory usage of roughly 1000/600 * 34 ≈ 57 GB.
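A quick back-of-the-envelope check of that extrapolation, assuming memory scales linearly with the number of cases A (illustrative only, numbers taken from the table and discussion above):

```r
## Extrapolate memory use from A = 600 to A = 1000 cases, assuming
## memory is proportional to A (at fixed N = 1,024,000 simulations).
mem_gb_observed <- 34    # last table row, A = 600 cases
a_observed <- 600
a_expected <- 1000
mem_gb_observed * a_expected / a_observed
#> [1] 56.66667
```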

Parallelization, as @jorainer mentioned, would only push the problem a bit further away. It also creates issues when using queuing systems, as the user has to pass additional parameters to the queuing system; otherwise it will kill the process due to excess CPU usage.

My suggestion is to tackle the way the simulation results are stored. Ideally, they are added to the final data structures after each simulation step, so no per-step results need to be kept. This may be tricky, as we also compute densities, which are based on the whole distribution and not on parts of it. (The same problem applies to parallelization efforts.)
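To make the idea concrete, here is a minimal sketch of per-step accumulation for the empirical p-values; `simulate_sums()` and the observed statistics are hypothetical placeholders, not the actual FamAgg internals. The full distribution needed for the densities is not covered by this and would have to be approximated incrementally, e.g. with a fixed-bin histogram.

```r
## Minimal sketch: update running exceedance counts after each permutation
## instead of storing every result in a growing list (placeholders only;
## simulate_sums() stands in for one FamAgg permutation step).
set.seed(123)
n_affected <- 600
nsim       <- 100000

obs_sums <- runif(n_affected)                 # placeholder observed kinship sums
exceed   <- integer(n_affected)               # running exceedance counts

simulate_sums <- function(k) runif(k)         # placeholder permutation

for (i in seq_len(nsim)) {
  sim    <- simulate_sums(n_affected)
  exceed <- exceed + (sim >= obs_sums)        # aggregate, then discard sim
}

pvals <- (exceed + 1) / (nsim + 1)            # empirical p-values
```

With this, memory stays proportional to A instead of A * N, regardless of how many simulation steps are run.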