EuracBiomedicalResearch / FamAgg

This is the development version of the FamAgg Bioconductor package.
https://EuracBiomedicalResearch.github.io/FamAgg
MIT License

Reduce memory footprint #22

Closed the-x-at closed 3 years ago

the-x-at commented 3 years ago

A kinship sum analysis on 10,000 individuals with about 1,200 cases and 100,000 simulation steps results in an extremely long runtime of more than a day. During this run, the application uses 40 GB of memory. The same calculation on the same machine with only 400 cases finishes in 4 hours (specs: 64 GB memory and 8 quad-core AMD Opteron 8356 processors). Possible causes are costly sparse-matrix column access operations and a list that keeps growing as the intermediate result of each permutation step is stored inside an lapply loop.

jorainer commented 3 years ago

I guess we would be faster without the sparse matrix, but then the memory demand would be even larger. What would be your suggestion? We could try to reduce the amount of data stored in each permutation to lower the memory use, but I don't see a way to increase performance. Another question is whether we could parallelize this, e.g. by splitting the workload across 10 CPUs, each carrying out 10,000 simulations.
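Something like the following rough sketch, using BiocParallel's bplapply; `run_chunk()` is only a placeholder for one block of the kinship sum permutations, not the actual FamAgg code:

```r
## Rough sketch: split nsim permutations across several workers and combine
## the simulated statistics afterwards (run_chunk() is a placeholder, not
## the actual FamAgg simulation code).
library(BiocParallel)

nsim    <- 100000
workers <- 10
chunk_sizes <- rep(nsim / workers, workers)   # 10 chunks of 10,000 simulations

run_chunk <- function(n) {
  ## would perform n permutations of the affected status and return the
  ## simulated kinship sums; replaced here by a dummy draw
  rnorm(n)
}

res  <- bplapply(chunk_sizes, run_chunk,
                 BPPARAM = MulticoreParam(workers = workers))
sims <- unlist(res)   # combined null distribution from all workers
```

Each worker would only hold its own chunk while running, but the combined result is of course still the full set of simulations in the end.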

the-x-at commented 3 years ago

The kinship matrix is almost constant; it only depends on the size of the pedigree. The bigger issue comes from the number of simulations, N, and the number of affected members, A. Below are eleven runs with increasing N, carried out on the architecture mentioned in my initial post.

| N | Time (s) | Max. mem (KB) |
| ---: | ---: | ---: |
| 1,000 | 61 | 642,848 |
| 2,000 | 92 | 564,812 |
| 4,000 | 132 | 619,124 |
| 8,000 | 236 | 731,824 |
| 16,000 | 437 | 900,644 |
| 32,000 | 799 | 1,368,572 |
| 64,000 | 1,400 | 2,558,352 |
| 128,000 | 3,088 | 5,054,656 |
| 256,000 | 5,794 | 8,911,296 |
| 512,000 | 12,445 | 15,579,764 |
| 1,024,000 | 24,803 | 34,090,076 |

At the beginning we see some fluctuations, but then runtime grows linearly with N, and, more importantly, so does the maximum memory used. The latter is attributed to the lapply loop that runs the simulation and stores each simulation step's result in a list. The memory consumption of a single list entry is proportional to the number of affected pedigree members, A, so the overall memory consumption is proportional to A * N.

Some numbers: if we want to achieve a P value precision of 10^-6, we need N = 10^6 simulation steps. This corresponds to the last line in the table and requires 34 GB of memory. Since we have A = 600 cases here, which is 5% of the overall phenotyped population of this example, we may well encounter traits with A = 1000 cases, which would result in a memory usage of roughly 1000/600 * 34 ≈ 57 GB.
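A quick back-of-the-envelope check of that extrapolation, assuming memory scales linearly with the number of cases A (illustrative only, numbers taken from the table and discussion above):

```r
## Extrapolate memory use from A = 600 to A = 1000 cases, assuming
## memory is proportional to A (at fixed N = 1,024,000 simulations).
mem_gb_observed <- 34    # last table row, A = 600 cases
a_observed <- 600
a_expected <- 1000
mem_gb_observed * a_expected / a_observed
#> [1] 56.66667
```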

Parallelization, as @jorainer mentioned, would only push the problem a bit further away. It also creates issues when using queuing systems, as the user has to pass additional parameters to the queuing system; otherwise it will kill the process due to excess CPU usage.

My suggestion is to tackle the way the simulation results are stored. Ideally, they are added to the final data structures after each simulation step, so no per-step results need to be kept. This may be tricky, as we also compute densities, which are based on the whole distribution and not on parts of it. (The same problem applies to parallelization efforts.)
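To make the idea concrete, here is a minimal sketch of per-step accumulation for the empirical p-values; `simulate_sums()` and the observed statistics are hypothetical placeholders, not the actual FamAgg internals. The full distribution needed for the densities is not covered by this and would have to be approximated incrementally, e.g. with a fixed-bin histogram.

```r
## Minimal sketch: update running exceedance counts after each permutation
## instead of storing every result in a growing list (placeholders only;
## simulate_sums() stands in for one FamAgg permutation step).
set.seed(123)
n_affected <- 600
nsim       <- 100000

obs_sums <- runif(n_affected)                 # placeholder observed kinship sums
exceed   <- integer(n_affected)               # running exceedance counts

simulate_sums <- function(k) runif(k)         # placeholder permutation

for (i in seq_len(nsim)) {
  sim    <- simulate_sums(n_affected)
  exceed <- exceed + (sim >= obs_sums)        # aggregate, then discard sim
}

pvals <- (exceed + 1) / (nsim + 1)            # empirical p-values
```

With this, memory stays proportional to A instead of A * N, regardless of how many simulation steps are run.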