Closed: the-x-at closed this issue 3 years ago
I guess we would be faster without the sparse matrix, but then the memory demand would be larger. What would be your suggestion? We could try to reduce the amount of data stored per permutation to lower memory use, but I don't see a way to increase performance. The question is also whether we could parallelize this, e.g. by splitting the workload across 10 CPUs, each carrying out 10,000 simulations?
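A minimal sketch of that idea, assuming a Unix-like system and a hypothetical `run_simulation()` standing in for a single permutation step:

```r
library(parallel)

## Hypothetical stand-in for one permutation step of the simulation.
run_simulation <- function(i) {
    ## ... perform one permutation, return its statistic ...
}

n_total <- 100000   # total number of simulation steps
n_cores <- 10       # number of CPUs to spread them over

## mclapply forks the R session, so the (sparse) kinship matrix is
## shared read-only between the workers rather than copied upfront.
res <- mclapply(seq_len(n_total), run_simulation, mc.cores = n_cores)
```

Note that fork-based parallelization (`mclapply`) is not available on Windows; there `parLapply` with an explicit cluster would be needed, which does copy the data to each worker.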
The kinship matrix is almost constant in cost; it only depends on the size of the pedigree. The bigger issue derives from the number of simulations N and the number of affected members A. Here I present eleven runs with increasing N on the architecture mentioned in the initial post.
N | Time (s) | Max. mem (KB) |
---|---|---|
1,000 | 61 | 642,848 |
2,000 | 92 | 564,812 |
4,000 | 132 | 619,124 |
8,000 | 236 | 731,824 |
16,000 | 437 | 900,644 |
32,000 | 799 | 1,368,572 |
64,000 | 1,400 | 2,558,352 |
128,000 | 3,088 | 5,054,656 |
256,000 | 5,794 | 8,911,296 |
512,000 | 12,445 | 15,579,764 |
1,024,000 | 24,803 | 34,090,076 |
At the beginning we see some fluctuation, but then there is a linear relationship between N and runtime and, more importantly, the maximum memory used. The latter is attributed to the `lapply` loop that runs the simulation and stores each simulation step's result in a list. The memory consumption of a single list entry is proportional to the number of affected pedigree members A, so the overall memory consumption is proportional to A * N.
Some numbers: if we want to achieve a P-value precision of 10^-6, we need N = 10^6 simulation steps. This corresponds to the last line in the table and requires 34 GB of memory. Here we have A = 600 cases, which is 5% of the overall phenotyped population of this example; it may well happen that we encounter traits with A = 1000 cases, which would mean a memory usage of roughly 1000/600 * 34 ≈ 57 GB.
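The extrapolation, spelled out (numbers taken from the table above):

```r
mem_a600  <- 34                      # GB at A = 600, N = 1e6 (last table row)
mem_a1000 <- mem_a600 * 1000 / 600   # memory scales linearly with A
mem_a1000                            # ~56.7 GB
```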
Parallelization, as @jorainer has mentioned, would only shift the problem. It also creates issues with queuing systems: the user has to pass additional parameters to the scheduler, otherwise it kills the process for excess CPU usage.
My suggestion here is to tackle the way the simulation results are stored. Ideally, they are added to the final data structures after each simulation step. This may be tricky, as we also compute densities, which are based on the whole distribution and not on parts of it. (And the same problem applies to parallelization efforts.)
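A sketch of such incremental storage, with hypothetical `run_one_permutation()` and `observed` standing in for the actual simulation step and test statistic: preallocate a numeric vector, fill it step by step, and keep only one number per step instead of a full list entry.

```r
n_sim <- 1e6

## Preallocated result vector: O(N) numbers instead of O(A * N) list entries.
stat <- numeric(n_sim)

for (i in seq_len(n_sim)) {
    stat[i] <- run_one_permutation()  # hypothetical single permutation step
}

## Density and P value are still computed on the whole distribution.
p <- (sum(stat >= observed) + 1) / (n_sim + 1)
d <- density(stat)
```

If more than one number per step is needed, a preallocated matrix with one row per step would work the same way; the point is that the container's size is fixed before the loop starts.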
A kinship sum analysis on 10,000 individuals with about 1,200 cases and 100,000 simulation steps results in an extremely long runtime of more than a day, during which the application uses 40 GB of memory. The same calculation on the same machine with only 400 cases finishes in 4 hours (specs: 64 GB memory and 8 Quad-Core AMD Opteron 8356 processors). Possible causes are costly sparse matrix column access operations and the list that grows while storing the intermediate results of each permutation step in an `lapply` loop.
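The sparse matrix hypothesis could be checked in isolation with the Matrix package, e.g. by timing repeated column extraction on a kinship-like symmetric sparse matrix against a dense copy (sizes here are scaled down so the dense copy fits comfortably in memory):

```r
library(Matrix)

set.seed(42)
n <- 2000
## Kinship-like symmetric sparse matrix with ~1% non-zero entries.
m_sparse <- rsparsematrix(n, n, density = 0.01, symmetric = TRUE)
m_dense  <- as.matrix(m_sparse)

cols <- sample(n, 1000, replace = TRUE)

system.time(for (j in cols) m_sparse[, j])  # sparse column access
system.time(for (j in cols) m_dense[, j])   # dense column access
```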