dougspeed / LDAK


ldak6 crashes #10

Open caldodge opened 1 month ago

caldodge commented 1 month ago

We are running the latest LDAK6 in a Red Hat 8.8 cluster, using the Slurm scheduler. When the job runs, we get the following output:

(base) [laip@esplhpccompbio-lv01 MegaPRS]$ cat JOB1003351.out


LDAK - Software for obtaining Linkage Disequilibrium Adjusted Kinships and Loads More Version 6 - Help pages at www.dougspeed.com


There are 3 pairs of arguments: --join-cors cors --corslist list.txt --max-threads 6


Joining 23 correlations


Processing correlations from cors1.cors.bin (File 1 out of 23)
Processing correlations from cors2.cors.bin (File 2 out of 23)
Processing correlations from cors3.cors.bin (File 3 out of 23)
Processing correlations from cors4.cors.bin (File 4 out of 23)
Processing correlations from cors5.cors.bin (File 5 out of 23)
Processing correlations from cors6.cors.bin (File 6 out of 23)
Processing correlations from cors7.cors.bin (File 7 out of 23)
Processing correlations from cors8.cors.bin (File 8 out of 23)
Processing correlations from cors9.cors.bin (File 9 out of 23)
Processing correlations from cors10.cors.bin (File 10 out of 23)
Processing correlations from cors11.cors.bin (File 11 out of 23)
Processing correlations from cors12.cors.bin (File 12 out of 23)
Processing correlations from cors13.cors.bin (File 13 out of 23)
Processing correlations from cors14.cors.bin (File 14 out of 23)
Processing correlations from cors15.cors.bin (File 15 out of 23)
Processing correlations from cors16.cors.bin (File 16 out of 23)
Processing correlations from cors17.cors.bin (File 17 out of 23)
Processing correlations from cors18.cors.bin (File 18 out of 23)
Processing correlations from cors19.cors.bin (File 19 out of 23)
Processing correlations from cors20.cors.bin (File 20 out of 23)
Processing correlations from cors21.cors.bin (File 21 out of 23)
Processing correlations from cors22.cors.bin (File 22 out of 23)
Processing correlations from cors23.cors.bin (File 23 out of 23)

The joined correlations are saved in files with prefix cors

corrupted size vs. prev_size (program ends)

My recollection is that "corrupted size vs. prev_size" is usually a matter of mismatched library versions, but ldak6 is a static binary.

What other information do I need to provide?

caldodge commented 1 month ago

Here is the script which creates the above output:

for CHR in {1..23}; do echo cors${CHR} >> list.txt; done
~/ldak6.linux \
  --join-cors cors \
  --corslist list.txt \
  --max-threads 6

The final contents of list.txt:
cors1
cors2
cors3
cors4
cors5
cors6
cors7
cors8
cors9
cors10
cors11
cors12
cors13
cors14
cors15
cors16
cors17
cors18
cors19
cors20
cors21
cors22
cors23
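Because the loop in the script appends with `>>`, rerunning it in the same directory would add duplicate entries to list.txt. The sketch below is a precaution rather than a diagnosis of the crash: it rebuilds the list from scratch and checks that each listed prefix has a non-empty `.cors.bin` file (the suffix shown in the log output) before the join is attempted. The function names `build_corslist` and `check_corslist` are hypothetical helpers, not part of LDAK.

```shell
#!/bin/sh
# Hypothetical helpers; build_corslist and check_corslist are not LDAK commands.

# Write the prefixes cors1..cors23 to the given file, replacing any old copy.
build_corslist() {
    out="$1"
    : > "$out"                      # truncate instead of appending
    chr=1
    while [ "$chr" -le 23 ]; do
        echo "cors${chr}" >> "$out"
        chr=$((chr + 1))
    done
}

# Return non-zero if any listed prefix lacks a non-empty <prefix>.cors.bin.
check_corslist() {
    list="$1"
    missing=0
    while read -r prefix || [ -n "$prefix" ]; do
        if [ ! -s "${prefix}.cors.bin" ]; then
            echo "missing or empty: ${prefix}.cors.bin" >&2
            missing=1
        fi
    done < "$list"
    return "$missing"
}
```

One possible use is `build_corslist list.txt && check_corslist list.txt` ahead of the `~/ldak6.linux --join-cors` call, so the join only runs when every input file is present.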

The job requests 300 GB of RAM, and all 48 CPU cores. We see nothing in the system logs to correlate with the program crash.

dougspeed commented 1 month ago

Thanks for the message, and sorry for the error.

How many predictors do you have? Are they in a single file? If relatively few (say fewer than 2M), you can work around the problem by not dividing by chromosome (i.e. use --calc-cors for all predictors at once).

However, I will explore the error to find out the problem. Please note that this function does not use much memory (because it reads and saves on the fly). Note also that it does not benefit from multiple CPUs (using them will not hurt, but on our cluster we pay per CPU).
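As a rough illustration of the resource point above, a Slurm batch script for the join step could request a single CPU and far less memory than 300 GB. This is only a sketch: the job name and the 8G memory figure are assumptions chosen for illustration, not values recommended in this thread, and the ldak6 command is copied from the script posted earlier with --max-threads reduced to 1.

```shell
#!/bin/bash
#SBATCH --job-name=join-cors    # hypothetical job name
#SBATCH --cpus-per-task=1       # the join does not benefit from multiple CPUs
#SBATCH --mem=8G                # assumed value; adjust to what the job actually needs

# Same command as in the original script, limited to one thread.
~/ldak6.linux \
  --join-cors cors \
  --corslist list.txt \
  --max-threads 1
```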