YellProgram / Yell

Yell is a program for diffuse scattering interpretation using the 3D-∆PDF refinement.

Memory allocation #49

Open k-eks opened 6 years ago

k-eks commented 6 years ago

I have now observed several times that Yell's memory consumption is erratic. When I submit Yell calculations to the ETH computing cluster, some of them get killed for exceeding their memory limit. In theory, none of the models I submitted should need more than 16 GB at peak. I have now submitted a model with a theoretical memory consumption of about 8 GB; it was killed again for using too much memory (a reported peak of about 30 GB against 14 GB requested). I resubmitted the same model with a larger memory request (28 GB), and it ran without a glitch and used only about 8 GB. Job reports below.

Rejected run

Job </cluster/home/messmerd/yell> was submitted from host <eu-login-14-ng> by user <messmerd> in cluster <euler> at Thu Oct  4 17:52:39 2018.
Job was executed on host(s) <eu-ms-018-45>, in queue <normal.24h>, as user <messmerd> in cluster <euler> at Thu Oct  4 17:53:10 2018.
</cluster/home/messmerd> was used as the home directory.
</cluster/scratch/messmerd/daniel/model1> was used as the working directory.
Started at Thu Oct  4 17:53:10 2018.
Terminated at Thu Oct  4 17:53:42 2018.
Results reported at Thu Oct  4 17:53:42 2018.

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
/cluster/home/messmerd/yell
------------------------------------------------------------

TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Exited with exit code 130.

Resource usage summary:

   CPU time :                                   24.00 sec.
   Max Memory :                                 30500 MB
   Average Memory :                             1061.00 MB
   Total Requested Memory :                     14000.00 MB
   Delta Memory :                               -16500.00 MB
   Max Swap :                                   -
   Max Processes :                              3
   Max Threads :                                4
   Run time :                                   53 sec.
   Turnaround time :                            63 sec.

Read file <lsf.o74779711> for stdout and stderr output of this job.

Same model, more memory, successful run, less memory consumption than before

Job </cluster/home/messmerd/yell> was submitted from host <eu-login-14-ng> by user <messmerd> in cluster <euler> at Thu Oct  4 17:56:45 2018.
Job was executed on host(s) <eu-ms-006-25>, in queue <normal.24h>, as user <messmerd> in cluster <euler> at Fri Oct  5 06:13:34 2018.
</cluster/home/messmerd> was used as the home directory.
</cluster/scratch/messmerd/daniel/model1> was used as the working directory.
Started at Fri Oct  5 06:13:34 2018.
Terminated at Fri Oct  5 09:29:15 2018.
Results reported at Fri Oct  5 09:29:15 2018.

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
/cluster/home/messmerd/yell
------------------------------------------------------------

Successfully completed.

Resource usage summary:

   CPU time :                                   11734.86 sec.
   Max Memory :                                 8384 MB
   Average Memory :                             7260.04 MB
   Total Requested Memory :                     28000.00 MB
   Delta Memory :                               19616.00 MB
   Max Swap :                                   -
   Max Processes :                              3
   Max Threads :                                4
   Run time :                                   11759 sec.
   Turnaround time :                            55950 sec.

Read file <lsf.o74780342> for stdout and stderr output of this job.
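
For anyone comparing the two reports, here is a minimal Python sketch that parses the memory lines of both resource usage summaries and recomputes the headroom (requested minus peak). The figures are copied from the reports above; the helper names `parse_summary` and `headroom_mb` are hypothetical and are not part of Yell or LSF.

```python
import re

# Memory-related lines copied verbatim from the two job reports above.
REJECTED = """
   CPU time :                                   24.00 sec.
   Max Memory :                                 30500 MB
   Average Memory :                             1061.00 MB
   Total Requested Memory :                     14000.00 MB
   Delta Memory :                               -16500.00 MB
"""

SUCCESSFUL = """
   CPU time :                                   11734.86 sec.
   Max Memory :                                 8384 MB
   Average Memory :                             7260.04 MB
   Total Requested Memory :                     28000.00 MB
   Delta Memory :                               19616.00 MB
"""


def parse_summary(text):
    """Return {field name: value in MB} for every 'name : number MB' line."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?)\s*:\s*(-?\d+(?:\.\d+)?)\s*MB\s*$", line)
        if match:
            fields[match.group(1)] = float(match.group(2))
    return fields


def headroom_mb(fields):
    """Requested memory minus peak memory; negative means the limit was exceeded."""
    return fields["Total Requested Memory"] - fields["Max Memory"]


if __name__ == "__main__":
    for label, text in (("rejected", REJECTED), ("successful", SUCCESSFUL)):
        fields = parse_summary(text)
        print(label, fields["Max Memory"], fields["Total Requested Memory"],
              headroom_mb(fields))
```

The recomputed headroom agrees with the `Delta Memory` line in each report. The reported peak of the killed run (30500 MB, reached within about a minute) is roughly 3.6 times the peak of the successful run (8384 MB) for the same model.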

aglie commented 6 years ago

Oh, great, you pinpointed it.

Could you save this particular model please? I will take a look when I am reconnected with my compilation pipeline.

k-eks commented 6 years ago

That was a close call, I was about to overwrite it for my next batch of calculations :D Here you go!

model.txt

aglie commented 6 years ago

Amazing Gregor, you are a star!

Though this erratic behaviour suggests I have screwed up something quite royally. I will try to find the bug.

k-eks commented 6 years ago

Might be related to #47.

aglie commented 7 months ago

Should be fixed now.