The pushtag command discovers all directories within the subtree, creating a list of these directory identifiers (which are integers). For a large number of subdirectories, this list could be quite large; however, I'm surprised this would cause the JVM to run out of memory.
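For a rough sense of how many identifiers such a list would need to hold, counting the directories under the affected subtree is enough. A minimal sketch, assuming the namespace is NFS-mounted and using /pnfs/example.org/data purely as a placeholder path:
$ find /pnfs/example.org/data -type d | wc -l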
Could you update your dcache.conf to add the line:
dcache.java.options.short-lived.extra = -XX:+HeapDumpOnOutOfMemoryError
and recreate the problem?
This should result in the chimera shell generating a heap-dump when it runs out of memory, which should help identify the culprit.
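As an aside, should the heap dump not materialise, a rough live view of what is filling the heap can be taken while the chimera shell is still running and growing. A sketch, assuming the JDK tools are installed on that host; <chimera-pid> is a placeholder:
$ jmap -histo <chimera-pid> | head -n 20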
From a private chat with Paul:
No .hprof file was created on the system... but there are hs_err_pid.log files in /tmp on it.
I'll re-try with +XX:HeapDumpPath=/tmp
Guess it must be:
dcache.java.options.short-lived.extra = -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp
+XX:HeapDumpPath=/tmp led to:
Error: Could not find or load main class +XX:HeapDumpPath=.tmp
and -XX:+HeapDumpPath=/tmp to:
Unexpected +/- setting in VM option 'HeapDumpPath=/tmp'
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
when starting chimera
The hs_err_pid.log file is significant. This indicates the JVM crashed, rather than this being an out-of-memory problem.
Could you confirm the hs_err*.log file comes from running chimera (e.g., delete the file and recreate the problem)?
If the hs_err.log file is from running chimera, could you post it here?
Will take a while to crash again.
I've sent the resulting files to your private mail (not sure whether these dumps may contain sensitive information).
It's probably really a memory issue... the VM has only 4 GB... OTOH, such a command should then fail gracefully :-)
Cheers, Chris.
Thanks for the details.
I went through the hs_err_pid<PID>.log files and they all seem to be from crashed dCache domains, not from the chimera command.
Here is a summary:
$ awk '/^java_command:/{c=$0} /^time:/{print $0", "c}' *.log
time: Thu Nov 14 18:47:27 2019, java_command: org.dcache.boot.BootLoader start filesystem
time: Thu Nov 14 18:50:44 2019, java_command: org.dcache.boot.BootLoader start filesystem
time: Thu Nov 14 19:03:49 2019, java_command: org.dcache.boot.BootLoader start srm_lcg-lrz-test-dcache
time: Thu Nov 14 19:05:55 2019, java_command: org.dcache.boot.BootLoader start srm_lcg-lrz-test-dcache
time: Thu Nov 14 19:40:15 2019, java_command: org.dcache.boot.BootLoader start srm_lcg-lrz-test-dcache
time: Thu Nov 14 19:49:14 2019, java_command: org.dcache.boot.BootLoader start filesystem
time: Thu Nov 14 20:40:16 2019, java_command: org.dcache.boot.BootLoader start srm_lcg-lrz-test-dcache
time: Thu Nov 14 20:40:30 2019, java_command: org.dcache.boot.BootLoader start srm_lcg-lrz-test-dcache
time: Thu Nov 14 20:40:42 2019, java_command: org.dcache.boot.BootLoader start srm_lcg-lrz-test-dcache
$
The machine has 4 GiB of physical RAM, with no swap space. Both the srm_lcg-lrz-test-dcache and filesystem domains appear to be configured to use 512 MiB (with up to another 512 MiB used as direct memory).
I cannot say what is going wrong with the chimera shell, as there's (apparently) no heap-dump and no hs_err_pid.log file.
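Incidentally, the same short-lived property used above is also where the chimera shell could be handed a larger heap, should that turn out to be what it needs. A sketch only; the 1024m value is an arbitrary example, not a recommendation:
dcache.java.options.short-lived.extra = -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp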
Since memory on this machine seems to be quite tight (and there is no swap space), perhaps the chimera shell was killed by the Linux kernel "oom killer".
Two things:
1. try to free up some memory on this machine.
2. check the Linux kernel log to see whether the "oom killer" was active (see the example below).
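For the second point, something along these lines should show whether the kernel's oom-killer terminated the process (the exact message wording varies between kernel versions):
$ dmesg -T | grep -i -E 'oom-killer|out of memory|killed process'
or, on a systemd machine:
$ journalctl -k | grep -i -E 'oom-killer|out of memory|killed process'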
Cheers, Paul.
Hi Chris,
Any progress on this ticket? I'm waiting on a couple of things from you.
Cheers, Paul.
Hey.
Sorry, I kinda overlooked that. I've since increased the memory on the machine, and it did in fact work then.
So, other than finding a more memory-efficient way of doing pushtag, one can probably close this ticket.
Thanks, Chris.
Thanks for the update Chris.
I think it's fair to conclude that the problem was due to a lack of memory; although (AFAIK) it is unusual for an out-of-memory problem to leave so little information about where all the memory has gone. Without that information, it's very difficult to see what should be fixed.
So, with the lack of anything concrete to work on, and given that the problem has been resolved by simply giving the machine more memory, I'm afraid we'll have to close the ticket here.
Hey.
On 6.0.0 I took a dump of our production dCache's chimera and tried (in a test instance) to use the new pushtag feature to properly inherit the tags in a pre-existing directory structure.
However, at some point this runs out of memory and the process is killed:
Arguably, that test instance (which is only a VM) doesn't have very much memory; however, I'm now a bit concerned about letting this run on the production instance...
Is there any reason why this eats up so much memory?
Cheers, Chris