dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org

chimera pushtag runs out of memory #5163

Closed calestyo closed 4 years ago

calestyo commented 4 years ago

Hey.

On 6.0.0 I took a dump of our production dCache's chimera and tried (in a test instance) to use the new pushtag feature to properly inherit the tags in a pre-existing directory structure.

However, at some point this runs out of memory and the process is killed:

chimera:/# writetag /pnfs/lrz-muenchen.de/data/atlas/dq2/atlasdatadisk WriteToken 11111
chimera:/# pushtag /pnfs/lrz-muenchen.de/data/atlas/dq2/atlasdatadisk WriteToken
Killed

Arguably, that test instance (which is only a VM) doesn't have very much memory; however, I'm now a bit concerned about letting this run on the production instance...

Is there any reason why this eats up so much memory?

Cheers, Chris

paulmillar commented 4 years ago

The pushtag command discovers all directories within the subtree, creating a list of these directory identifiers (which are integers). For a large number of subdirectories, this list could be quite large; however, I'm surprised this would cause the JVM to run out of memory.
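For a rough sense of scale, the size of such an id list can be estimated with simple arithmetic. The sketch below is not taken from dCache: the function name, the directory count, and the figure of 24 bytes per entry (a guess for a boxed 64-bit integer plus its list slot) are all assumptions.

```shell
# Rough estimate of the memory needed to hold one id per directory.
# Both arguments are assumptions; pushtag's real per-entry cost is not known here.
estimate_id_list_mib() {
    dirs=$1             # number of directories in the subtree (hypothetical)
    bytes_per_entry=$2  # assumed cost of one stored id, in bytes
    echo $(( dirs * bytes_per_entry / 1024 / 1024 ))
}

# Example: ten million directories at an assumed 24 bytes each.
estimate_id_list_mib 10000000 24   # prints 228
```

Under these assumptions the list stays in the low hundreds of MiB, so the list alone would not obviously exhaust a typical JVM heap.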

Could you update your dcache.conf to add the line:

dcache.java.options.short-lived.extra = -XX:+HeapDumpOnOutOfMemoryError

and recreate the problem?

This should result in the chimera shell generating a heap dump when it runs out of memory, which should help identify the culprit.

calestyo commented 4 years ago

From a private chat with Paul:

No .hprof file was created on the system... but there are hs_err_pid.log files in /tmp on it.

I'll re-try with +XX:HeapDumpPath=/tmp

calestyo commented 4 years ago

Guess it must be: dcache.java.options.short-lived.extra = -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp

+XX:HeapDumpPath=/tmp led to:

Error: Could not find or load main class +XX:HeapDumpPath=.tmp

and -XX:+HeapDumpPath=/tmp to:

Unexpected +/- setting in VM option 'HeapDumpPath=/tmp'
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

when starting chimera
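
Both failures follow from HotSpot's option syntax: boolean flags are written -XX:+Name or -XX:-Name, while flags that take a value are written -XX:Name=value, always with a leading dash. One way to check a set of options before putting them into dcache.conf is to pass them to java -version, which exits with an error if the JVM rejects them. A minimal sketch, assuming a java binary on the PATH; the function name is made up for illustration:

```shell
# Return success if the JVM starts with the given options.
# Assumes a `java` binary on the PATH; check_jvm_opts is a hypothetical helper.
check_jvm_opts() {
    java "$@" -version >/dev/null 2>&1
}

# Usage:
#   check_jvm_opts -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp   # accepted
#   check_jvm_opts -XX:+HeapDumpPath=/tmp                                   # rejected
```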

paulmillar commented 4 years ago

The hs_err_pid.log file is significant. This indicates the JVM crashed, rather than this being an out-of-memory problem.

Could you confirm the hs_err*.log file comes from running chimera (e.g., delete the file and recreate the problem)?

If the hs_err.log file is from running chimera, could you post it here?
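
One way to tie a crash log or heap dump to a particular run, without deleting older files, is to list only the files written after a marker file was created. A sketch using find; the helper name and the marker path are illustrative, while hs_err_pid*.log and *.hprof match the default HotSpot file names:

```shell
# List JVM crash logs and heap dumps in a directory that are newer than a marker file.
# new_jvm_artifacts is a hypothetical helper, not part of dCache.
new_jvm_artifacts() {
    dir=$1
    marker=$2
    find "$dir" -maxdepth 1 -type f \
        \( -name 'hs_err_pid*.log' -o -name '*.hprof' \) \
        -newer "$marker"
}

# Usage:
#   touch /tmp/before-pushtag      # create the marker, then rerun the failing command
#   new_jvm_artifacts /tmp /tmp/before-pushtag
```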

calestyo commented 4 years ago

Will take a while to crash again.

calestyo commented 4 years ago

I've sent the resulting files to your private mail (not sure whether these dumps may contain sensitive information).

It's probably really a memory issue... the VM has only 4GB... OTOH, such a command should then fail gracefully :-)

Cheers, Chris.

paulmillar commented 4 years ago

Thanks for the details.

I went through the hs_err_pid<PID>.log files and they all seem to be from crashed dCache domains, not from the chimera command.

Here is a summary:

$ awk '/^java_command:/{shift;c=$0}/^time:/{print $0", "c}'  *.log
time: Thu Nov 14 18:47:27 2019, java_command: org.dcache.boot.BootLoader start filesystem
time: Thu Nov 14 18:50:44 2019, java_command: org.dcache.boot.BootLoader start filesystem
time: Thu Nov 14 19:03:49 2019, java_command: org.dcache.boot.BootLoader start srm_lcg-lrz-test-dcache
time: Thu Nov 14 19:05:55 2019, java_command: org.dcache.boot.BootLoader start srm_lcg-lrz-test-dcache
time: Thu Nov 14 19:40:15 2019, java_command: org.dcache.boot.BootLoader start srm_lcg-lrz-test-dcache
time: Thu Nov 14 19:49:14 2019, java_command: org.dcache.boot.BootLoader start filesystem
time: Thu Nov 14 20:40:16 2019, java_command: org.dcache.boot.BootLoader start srm_lcg-lrz-test-dcache
time: Thu Nov 14 20:40:30 2019, java_command: org.dcache.boot.BootLoader start srm_lcg-lrz-test-dcache
time: Thu Nov 14 20:40:42 2019, java_command: org.dcache.boot.BootLoader start srm_lcg-lrz-test-dcache
$ 

The machine has 4 GiB of physical RAM, with no swap space. Both srm_lcg-lrz-test-dcache and filesystem domains appear to be configured to use 512 MiB (with up to another 512 MiB used as direct memory).
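
These figures imply a tight budget. The following back-of-the-envelope calculation uses only the numbers quoted above (two domains, 512 MiB heap and up to 512 MiB direct memory each, 4 GiB of RAM); the variable names are illustrative, and JVM overhead beyond heap and direct memory is ignored:

```shell
# Worst-case memory claimed by the two dCache domains, in MiB.
total_mib=4096      # 4 GiB of physical RAM, no swap
domains=2           # srm_lcg-lrz-test-dcache and filesystem
heap_mib=512        # configured heap per domain
direct_mib=512      # maximum direct memory per domain

used_mib=$(( domains * (heap_mib + direct_mib) ))
left_mib=$(( total_mib - used_mib ))
echo "domains: up to ${used_mib} MiB, remaining: ${left_mib} MiB"
```

The remaining 2048 MiB has to cover the operating system, any other services on the VM, and the chimera shell's own JVM.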

I cannot say what is going wrong with the chimera shell, as there's (apparently) no heap-dump and no hs_err_pid.log file.

Since memory on this machine seems to be quite tight (and there is no swap space), perhaps the chimera shell was killed by the Linux kernel "oom killer".

Two things:

  1. try to free up some memory on this machine.

  2. check the Linux kernel log to see if the "oom killer" was active.
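
For the second point, the kernel log can be searched for the messages the OOM killer emits. A sketch, assuming a Linux system where dmesg or journalctl -k is readable by the current user; the helper name and the exact patterns are illustrative, since the wording of these messages varies between kernel versions:

```shell
# Filter kernel log lines on stdin for typical OOM-killer messages.
# oom_kills is a hypothetical helper; the patterns are a best-effort guess.
oom_kills() {
    grep -Ei 'out of memory|oom-killer|oom_reaper'
}

# Usage:
#   dmesg -T | oom_kills
#   journalctl -k | oom_kills
#   free -m    # for the first point: shows how much memory is currently available
```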

Cheers, Paul.

paulmillar commented 4 years ago

Hi Chris,

Any progress on this ticket? I'm waiting on a couple of things from you.

Cheers, Paul.

calestyo commented 4 years ago

Hey.

Sorry, I kinda overlooked that. I've since had the memory on the machine increased, and it did in fact work then.

So other than finding a more memory-efficient way of doing pushtag, one can probably close this ticket.

Thanks, Chris.

paulmillar commented 4 years ago

Thanks for the update Chris.

I think it's fair to conclude that the problem was due to a lack of memory; although (AFAIK) it is unusual for an out-of-memory problem to leave so little information about where all the memory has gone. Without that information, it's very difficult to see what should be fixed.

So, with the lack of anything concrete to work on, and given that the problem has been resolved by simply giving the machine more memory, I'm afraid we'll have to close the ticket here.