dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org

cleaner-hsm: `info` command needs more info, and better logging #7581

Open onnozweers opened 1 month ago

onnozweers commented 1 month ago

Dear dCache devs,

I'm currently troubleshooting cleaner-hsm congestion (version 9.2.18). A challenge I'm facing is that I can't see which runs the cleaner has submitted to which pools, and which of these runs it is still expecting a reply from.

We have seen that our HSM script did too many retries for removals, which caused timeouts between cleaner-hsm and the pools. Since the cleaner does retries anyway, we have now disabled retries for removals in our HSM script.
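
For reference, the remove handler in our script now works roughly like the sketch below. This is a simplified illustration of the no-retry behaviour rather than our production script: delete_from_tape stands in for our site-specific removal command, and I'm assuming the usual HSM script convention where the pool calls the script with the operation as the first argument, passes the target as a -uri= option, and treats exit code 0 as success.

#!/bin/bash
# Simplified no-retry remove handler (sketch only).
op=$1
shift

case "$op" in
  remove)
    # Pick up the -uri=... option that the pool passes along.
    uri=""
    for arg in "$@"; do
      case "$arg" in
        -uri=*) uri="${arg#-uri=}" ;;
      esac
    done
    # Exactly one attempt; on failure we simply report it and let
    # cleaner-hsm retry the location in a later run.
    delete_from_tape "$uri"   # site-specific helper, named here for illustration only
    exit $?
    ;;
  put|get)
    # ... store/restore handling unchanged ...
    ;;
esac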

Still, things are not going as smoothly as I hoped. It is currently very difficult to see what the cleaner is actually doing, and it would be very helpful if that information (which runs went to which pools, and which replies are still outstanding) were shown by the `info` command.

As for logging: with logging set to debug, new jobs are logged only as "New run...", without any useful information. Only when a job finishes is there some relevant info:

24 May 2024 14:53:21 (cleaner-hsm) [marten10_lofopsstage PoolRemoveFilesFromHSM] Received a remove reply from pool marten10_lofopsstage, which is no longer waited for. The cleaner pool timeout might be too small.
24 May 2024 14:53:21 (cleaner-hsm) [marten10_lofopsstage PoolRemoveFilesFromHSM] Pool delete responses for HSM osm: 834 success, 1166 failures

It would be helpful if the "New run" log entry included more details, especially to which node the run is submitted (so that I may monitor it there).

Additionally, there is this setting in cleaner-hsm: https://github.com/dCache/dcache/blob/0597b1f47a47738a023b5fc81fe2fec684f80926/skel/share/defaults/cleaner-hsm.properties#L64

It is currently not clear (to me at least) what effect this setting has. Is it the maximum number of concurrent runs? It might help if the `info` command showed what these threads are doing.

If someone were to pick this up, it might be efficient to apply similar changes to the cleaner-disk.

Thanks!

onnozweers commented 1 month ago

Apparently our cleaner-hsm was a bit confused. I restarted it, and now at least it seems to log to which node it is sending runs.

27 May 2024 11:31:44 (cleaner-hsm) [] New run...
27 May 2024 11:31:44 (cleaner-hsm) [] sending 1 delete locations for HSM osm to pool otter12_generictape
27 May 2024 11:31:46 (cleaner-hsm) [otter12_generictape PoolRemoveFilesFromHSM] Pool delete responses for HSM osm: 1 success, 0 failures
27 May 2024 11:31:46 (cleaner-hsm) [otter12_generictape PoolRemoveFilesFromHSM] sending 100 delete locations for HSM osm to pool hedgehog8_generictape
27 May 2024 11:32:27 (cleaner-hsm) [hedgehog8_generictape PoolRemoveFilesFromHSM] Pool delete responses for HSM osm: 100 success, 0 failures
27 May 2024 11:32:27 (cleaner-hsm) [hedgehog8_generictape PoolRemoveFilesFromHSM] sending 100 delete locations for HSM osm to pool seacow8_projectminestage

I'm not sure, though, why the responses seem to be logged before the corresponding "sending" entries.

About the state of the cleaner: could the following lines (from before the restart) shed any light on why it had such a huge backlog?

27 May 2024 11:27:00 (cleaner-hsm) [] The number of cached HSM locations is already the maximum permissible size. Not adding further entries.
27 May 2024 11:27:30 (cleaner-hsm) [] HikariPool-2 - Before cleanup stats (total=1, active=0, idle=1, waiting=0)
27 May 2024 11:27:30 (cleaner-hsm) [] HikariPool-2 - After cleanup  stats (total=1, active=0, idle=1, waiting=0)
27 May 2024 11:27:30 (cleaner-hsm) [] HikariPool-2 - Fill pool skipped, pool has sufficient level or currently being filled (queueDepth=0).

lemora commented 1 month ago

Hi Onno. Could you please send your entire cleaner-hsm configuration? I'll look into what can be done in terms of improved info/logging.

Cleaner-disk uses the configured threads to clean that many pools in parallel and reports in `info` which pools are currently being targeted, but this does not make sense for cleaner-hsm.

onnozweers commented 1 month ago

Hi Lea,

Here is our current config. I included the cleaner-disk settings.

[cleanerDomain]
[cleanerDomain/cleaner-disk]
# Number of threads that should work on deletions simultaneously.
# One thread does HSM deletions, the others disk deletions.
# Default: 5
cleaner-disk.limits.threads = 20
# Interval at which the cleaner runs. Default: 120
cleaner-disk.limits.period = 30

[cleanerDomain/cleaner-hsm]
# Each run, the HSM cleaner will send this many deletions
# to a randomly chosen tape pool.
# Default: 500
#cleaner-hsm.limits.batch-size = 2000
# HSM deletions will be regarded as failed if they don't finish in time.
# Default: 120. If the tape system is busy, that's not enough.
cleaner-hsm.service.pool.timeout = 300
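
(With the default batch size of 500 and a 300 second pool timeout, a batch only fits inside the timeout if removals average at most 300 / 500 = 0.6 seconds each, assuming the pool works through a batch more or less sequentially; slow retries in the HSM script therefore add up very quickly.)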

We've done considerable tuning over the last few days to try and get rid of the cleaner-hsm backlog. After the restart, the cleaner-hsm reverted to these settings.
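
(To double-check which values the domain actually picks up after a restart, something like the line below should work; I believe the dcache wrapper script has a "property" subcommand for this, but the exact syntax may differ per version, so treat it as a sketch.)

dcache property cleaner-hsm.service.pool.timeout cleanerDomain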

I'm not sure what you mean by "but this does not make sense for cleaner-hsm".

onnozweers commented 1 month ago

I just noticed this log entry:

27 May 2024 11:47:24 (cleaner-hsm) [hedgehog11_tropomirwtape PoolRemoveFilesFromHSM] sending 100 delete locations for HSM osm to pool koala9_tropomirwtape

Why are there two different pool names in this entry? Is one pool sending or forwarding work to another?

onnozweers commented 1 month ago

Another question: by monitoring the number of 'delete' operations in the pool's `info` output, I can see that even when the removals have all completed in time, the cleaner still counts them as failures after a timeout. There is a gap of a few minutes between the moment the deletes on the pool reach 0/0 (13:28:33) and the moment the cleaner-hsm logs a timeout for the same pool.
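
(In case it is relevant: I'm watching the pool roughly like the loop below, via the admin interface on the standard ssh port 22224. The admin host name and the grep pattern are specific to our setup, so this is only a sketch of the idea.)

# Poll the pool's info output and keep only the delete counters.
while true; do
  date
  ssh -p 22224 admin@dcache-admin.example.org '\s penguin9_projectminestage info' | grep -i delete
  sleep 30
done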

27 May 2024 13:27:32 (cleaner-hsm) [] sending 500 delete locations for HSM osm to pool penguin9_projectminestage
27 May 2024 13:27:45 (cleaner-hsm) [] New run...
27 May 2024 13:29:45 (cleaner-hsm) [] New run...
27 May 2024 13:31:45 (cleaner-hsm) [] New run...
27 May 2024 13:32:32 (cleaner-hsm) [] Timeout deleting files on HSM osm attached to penguin9_projectminestage
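
(If I read the timestamps correctly, the timeout fires exactly our configured cleaner-hsm.service.pool.timeout of 300 seconds after the batch was sent: 13:27:32 + 300 s = 13:32:32, even though the pool had already finished its deletes around 13:28:33.)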

As you can see from the plateaus in this graph, this seems to happen a lot, and it slows down the processing.

[graph showing the plateaus mentioned above]

Why does the cleaner get a timeout even when all deletes have completed? Or should I make a separate ticket for this?