Open onnozweers opened 6 months ago
Apparently our cleaner-hsm was a bit confused. I restarted it, and now at least it seems to log to which node it is sending runs.
27 May 2024 11:31:44 (cleaner-hsm) [] New run...
27 May 2024 11:31:44 (cleaner-hsm) [] sending 1 delete locations for HSM osm to pool otter12_generictape
27 May 2024 11:31:46 (cleaner-hsm) [otter12_generictape PoolRemoveFilesFromHSM] Pool delete responses for HSM osm: 1 success, 0 failures
27 May 2024 11:31:46 (cleaner-hsm) [otter12_generictape PoolRemoveFilesFromHSM] sending 100 delete locations for HSM osm to pool hedgehog8_generictape
27 May 2024 11:32:27 (cleaner-hsm) [hedgehog8_generictape PoolRemoveFilesFromHSM] Pool delete responses for HSM osm: 100 success, 0 failures
27 May 2024 11:32:27 (cleaner-hsm) [hedgehog8_generictape PoolRemoveFilesFromHSM] sending 100 delete locations for HSM osm to pool seacow8_projectminestage
Not sure though why the responses seem to be logged before the sending entries.
About the state of the cleaner: could the following lines (from before the restart) shed any light on it having a huge backlog?
27 May 2024 11:27:00 (cleaner-hsm) [] The number of cached HSM locations is already the maximum permissible size. Not adding further entries.
27 May 2024 11:27:30 (cleaner-hsm) [] HikariPool-2 - Before cleanup stats (total=1, active=0, idle=1, waiting=0)
27 May 2024 11:27:30 (cleaner-hsm) [] HikariPool-2 - After cleanup stats (total=1, active=0, idle=1, waiting=0)
27 May 2024 11:27:30 (cleaner-hsm) [] HikariPool-2 - Fill pool skipped, pool has sufficient level or currently being filled (queueDepth=0).
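For reference, one way to gauge the size of that backlog is to query the Chimera database directly. This is only a minimal sketch, assuming the standard Chimera schema where deleted locations sit in t_locationinfo_trashcan and HSM entries have itype = 0 (worth double-checking against your own schema); the connection details are placeholders:

```python
# Rough check of the cleaner-hsm backlog directly in the Chimera database.
# Assumptions (verify against your schema): deleted locations are kept in
# t_locationinfo_trashcan, HSM/tape entries have itype = 0, and the table
# has an ictime column. Host, database and user are placeholders.
import psycopg2

conn = psycopg2.connect(host="chimera-db.example.org", dbname="chimera", user="dcache")
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT count(*), min(ictime) "
        "FROM t_locationinfo_trashcan WHERE itype = 0"
    )
    count, oldest = cur.fetchone()
    print(f"HSM delete backlog: {count} entries, oldest from {oldest}")
conn.close()
```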
Hi Onno.
Could you please send your entire cleaner-hsm configuration?
I'll look into what can be done in terms of improved info/logging. Cleaner-disk uses the configured threads to clean that many pools in parallel and reports which ones are currently targeted in info, but this does not make sense for cleaner-hsm.
Hi Lea,
Here is our current config. I included the cleaner-disk settings.
[cleanerDomain]
[cleanerDomain/cleaner-disk]
# Number of threads that should work on deletions simultaneously.
# One thread does HSM deletions, the others disk deletions.
# Default: 5
cleaner-disk.limits.threads = 20
# Interval at which the cleaner runs. Default: 120
cleaner-disk.limits.period = 30
[cleanerDomain/cleaner-hsm]
# Each run, the HSM cleaner will send this many deletions
# to a randomly chosen tape pool.
# Default: 500
#cleaner-hsm.limits.batch-size = 2000
# HSM deletions will be regarded as failures if they don't finish in time.
# Default: 120. If the tape system is busy, that's not enough.
cleaner-hsm.service.pool.timeout = 300
We've done considerable tuning over the last few days to try and get rid of the cleaner-hsm backlog. After the restart, the cleaner-hsm reverted to these settings.
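For context, a back-of-the-envelope way to reason about how fast a backlog can drain at all: if each run sends at most one batch, the upper bound is roughly batch-size / period. A quick sketch with purely illustrative numbers (the batch size follows the default mentioned in the comments above; the period and backlog size are placeholders):

```python
# Back-of-the-envelope upper bound on the cleaner-hsm deletion rate,
# assuming at most one batch of `batch_size` deletions per run and one
# run every `period_s` seconds. All numbers below are illustrative.
batch_size = 500        # cleaner-hsm.limits.batch-size (default per the comment above)
period_s = 120          # seconds between runs (assumed)
backlog = 5_000_000     # hypothetical backlog size

per_hour = batch_size * 3600 / period_s
print(f"max ~{per_hour:,.0f} deletions/hour")
print(f"~{backlog / per_hour:.0f} hours to drain a backlog of {backlog:,}")
```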
I'm not sure what you mean with: "but this does not make sense for cleaner-hsm".
I just noticed this log entry:
27 May 2024 11:47:24 (cleaner-hsm) [hedgehog11_tropomirwtape PoolRemoveFilesFromHSM] sending 100 delete locations for HSM osm to pool koala9_tropomirwtape
Why are there two different pool names in this entry? Is one pool sending or forwarding work to another?
Another question. By monitoring the number of 'delete' operations in the pool's info output, I can see that even when the removals have all completed in time, the cleaner still considers them a failure after a timeout. There are a few minutes between the moment the deletes on the pool reach 0/0 (13:28:33) and the moment the cleaner-hsm logs a timeout for the same pool.
27 May 2024 13:27:32 (cleaner-hsm) [] sending 500 delete locations for HSM osm to pool penguin9_projectminestage
27 May 2024 13:27:45 (cleaner-hsm) [] New run...
27 May 2024 13:29:45 (cleaner-hsm) [] New run...
27 May 2024 13:31:45 (cleaner-hsm) [] New run...
27 May 2024 13:32:32 (cleaner-hsm) [] Timeout deleting files on HSM osm attached to penguin9_projectminestage
As you can see from the plateaus in this graph, this seems to happen a lot, and it slows down the processing.
Why does the cleaner get a timeout even when all deletes have completed? Or should I make a separate ticket for this?
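For completeness, the polling of the pool's delete counters can be scripted along these lines. It is only a simplified sketch: it assumes the admin door is reachable over ssh, that the admin shell's \s syntax for sending a command to a cell works as in our version, and that the pool's info output contains the delete counters mentioned above (host, port and pool name are placeholders):

```python
# Simplified sketch of polling a pool's delete counters via the dCache
# admin interface over ssh. Host, port and pool name are placeholders;
# it also assumes the pool's "info" output contains a line mentioning
# the delete queue, which is simply grepped for here.
import subprocess
import time

ADMIN = ["ssh", "-p", "22224", "admin@dcache-admin.example.org"]
POOL = "penguin9_projectminestage"

while True:
    result = subprocess.run(
        ADMIN,
        input=f"\\s {POOL} info\n\\q\n",
        capture_output=True, text=True, timeout=60,
    )
    for line in result.stdout.splitlines():
        if "delete" in line.lower():
            print(time.strftime("%H:%M:%S"), line.strip())
    time.sleep(30)
```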
Dear dCache devs,
I'm currently troubleshooting cleaner-hsm congestion (version 9.2.18). A challenge I'm facing is that I can't see which runs the cleaner has submitted to which pools, and which of these runs it is still expecting a reply from. We have seen that our HSM script did too many retries for removals, and this caused timeouts between the cleaner-hsm and the pools. Since the cleaner will do retries anyway, we have now disabled retries for removals in our HSM script. Still, things are not going as smoothly as I hoped, and it is currently very difficult to see which runs are in flight and on which pools.
It would be very helpful if such information was shown by the info command.

As for logging: with logging set to debug, new jobs are shown as "New run...", without any useful information. Only when a job finishes is there some relevant info logged.
It would be helpful if the "New run" log entry included more details, especially to which node the run is submitted (so that I may monitor it there).
Additionally, there is this setting in the cleaner-hsm: https://github.com/dCache/dcache/blob/0597b1f47a47738a023b5fc81fe2fec684f80926/skel/share/defaults/cleaner-hsm.properties#L64
It's currently not clear (to me at least) what effect this setting has. Is it the maximum number of concurrent runs? It might help if the info command showed what these threads were doing.

If someone were to pick this up, it might be efficient to apply similar changes to the cleaner-disk.

Thanks!