dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods.
https://dcache.org

Pools dying after "Used space is less than the sum of precious and removable space" #5162

Open onnozweers opened 4 years ago

onnozweers commented 4 years ago

Dear dCache devs,

I cleaned up lots of files from a VO, and then 4 pools in their poolgroup died. This is in the log of one of them:

12 Nov 2019 03:32:13 () [] Used space is less than the sum of precious and removable space (this may be a temporary problem - if it persists then please report it to support@dcache.org).
12 Nov 2019 03:33:13 () [] Used space is less than the sum of precious and removable space (this may be a temporary problem - if it persists then please report it to support@dcache.org).
[...repeated...]
12 Nov 2019 17:33:13 () [] Used space is less than the sum of precious and removable space (this may be a temporary problem - if it persists then please report it to support@dcache.org).
12 Nov 2019 17:34:13 () [] Used space is less than the sum of precious and removable space (this may be a temporary problem - if it persists then please report it to support@dcache.org).
12 Nov 2019 17:34:33 (guppy1_ildgstage) [cleaner PoolRemoveFiles] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server,dead): Pool restart required: Internal repository error
12 Nov 2019 17:34:33 (guppy1_ildgstage) [cleaner PoolRemoveFiles] Pool: guppy1_ildgstage, fault occurred in repository: Internal repository error. Pool restart required: , cause: java.lang.IllegalArgumentException: Cannot set used space to a negative value.

Any ideas what could cause this?

alrossi commented 4 years ago

Not sure. A momentarily negative value is possible because the check being done is not atomic. However, it looks like the error propagates to the repository and causes a crash, so it may not be ephemeral.

I will take a closer look at the code; in the meantime, can I ask you: have you modified the reclaim property on the sweeper (i.e., > 0)?

e.g.,

pool.limits.sweeper-margin=0.30

?

DmitryLitvintsev commented 4 years ago

First question: what version of dCache is it? And you performed a massive removal. Also, if possible, can you capture

info -a
rep ls

if this happens again?
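
(For illustration, one way to capture both outputs from the dCache admin shell might be the following; the admin host and port are assumptions, and the pool name is taken from the log above:)

ssh -p 22224 admin@dcache-admin-host   # admin shell; host and port are placeholders
\c guppy1_ildgstage                    # connect to the affected pool cell
info -a                                # full pool info, including space accounting
rep ls                                 # list all replicas the repository knows about
\q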

onnozweers commented 4 years ago

We have no specific sweeper values like pool.limits.sweeper-margin.

dCache 5.2.4, but pools have been dying from time to time for a while now. We even have a cron job that resurrects dead pools.
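
(Roughly, such a resurrection job could look like the sketch below; the domain name and log path are placeholders, and the real job may differ:)

#!/bin/bash
# Hypothetical sketch: if the pool has logged a fatal repository fault,
# restart its domain. DOMAIN and LOG are placeholders for the real values.
DOMAIN="poolDomain"
LOG="/var/log/dcache/${DOMAIN}.log"

if tail -n 200 "$LOG" | grep -q "Pool restart required"; then
    logger "dCache pool in $DOMAIN appears dead, restarting"
    dcache restart "$DOMAIN"
fi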

The pool I checked is empty now. I found a pool in another poolgroup that had the same warning in its log: "Used space is less than the sum of precious and removable space (this may be a temporary problem - if it persists then please report it to support@dcache.org)." But this pool crashed around noon, was restarted by our cron job and now it doesn't generate that warning. Still, here is the info for that pool:

--- csm (Checksum module) ---
          Checksum type : adler32

--- default-transfer-service (Generic transfer service) ---

--- flush (Controller for centralising flushing) ---
   Flush interval                : 60000 ms
   Maximum classes flushing      : 1000
   Minimum flush delay on error  : 60000 ms
   Next flush                    : Tue Nov 12 22:39:44 CET 2019

--- jtm (Job timeout manager) ---
  gridftp_queue (lastAccess=3600;total=14400)
  regular (lastAccess=72000;total=604800)
  p2p (lastAccess=0;total=0)

--- migration (Replica migration module client) ---

--- migration-server (Replica migration module backend) ---

--- nfs-transfer-service (NFSv4.1 transfer service) ---

--- pool (Main pool component) ---
Base directory    : /space/projectmine/stage/pool
Version           : 5.2.4(5.2.4) (Sub=4)
Report remove     : ON
Pool Mode         : enabled
Clean prec. files : ON
Hsm Load Suppr.   : OFF
Ping Heartbeat    : 30 seconds
Breakeven         : 0.7
LargeFileStore    : NONE
DuplicateRequests : NONE
P2P File Mode     : CACHED
Mover Queue (gridftp_queue) 1(12)/0
Mover Queue (regular) 0(100)/0

--- post-transfer-service (Post processing execution service) ---

--- pp (Pool to pool transfer manager) ---

--- queue (HSM flush queue manager) ---
   Classes  : 0
   Requests : 0

--- remote-gsiftp-transfer-service ---

--- remote-http-transfer-service ---

--- rep (Repository manager) ---
State : OPEN
Files : 1852
Disk space
    Total    : 5069167526388
    Used     : 5044314230126    [0.9950971641531624]
    Free     : 24853296262    Gap : 118111600640
    Precious : 0    [0.0]
    Removable: 4307444170540    [0.8497340338659589]
File system
    Size : 84006349570048
    Free : 13732391665664    [0.1634684965594578]
Limits for maximum disk space
    File system          : 18776705895790
    Statically configured: 4906003988480
    Runtime configured   : 4906003988480

--- storagehandler (Nearline storage manager) ---
 Restore Timeout  : 604800 seconds
   Store Timeout  : 14400 seconds
  Remove Timeout  : 14400 seconds
  Job Queues (active/queued)
    to store   0/0
    from store 2/0
    delete     0/0

I have to add that sometimes a stage pool tries to kill a restore that has timed out, but the kill usually fails (because of complications in our HSM script that we still need to fix). Our HSM may then proceed to stage a file, which is then not properly registered. That might cause the pool directory to contain more data than is registered. I'm not sure whether this is related.
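
(As an aside, one quick way to check for such unregistered replicas might be to compare the repository's view with the data directory; this is only a sketch, assuming the rep ls output has been saved to rep-ls.txt and that the replicas live under the base directory's data/ subdirectory:)

awk '{print $1}' rep-ls.txt | sort > /tmp/registered        # PNFS IDs known to the repository
ls /space/projectmine/stage/pool/data | sort > /tmp/ondisk  # replica files actually on disk (path assumed)
comm -13 /tmp/registered /tmp/ondisk                        # on disk but not registered
comm -23 /tmp/registered /tmp/ondisk                        # registered but missing on disk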

onnozweers commented 4 years ago

The 4 pools I mentioned were completely drained; they don't have any files left (which is what I expected after my cleanup).

DmitryLitvintsev commented 4 years ago

If possible, please capture this info while the pool is complaining about negative space. If you sum up the sizes of the 1852 files, will it match the reported values (used, precious and removable)?

Are the numbers seen in the above output

   Used     : 5044314230126    [0.9950971641531624]
   Precious : 0    [0.0]
   Removable: 4307444170540

consistent with the rep ls -s -sum output? Have you been using third-party transfers?

onnozweers commented 4 years ago

I've set up a cron job that warns me whenever a pool logs "Used space is less than the sum of precious and removable space". At the moment there are none.
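
(Such a job can be as simple as a grep over the pool logs; a simplified sketch, with the log path and mail recipient as placeholders:)

# Sketch of a crontab entry; log path and recipient are placeholders
*/15 * * * * grep -l 'Used space is less than the sum of precious and removable space' /var/log/dcache/*Domain.log >/dev/null && echo 'check pool space accounting' | mail -s 'dCache pool space warning' dcache-admins@example.org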

For the data I cleaned up (the 4 pools), I don't know if that has been written with 3rd party transfers. Some of this data was in a generated dir, so I guess it comes from an LFC. For the 5th pool I know that some data has been uploaded with 3rd party transfers.

For the fifth pool, here is a fresh info -a & rep ls -s -sum (though the pool is not complaining about used space at the moment).

--- rep (Repository manager) ---
State : OPEN
Files : 1851
Disk space
    Total    : 5069167526388
    Used     : 5041554143855    [0.9945526790366946]
    Free     : 27613382533    Gap : 118111600640
    Precious : 0    [0.0]
    Removable: 4212548234191    [0.8310138128720757]
File system
    Size : 84006349570048
    Free : 13584122630144    [0.1617035224083507]
Limits for maximum disk space
    File system          : 18625676773999
    Statically configured: 4906003988480
    Runtime configured   : 4906003988480
total                     5069167526388  27613382533  4212548234191         0           0         0           0         0
projectmine:tape@osm      5041554143855      1851           0         0  829005909664        29  4212548234191      1822

onnozweers commented 4 years ago

Bingo.

--- rep (Repository manager) ---
State : OPEN
Files : 869
Disk space
    Total    : 10925888870810
    Used     : 10179842201600    [0.9317175309001023]
    Free     : 746046669210    Gap : 2199023255552
    Precious : 0    [0.0]
    Removable: 10271700326400    [0.940124913208869]
File system
    Size : 219444922023936
    Free : 88193080348672    [0.40189164340313294]
Limits for maximum disk space
    File system          : 98372922550272
    Statically configured: 10925888870810
    Runtime configured   : 10925888870810
total                     10925888870810  746046669210  10271700326400         0           0         0           0         0
lofops:tape@osm           11115629404160       869           0         0  843929077760       121  10271700326400       748

DmitryLitvintsev commented 4 years ago

Onno,

Comments? So indeed used - (precious + removable) < 0:

>>> 10179842201600 - 10271700326400                      
-91858124800

and the rep ls output matches.

What happens if you restart the pool? I assume the repository re-syncs itself and the numbers get corrected? Could you do it to confirm?

It looks like there is some issue in the way used space is calculated. Also, it did not occur to me earlier: if you go to the pool data area and sum up all replica sizes, will it match 10271700326400, and are there exactly 869 files?
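
(For example, something like the following could sum the replica sizes in the data directory; the path is a placeholder for the pool's actual data directory:)

# Count files and sum their sizes in bytes; the path is a placeholder
find /path/to/pool/data -type f -printf '%s\n' | awk '{sum += $1; n++} END {print n, "files,", sum, "bytes"}'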

I have never seen this error before. As far as trying to reproduce this goes: all one needs to do is remove a bunch of replicas from the pool, is that correct?

Dmitry

paulmillar commented 4 years ago

I have observed this failure on prometheus. I've copied the domain log file leading up to the problem below:

13 May 2020 20:07:42 (pool2) [] Exception caught on channel 92b6479f: org.dcache.xrootd.core.XrootdException: Authentication to p06636710r84969.cern.ch:1095 failed; all protocols have been tried..
13 May 2020 20:07:42 (pool2) [] Exception caught on channel 92b6479f: org.dcache.xrootd.core.XrootdException: Invalid request; user not authenticated^@.
13 May 2020 20:07:43 (pool2) [door:xrootd-gsi@dCacheDomain:AAWli3S+Fgg xrootd-gsi PoolAcceptFile 000049A14E8699694D958EF7BB785530EF59] Transfer failed in post-processing: File size mismatch (expected=1000000000, actual=0)
13 May 2020 20:07:43 (pool2) [] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.nio.channels.ClosedChannelException: null
        at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:715)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:762)
[...]
        at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1089)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at org.dcache.util.CDCThreadFactory.lambda$newThread$0(CDCThreadFactory.java:50)
        at java.base/java.lang.Thread.run(Thread.java:834)
13 May 2020 20:13:33 (pool2) [door:webdav-secure-grid@dCacheDomain:AAWli4m09AA RemoteTransferManager PoolDeliverFile 0000B8C1B06F8D124909BB7027622467C6BF] Transfer failed: rejected PUT: 400 Bad Request
13 May 2020 20:13:43 (pool2) [] Exception caught on channel c0dd5571: org.dcache.xrootd.core.XrootdException: Authentication to lxfsre09a02.cern.ch:1095 failed; all protocols have been tried..
13 May 2020 20:13:43 (pool2) [] Exception caught on channel c0dd5571: org.dcache.xrootd.core.XrootdException: Invalid request; user not authenticated^@.
13 May 2020 20:13:43 (pool1) [] Exception caught on channel 6793c0d3: org.dcache.xrootd.core.XrootdException: Authentication required^@.
13 May 2020 20:13:43 (pool2) [door:xrootd-gsi@dCacheDomain:AAWli4pCOBA xrootd-gsi PoolAcceptFile 000000D91FC66BCD48C18BF6973CFCC63E86] Transfer failed in post-processing: File size mismatch (expected=1000000000, actual=0)
13 May 2020 20:13:44 (pool1) [door:xrootd-gsi@dCacheDomain:AAWli4pDhBg xrootd-gsi PoolAcceptFile 0000E6C208C44F6D44C283F2E983135326CF] Transfer failed in post-processing: File size mismatch (expected=1000000000, actual=0)
13 May 2020 20:14:22 (pool2) [door:webdav-secure-grid@dCacheDomain:AAWli4yrf4A RemoteTransferManager PoolAcceptFile 0000166BF880780D44F3B611D74F2BBC02FD] Transfer failed: rejected GET: 403 Forbidden
13 May 2020 20:20:12 (pool1) [door:webdav-secure-grid@dCacheDomain:AAWli6F2/AA RemoteTransferManager PoolAcceptFile 00006BB2AC6A2FB544B6A986A0EB6F8DFFF9] Transfer failed: The peer's certificate with subject's DN CN=xrootd-local.unl.edu,OU=Holland Computing Center,O=University of Nebraska-Lincoln,L=Lincoln,ST=NE,C=US,DC=incommon,DC=org was rejected. The peer's certificate status is: FAILED The following validation errors were found:;error at position 2 in chain, problematic certificate subject: CN=COMODO RSA Certification Authority,O=COMODO CA Limited,L=Salford,ST=Greater Manchester,C=GB (category: NAMESPACE): Namespace definition for the certificate issuer (CN=AddTrust External CA Root,OU=AddTrust External TTP Network,O=AddTrust AB,C=SE) is not defined, and namespaces are configured to be required.
13 May 2020 20:21:54 (pool1) [] Exception caught on channel d7e9e089: org.dcache.xrootd.core.XrootdException: Unable to open /dpm/cern.ch/home/dteam/domatests/root/domatest/file9_bcaef027-d26d-44f7-a0a7-e05a32bb1753; Permission denied^@.
13 May 2020 20:22:29 (pool2) [] Exception caught on channel 8f10417f: org.dcache.xrootd.core.XrootdException: Authentication required^@.
13 May 2020 20:22:29 (pool2) [door:xrootd-gsi@dCacheDomain:AAWli6mVvCg xrootd-gsi PoolAcceptFile 00007BE3CFDC7B674152964D2DC32D243566] Transfer failed in post-processing: File size mismatch (expected=1000000000, actual=0)
13 May 2020 20:23:12 (pool2) [cleaner PoolRemoveFiles] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server,dead): Pool restart required: Internal repository error
13 May 2020 20:23:12 (pool2) [cleaner PoolRemoveFiles] Pool: pool2, fault occurred in repository: Internal repository error. Pool restart required: , cause: java.lang.IllegalArgumentException: Cannot set used space to a negative value.
13 May 2020 20:23:12 (pool2) [cleaner PoolRemoveFiles] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server,dead): Pool restart required: Internal repository error
13 May 2020 20:23:12 (pool2) [cleaner PoolRemoveFiles] Pool: pool2, fault occurred in repository: Internal repository error. Pool restart required: , cause: java.lang.IllegalArgumentException: Cannot set used space to a negative value.
13 May 2020 20:23:12 (pool2) [cleaner PoolRemoveFiles] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server,dead): Pool restart required: Internal repository error
13 May 2020 20:23:12 (pool2) [cleaner PoolRemoveFiles] Pool: pool2, fault occurred in repository: Internal repository error. Pool restart required: , cause: java.lang.IllegalArgumentException: Cannot set used space to a negative value.
13 May 2020 20:23:40 (pool2) [] Pool mode prevents flushing to nearline storage.
13 May 2020 20:24:40 (pool2) [] Pool mode prevents flushing to nearline storage.

Worryingly, the problem quickly reappeared after prometheus was reinstalled today. Each day at 06:00, all traces of dCache are removed from prometheus, both on the machine's filesystem and in the database; after this, dCache is installed fresh. Despite this, the problem reappeared in less than three minutes!

2020-05-14 06:00:34 Process terminated

2020-05-14 06:02:15 Launching /bin/java -server -Xmx512m -XX:MaxDirectMemorySize=512m -Dsun.net.inetaddr.ttl=1800 -Dorg.globus.tcp.port.range=20000,25000 -Dorg.dcache.dcap.port=0 -Dorg.dcache.ftp.log-aborted-transfers=true -Dorg.dcache.net.tcp.portrange=33115:33145 -Djava.security.krb5.realm= -Djava.security.krb5.kdc= -Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=/etc/dcache/gss.conf -Dcontent.types.user.table=/etc/dcache/content-types.properties -Dzookeeper.sasl.client=false -Dcurator-dont-log-connection-problems=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/dcache/pools-oom.hprof -XX:+UseCompressedOops -javaagent:/usr/share/dcache/classes/aspectjweaver-1.9.2.jar -Djava.awt.headless=true -DwantLog4jSetup=n -Ddcache.home=/usr/share/dcache -Ddcache.paths.defaults=/usr/share/dcache/defaults org.dcache.boot.BootLoader start pools
INFO  - prometheus.conf:131: Property prometheus.java.options.extra is not a standard property
14 May 2020 06:02:31 (System) [] Client environment:zookeeper.version=3.5.7-f0fdd52973d373ffd9c86b81d99842dc2c7f660e, built on 02/10/2020 11:30 GMT
14 May 2020 06:02:31 (System) [] Client environment:host.name=prometheus.desy.de
[...]
14 May 2020 06:02:57 (pool2) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Awaiting initialization
14 May 2020 06:02:58 (pool2) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
14 May 2020 06:02:58 (pool2) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/scratch/pools/pool2/data;meta=/scratch/pools/pool2/meta]]].
14 May 2020 06:02:58 (pool2) [] Pool mode changed to enabled
14 May 2020 06:05:21 (pool1) [] Exception caught on channel a18ee027: org.dcache.xrootd.core.XrootdException: Authentication to lxfsre05a03.cern.ch:1095 failed; all protocols have been tried..
14 May 2020 06:05:21 (pool1) [] Exception caught on channel a18ee027: org.dcache.xrootd.core.XrootdException: Invalid request; user not authenticated^@.
14 May 2020 06:05:21 (pool1) [door:xrootd-gsi@dCacheDomain:AAWlk84JDwg xrootd-gsi PoolAcceptFile 0000EC04D328D318401E872A89F9692644A1] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server,dead): Pool restart required: Internal repository error
14 May 2020 06:05:21 (pool1) [door:xrootd-gsi@dCacheDomain:AAWlk84JDwg xrootd-gsi PoolAcceptFile 0000EC04D328D318401E872A89F9692644A1] Pool: pool1, fault occurred in repository: Internal repository error. Pool restart required: , cause: java.lang.IllegalArgumentException: Cannot set used space to a negative value.

This suggests there is some client activity that is triggering this problem.

Here's the billing log leading up to the problem:

05.14 06:05:17 [door:webdav-secure-web@dCacheDomain:request] ["uploader":0:0:2001:638:700:1005:0:0:1:95] [00002BFA2541D8494878A7CD90DEFC6CA9E2,385875] [/Music/Atomic cat/Trance 'n Dance/[cover] Atomic cat - Trance 'n Dance.jpg] test:default@osm 83 0 {0:""}
05.14 06:05:18 [pool:pool2:transfer] [00007D7AF82F95C441E48ACB64AF4F8EB608,6464277] [/Music/Atomic cat/Trance 'n Dance/04 - I want to fly (Dance-Trance).mp3] test:default@osm 6464277 157 true {Http-1.1:131.169.5.149:0:webdav-secure-web:dCacheDomain:/Music/Atomic cat/Trance 'n Dance/04 - I want to fly (Dance-Trance).mp3} [door:webdav-secure-web@dCacheDomain:AAWlk84BDEA:1589429118529000] {0:""}
05.14 06:05:18 [door:webdav-secure-web@dCacheDomain:request] ["uploader":0:0:2001:638:700:1005:0:0:1:95] [00007D7AF82F95C441E48ACB64AF4F8EB608,6464277] [/Music/Atomic cat/Trance 'n Dance/04 - I want to fly (Dance-Trance).mp3] test:default@osm 348 0 {0:""}
05.14 06:05:21 [pool:pool1@pools:remove] [0000EC04D328D318401E872A89F9692644A1,1000000000] [Unknown] test:default@osm {0:"Transfer failed and replica is empty"}
05.14 06:05:21 [pool:pool1:transfer] [0000EC04D328D318401E872A89F9692644A1,1000000000] [/VOs/dteam/tpc-test/root/domatest/file4_ca5c2be1-2c57-4e11-91b7-b6f7be6a2f1a] test:default@osm 0 1521 true {Xrootd-5.0:137.138.62.124:1094} [door:xrootd-gsi@dCacheDomain:AAWlk84JDwg:1589429119609000] {10004:"Post-processing failed: File size mismatch (expected=1000000000, actual=0)"}
05.14 06:05:21 [door:xrootd-gsi@dCacheDomain:request] ["/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=tbeerman/CN=722011/CN=Thomas Beermann":1001:1001:2001:1458:201:e4:0:0:100:578] [0000EC04D328D318401E872A89F9692644A1,1000000000] [/VOs/dteam/tpc-test/root/domatest/file4_ca5c2be1-2c57-4e11-91b7-b6f7be6a2f1a] test:default@osm 2049 0 {10004:"Post-processing failed: File size mismatch (expected=1000000000, actual=0)"}
05.14 06:05:21 [door:xrootd-gsi@dCacheDomain:remove] ["/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=tbeerman/CN=722011/CN=Thomas Beermann":1001:1001:unknown] [0000EC04D328D318401E872A89F9692644A1,0] [/VOs/dteam/tpc-test/root/domatest/file4_ca5c2be1-2c57-4e11-91b7-b6f7be6a2f1a] <Unknown> 0 0 {0:""}
05.14 06:05:25 [pool:pool2:transfer] [0000BD349EE80B3345C58F1BE3D3117CFAC3,270024111] [/Video/Talks/2016-04-05 - ITT 2016 - Kevlin Henney - Seven Ineffective Coding Habits of Many Programmers.mp4] test:default@osm 270024111 5290 true {Http-1.1:131.169.5.149:0:webdav-secure-web:dCacheDomain:/Video/Talks/2016-04-05 - ITT 2016 - Kevlin Henney - Seven Ineffective Coding Habits of Many Programmers.mp4} [door:webdav-secure-web@dCacheDomain:AAWlk84Ph0g:1589429119575000] {0:""}

I think this xrootd transfer is perhaps significant:

05.14 06:05:21 [pool:pool1@pools:remove] [0000EC04D328D318401E872A89F9692644A1,1000000000] [Unknown] test:default@osm {0:"Transfer failed and replica is empty"}
05.14 06:05:21 [pool:pool1:transfer] [0000EC04D328D318401E872A89F9692644A1,1000000000] [/VOs/dteam/tpc-test/root/domatest/file4_ca5c2be1-2c57-4e11-91b7-b6f7be6a2f1a] test:default@osm 0 1521 true {Xrootd-5.0:137.138.62.124:1094} [door:xrootd-gsi@dCacheDomain:AAWlk84JDwg:1589429119609000] {10004:"Post-processing failed: File size mismatch (expected=1000000000, actual=0)"}
05.14 06:05:21 [door:xrootd-gsi@dCacheDomain:request] ["/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=tbeerman/CN=722011/CN=Thomas Beermann":1001:1001:2001:1458:201:e4:0:0:100:578] [0000EC04D328D318401E872A89F9692644A1,1000000000] [/VOs/dteam/tpc-test/root/domatest/file4_ca5c2be1-2c57-4e11-91b7-b6f7be6a2f1a] test:default@osm 2049 0 {10004:"Post-processing failed: File size mismatch (expected=1000000000, actual=0)"}
05.14 06:05:21 [door:xrootd-gsi@dCacheDomain:remove] ["/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=tbeerman/CN=722011/CN=Thomas Beermann":1001:1001:unknown] [0000EC04D328D318401E872A89F9692644A1,0] [/VOs/dteam/tpc-test/root/domatest/file4_ca5c2be1-2c57-4e11-91b7-b6f7be6a2f1a] <Unknown> 0 0 {0:""}
kofemann commented 4 years ago

Looks like the problem on prometheus is indeed a regression introduced with https://github.com/dCache/dcache/commit/45e4221b12d47ceab6f6bff8b51b9adc9f69945c. However, it doesn't explain the original issue.

kofemann commented 4 years ago

@onnozweers Do you know which metadata repository was configured on your pools?

onnozweers commented 4 years ago

@onnozweers Do you know which metadata repository was configured on your pools?

pool.plugins.meta=org.dcache.pool.repository.meta.db.BerkeleyDBMetaDataRepository (in dcache.conf and in the layout files)

kofemann commented 4 years ago

Thanks.