onnozweers opened this issue 4 years ago
Dear dCache devs,
I cleaned up lots of files from a VO, and then 4 pools in their poolgroup died. This is in the log of one of them:
Any ideas what could cause this?
Not sure. A momentary negative value is possible because the check being done is not atomic. However, it looks like the error propagates to the repository and causes a crash, so it may not be ephemeral.
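To illustrate the kind of check-then-act race meant here, a minimal sketch (not dCache source; the class name and numbers are made up) of two removals racing on a shared used-space counter:

// Illustrative only, not dCache code: a non-atomic "read, check, subtract"
// on a shared counter. Two concurrent removals can both pass the check and
// drive the counter negative.
import java.util.concurrent.atomic.AtomicLong;

class SpaceAccountSketch {
    private final AtomicLong used = new AtomicLong(1_000L);

    void free(long size) {
        long current = used.get();                    // 1) read
        if (current - size < 0) {                     // 2) check
            throw new IllegalArgumentException(
                    "Cannot set used space to a negative value");
        }
        used.addAndGet(-size);                        // 3) act: another thread may have
    }                                                 //    freed space in between

    public static void main(String[] args) throws InterruptedException {
        SpaceAccountSketch account = new SpaceAccountSketch();
        Runnable removal = () -> account.free(600L);  // two removals of 600 bytes each
        Thread t1 = new Thread(removal);
        Thread t2 = new Thread(removal);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // With an unlucky interleaving both checks see 1000 and used ends at -200.
        System.out.println("used = " + account.used.get());
    }
}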
I will take a closer look at the code. In the meantime, can I ask: have you modified the reclaim property on the sweeper (i.e., set it > 0)? For example:
pool.limits.sweeper-margin=0.30
First question: what version of dCache is it? And you performed a massive removal. Also, if possible, can you capture the output of info -a and rep ls if this happens again.
We have no specific sweeper values like pool.limits.sweeper-margin.
dCache 5.2.4, but pools have been dying from time to time for a while now. We even have a cron job that resurrects dead pools.
The pool I checked is empty now. I found a pool in another poolgroup that had the same warning in its log: "Used space is less than the sum of precious and removable space (this may be a temporary problem - if it persists then please report it to support@dcache.org)." But this pool crashed around noon, was restarted by our cron job and now it doesn't generate that warning. Still, here is the info for that pool:
--- csm (Checksum module) ---
Checksum type : adler32
--- default-transfer-service (Generic transfer service) ---
--- flush (Controller for centralising flushing) ---
Flush interval : 60000 ms
Maximum classes flushing : 1000
Minimum flush delay on error : 60000 ms
Next flush : Tue Nov 12 22:39:44 CET 2019
--- jtm (Job timeout manager) ---
gridftp_queue (lastAccess=3600;total=14400)
regular (lastAccess=72000;total=604800)
p2p (lastAccess=0;total=0)
--- migration (Replica migration module client) ---
--- migration-server (Replica migration module backend) ---
--- nfs-transfer-service (NFSv4.1 transfer service) ---
--- pool (Main pool component) ---
Base directory : /space/projectmine/stage/pool
Version : 5.2.4(5.2.4) (Sub=4)
Report remove : ON
Pool Mode : enabled
Clean prec. files : ON
Hsm Load Suppr. : OFF
Ping Heartbeat : 30 seconds
Breakeven : 0.7
LargeFileStore : NONE
DuplicateRequests : NONE
P2P File Mode : CACHED
Mover Queue (gridftp_queue) 1(12)/0
Mover Queue (regular) 0(100)/0
--- post-transfer-service (Post processing execution service) ---
--- pp (Pool to pool transfer manager) ---
--- queue (HSM flush queue manager) ---
Classes : 0
Requests : 0
--- remote-gsiftp-transfer-service ---
--- remote-http-transfer-service ---
--- rep (Repository manager) ---
State : OPEN
Files : 1852
Disk space
Total : 5069167526388
Used : 5044314230126 [0.9950971641531624]
Free : 24853296262 Gap : 118111600640
Precious : 0 [0.0]
Removable: 4307444170540 [0.8497340338659589]
File system
Size : 84006349570048
Free : 13732391665664 [0.1634684965594578]
Limits for maximum disk space
File system : 18776705895790
Statically configured: 4906003988480
Runtime configured : 4906003988480
--- storagehandler (Nearline storage manager) ---
Restore Timeout : 604800 seconds
Store Timeout : 14400 seconds
Remove Timeout : 14400 seconds
Job Queues (active/queued)
to store 0/0
from store 2/0
delete 0/0
I have to add that sometimes a stage pool tries to kill a restore that has timed out, but that kill usually fails (because of complications in our HSM script that we still need to fix). Our HSM may then proceed to stage a file, which is then not properly registered. That might cause the pool directory to contain more data than is registered. I'm not sure whether this is related.
The 4 pools I mentioned were completely drained; they don't have any files left (which is what I expected after my cleanup).
If possible, please capture this info while the pool is complaining about negative space. If you sum up the sizes of the 1852 files, will it match the reported used, precious and removable values? Are the numbers seen in the above output,
Used : 5044314230126 [0.9950971641531624]
Precious : 0 [0.0]
Removable: 4307444170540
consistent with the rep ls -s -sum output? Have you been using 3rd-party transfers?
I've set up a cron job that warns me whenever a pool logs "Used space is less than the sum of precious and removable space". At the moment there are none.
For the data I cleaned up (the 4 pools), I don't know whether it was written with 3rd-party transfers. Some of this data was in a generated dir, so I guess it comes from an LFC. For the 5th pool I know that some data has been uploaded with 3rd-party transfers.
For the fifth pool, here is a fresh info -a and rep ls -s -sum (though the pool is not complaining about used space at the moment).
--- rep (Repository manager) ---
State : OPEN
Files : 1851
Disk space
Total : 5069167526388
Used : 5041554143855 [0.9945526790366946]
Free : 27613382533 Gap : 118111600640
Precious : 0 [0.0]
Removable: 4212548234191 [0.8310138128720757]
File system
Size : 84006349570048
Free : 13584122630144 [0.1617035224083507]
Limits for maximum disk space
File system : 18625676773999
Statically configured: 4906003988480
Runtime configured : 4906003988480
total 5069167526388 27613382533 4212548234191 0 0 0 0 0
projectmine:tape@osm 5041554143855 1851 0 0 829005909664 29 4212548234191 1822
Bingo.
--- rep (Repository manager) ---
State : OPEN
Files : 869
Disk space
Total : 10925888870810
Used : 10179842201600 [0.9317175309001023]
Free : 746046669210 Gap : 2199023255552
Precious : 0 [0.0]
Removable: 10271700326400 [0.940124913208869]
File system
Size : 219444922023936
Free : 88193080348672 [0.40189164340313294]
Limits for maximum disk space
File system : 98372922550272
Statically configured: 10925888870810
Runtime configured : 10925888870810
total 10925888870810 746046669210 10271700326400 0 0 0 0 0
lofops:tape@osm 11115629404160 869 0 0 843929077760 121 10271700326400 748
Onno,
Comments? So indeed used - (precious + removable) < 0 :
>>> 10179842201600 - 10271700326400
-91858124800
and the rep ls output matches.
What happens if you restart the pool? I assume the repository re-syncs itself and the numbers get corrected? Can you do that to confirm?
It looks like there is some issue in the way used space is calculated. Also, it did not occur to me earlier: if you were to go to the pool data area and sum up all replica sizes, would it match 10271700326400, and are there exactly 869 files?
I have never seen this error before. As far as trying to reproduce this: all one needs to do is remove a bunch of replicas from the pool, is that correct?
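A quick way to do that on-disk check could be something like the following minimal sketch (the data directory path is a placeholder; use the pool's actual base directory plus /data):

// Illustrative helper, not part of dCache: walk a pool's data directory,
// count regular files and sum their sizes, to compare against the totals
// reported by "rep ls -s -sum" and "info -a".
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;

public class SumReplicas {
    public static void main(String[] args) throws IOException {
        // Placeholder path: pass the real <pool base directory>/data as argument.
        Path data = Paths.get(args.length > 0 ? args[0] : "/pool/data");
        AtomicLong count = new AtomicLong();
        AtomicLong bytes = new AtomicLong();
        try (Stream<Path> stream = Files.walk(data)) {
            stream.filter(Files::isRegularFile).forEach(p -> {
                count.incrementAndGet();
                try {
                    bytes.addAndGet(Files.size(p));
                } catch (IOException e) {
                    System.err.println("Cannot stat " + p + ": " + e.getMessage());
                }
            });
        }
        System.out.println("files = " + count + ", bytes = " + bytes);
    }
}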
Dmitry
I have observed this failure on prometheus. I've copied the domain log leading up to the problem below:
13 May 2020 20:07:42 (pool2) [] Exception caught on channel 92b6479f: org.dcache.xrootd.core.XrootdException: Authentication to p06636710r84969.cern.ch:1095 failed; all protocols have been tried..
13 May 2020 20:07:42 (pool2) [] Exception caught on channel 92b6479f: org.dcache.xrootd.core.XrootdException: Invalid request; user not authenticated^@.
13 May 2020 20:07:43 (pool2) [door:xrootd-gsi@dCacheDomain:AAWli3S+Fgg xrootd-gsi PoolAcceptFile 000049A14E8699694D958EF7BB785530EF59] Transfer failed in post-processing: File size mismatch (expected=1000000000, actual=0)
13 May 2020 20:07:43 (pool2) [] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:715)
at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:762)
[...]
at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1089)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at org.dcache.util.CDCThreadFactory.lambda$newThread$0(CDCThreadFactory.java:50)
at java.base/java.lang.Thread.run(Thread.java:834)
13 May 2020 20:13:33 (pool2) [door:webdav-secure-grid@dCacheDomain:AAWli4m09AA RemoteTransferManager PoolDeliverFile 0000B8C1B06F8D124909BB7027622467C6BF] Transfer failed: rejected PUT: 400 Bad Request
13 May 2020 20:13:43 (pool2) [] Exception caught on channel c0dd5571: org.dcache.xrootd.core.XrootdException: Authentication to lxfsre09a02.cern.ch:1095 failed; all protocols have been tried..
13 May 2020 20:13:43 (pool2) [] Exception caught on channel c0dd5571: org.dcache.xrootd.core.XrootdException: Invalid request; user not authenticated^@.
13 May 2020 20:13:43 (pool1) [] Exception caught on channel 6793c0d3: org.dcache.xrootd.core.XrootdException: Authentication required^@.
13 May 2020 20:13:43 (pool2) [door:xrootd-gsi@dCacheDomain:AAWli4pCOBA xrootd-gsi PoolAcceptFile 000000D91FC66BCD48C18BF6973CFCC63E86] Transfer failed in post-processing: File size mismatch (expected=1000000000, actual=0)
13 May 2020 20:13:44 (pool1) [door:xrootd-gsi@dCacheDomain:AAWli4pDhBg xrootd-gsi PoolAcceptFile 0000E6C208C44F6D44C283F2E983135326CF] Transfer failed in post-processing: File size mismatch (expected=1000000000, actual=0)
13 May 2020 20:14:22 (pool2) [door:webdav-secure-grid@dCacheDomain:AAWli4yrf4A RemoteTransferManager PoolAcceptFile 0000166BF880780D44F3B611D74F2BBC02FD] Transfer failed: rejected GET: 403 Forbidden
13 May 2020 20:20:12 (pool1) [door:webdav-secure-grid@dCacheDomain:AAWli6F2/AA RemoteTransferManager PoolAcceptFile 00006BB2AC6A2FB544B6A986A0EB6F8DFFF9] Transfer failed: The peer's certificate with subject's DN CN=xrootd-local.unl.edu,OU=Holland Computing Center,O=University of Nebraska-Lincoln,L=Lincoln,ST=NE,C=US,DC=incommon,DC=org was rejected. The peer's certificate status is: FAILED The following validation errors were found:;error at position 2 in chain, problematic certificate subject: CN=COMODO RSA Certification Authority,O=COMODO CA Limited,L=Salford,ST=Greater Manchester,C=GB (category: NAMESPACE): Namespace definition for the certificate issuer (CN=AddTrust External CA Root,OU=AddTrust External TTP Network,O=AddTrust AB,C=SE) is not defined, and namespaces are configured to be required.
13 May 2020 20:21:54 (pool1) [] Exception caught on channel d7e9e089: org.dcache.xrootd.core.XrootdException: Unable to open /dpm/cern.ch/home/dteam/domatests/root/domatest/file9_bcaef027-d26d-44f7-a0a7-e05a32bb1753; Permission denied^@.
13 May 2020 20:22:29 (pool2) [] Exception caught on channel 8f10417f: org.dcache.xrootd.core.XrootdException: Authentication required^@.
13 May 2020 20:22:29 (pool2) [door:xrootd-gsi@dCacheDomain:AAWli6mVvCg xrootd-gsi PoolAcceptFile 00007BE3CFDC7B674152964D2DC32D243566] Transfer failed in post-processing: File size mismatch (expected=1000000000, actual=0)
13 May 2020 20:23:12 (pool2) [cleaner PoolRemoveFiles] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server,dead): Pool restart required: Internal repository error
13 May 2020 20:23:12 (pool2) [cleaner PoolRemoveFiles] Pool: pool2, fault occurred in repository: Internal repository error. Pool restart required: , cause: java.lang.IllegalArgumentException: Cannot set used space to a negative value.
13 May 2020 20:23:12 (pool2) [cleaner PoolRemoveFiles] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server,dead): Pool restart required: Internal repository error
13 May 2020 20:23:12 (pool2) [cleaner PoolRemoveFiles] Pool: pool2, fault occurred in repository: Internal repository error. Pool restart required: , cause: java.lang.IllegalArgumentException: Cannot set used space to a negative value.
13 May 2020 20:23:12 (pool2) [cleaner PoolRemoveFiles] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server,dead): Pool restart required: Internal repository error
13 May 2020 20:23:12 (pool2) [cleaner PoolRemoveFiles] Pool: pool2, fault occurred in repository: Internal repository error. Pool restart required: , cause: java.lang.IllegalArgumentException: Cannot set used space to a negative value.
13 May 2020 20:23:40 (pool2) [] Pool mode prevents flushing to nearline storage.
13 May 2020 20:24:40 (pool2) [] Pool mode prevents flushing to nearline storage.
Worryingly, the problem quickly reappeared after prometheus was reinstalled today. Each day at 06:00, all traces of dCache are removed, both from the machine's filesystem and from the database, and dCache is then installed afresh. Despite this, the problem reappeared in less than three minutes!
2020-05-14 06:00:34 Process terminated
2020-05-14 06:02:15 Launching /bin/java -server -Xmx512m -XX:MaxDirectMemorySize=512m -Dsun.net.inetaddr.ttl=1800 -Dorg.globus.tcp.port.range=20000,25000 -Dorg.dcache.dcap.port=0 -Dorg.dcache.ftp.log-aborted-transfers=true -Dorg.dcache.net.tcp.portrange=33115:33145 -Djava.security.krb5.realm= -Djava.security.krb5.kdc= -Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=/etc/dcache/gss.conf -Dcontent.types.user.table=/etc/dcache/content-types.properties -Dzookeeper.sasl.client=false -Dcurator-dont-log-connection-problems=true -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/dcache/pools-oom.hprof -XX:+UseCompressedOops -javaagent:/usr/share/dcache/classes/aspectjweaver-1.9.2.jar -Djava.awt.headless=true -DwantLog4jSetup=n -Ddcache.home=/usr/share/dcache -Ddcache.paths.defaults=/usr/share/dcache/defaults org.dcache.boot.BootLoader start pools
INFO - prometheus.conf:131: Property prometheus.java.options.extra is not a standard property
14 May 2020 06:02:31 (System) [] Client environment:zookeeper.version=3.5.7-f0fdd52973d373ffd9c86b81d99842dc2c7f660e, built on 02/10/2020 11:30 GMT
14 May 2020 06:02:31 (System) [] Client environment:host.name=prometheus.desy.de
[...]
14 May 2020 06:02:57 (pool2) [] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Awaiting initialization
14 May 2020 06:02:58 (pool2) [] Pool mode changed to disabled(store,stage,p2p-client,loading): Loading...
14 May 2020 06:02:58 (pool2) [] Reading inventory from Inotify[Checksum[IoStatistics[data=/scratch/pools/pool2/data;meta=/scratch/pools/pool2/meta]]].
14 May 2020 06:02:58 (pool2) [] Pool mode changed to enabled
14 May 2020 06:05:21 (pool1) [] Exception caught on channel a18ee027: org.dcache.xrootd.core.XrootdException: Authentication to lxfsre05a03.cern.ch:1095 failed; all protocols have been tried..
14 May 2020 06:05:21 (pool1) [] Exception caught on channel a18ee027: org.dcache.xrootd.core.XrootdException: Invalid request; user not authenticated^@.
14 May 2020 06:05:21 (pool1) [door:xrootd-gsi@dCacheDomain:AAWlk84JDwg xrootd-gsi PoolAcceptFile 0000EC04D328D318401E872A89F9692644A1] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server,dead): Pool restart required: Internal repository error
14 May 2020 06:05:21 (pool1) [door:xrootd-gsi@dCacheDomain:AAWlk84JDwg xrootd-gsi PoolAcceptFile 0000EC04D328D318401E872A89F9692644A1] Pool: pool1, fault occurred in repository: Internal repository error. Pool restart required: , cause: java.lang.IllegalArgumentException: Cannot set used space to a negative value.
This suggests there is some client activity that is triggering this problem.
Here's the billing log leading up to the problem:
05.14 06:05:17 [door:webdav-secure-web@dCacheDomain:request] ["uploader":0:0:2001:638:700:1005:0:0:1:95] [00002BFA2541D8494878A7CD90DEFC6CA9E2,385875] [/Music/Atomic cat/Trance 'n Dance/[cover] Atomic cat - Trance 'n Dance.jpg] test:default@osm 83 0 {0:""}
05.14 06:05:18 [pool:pool2:transfer] [00007D7AF82F95C441E48ACB64AF4F8EB608,6464277] [/Music/Atomic cat/Trance 'n Dance/04 - I want to fly (Dance-Trance).mp3] test:default@osm 6464277 157 true {Http-1.1:131.169.5.149:0:webdav-secure-web:dCacheDomain:/Music/Atomic cat/Trance 'n Dance/04 - I want to fly (Dance-Trance).mp3} [door:webdav-secure-web@dCacheDomain:AAWlk84BDEA:1589429118529000] {0:""}
05.14 06:05:18 [door:webdav-secure-web@dCacheDomain:request] ["uploader":0:0:2001:638:700:1005:0:0:1:95] [00007D7AF82F95C441E48ACB64AF4F8EB608,6464277] [/Music/Atomic cat/Trance 'n Dance/04 - I want to fly (Dance-Trance).mp3] test:default@osm 348 0 {0:""}
05.14 06:05:21 [pool:pool1@pools:remove] [0000EC04D328D318401E872A89F9692644A1,1000000000] [Unknown] test:default@osm {0:"Transfer failed and replica is empty"}
05.14 06:05:21 [pool:pool1:transfer] [0000EC04D328D318401E872A89F9692644A1,1000000000] [/VOs/dteam/tpc-test/root/domatest/file4_ca5c2be1-2c57-4e11-91b7-b6f7be6a2f1a] test:default@osm 0 1521 true {Xrootd-5.0:137.138.62.124:1094} [door:xrootd-gsi@dCacheDomain:AAWlk84JDwg:1589429119609000] {10004:"Post-processing failed: File size mismatch (expected=1000000000, actual=0)"}
05.14 06:05:21 [door:xrootd-gsi@dCacheDomain:request] ["/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=tbeerman/CN=722011/CN=Thomas Beermann":1001:1001:2001:1458:201:e4:0:0:100:578] [0000EC04D328D318401E872A89F9692644A1,1000000000] [/VOs/dteam/tpc-test/root/domatest/file4_ca5c2be1-2c57-4e11-91b7-b6f7be6a2f1a] test:default@osm 2049 0 {10004:"Post-processing failed: File size mismatch (expected=1000000000, actual=0)"}
05.14 06:05:21 [door:xrootd-gsi@dCacheDomain:remove] ["/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=tbeerman/CN=722011/CN=Thomas Beermann":1001:1001:unknown] [0000EC04D328D318401E872A89F9692644A1,0] [/VOs/dteam/tpc-test/root/domatest/file4_ca5c2be1-2c57-4e11-91b7-b6f7be6a2f1a] <Unknown> 0 0 {0:""}
05.14 06:05:25 [pool:pool2:transfer] [0000BD349EE80B3345C58F1BE3D3117CFAC3,270024111] [/Video/Talks/2016-04-05 - ITT 2016 - Kevlin Henney - Seven Ineffective Coding Habits of Many Programmers.mp4] test:default@osm 270024111 5290 true {Http-1.1:131.169.5.149:0:webdav-secure-web:dCacheDomain:/Video/Talks/2016-04-05 - ITT 2016 - Kevlin Henney - Seven Ineffective Coding Habits of Many Programmers.mp4} [door:webdav-secure-web@dCacheDomain:AAWlk84Ph0g:1589429119575000] {0:""}
I think this xrootd transfer is perhaps significant:
05.14 06:05:21 [pool:pool1@pools:remove] [0000EC04D328D318401E872A89F9692644A1,1000000000] [Unknown] test:default@osm {0:"Transfer failed and replica is empty"}
05.14 06:05:21 [pool:pool1:transfer] [0000EC04D328D318401E872A89F9692644A1,1000000000] [/VOs/dteam/tpc-test/root/domatest/file4_ca5c2be1-2c57-4e11-91b7-b6f7be6a2f1a] test:default@osm 0 1521 true {Xrootd-5.0:137.138.62.124:1094} [door:xrootd-gsi@dCacheDomain:AAWlk84JDwg:1589429119609000] {10004:"Post-processing failed: File size mismatch (expected=1000000000, actual=0)"}
05.14 06:05:21 [door:xrootd-gsi@dCacheDomain:request] ["/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=tbeerman/CN=722011/CN=Thomas Beermann":1001:1001:2001:1458:201:e4:0:0:100:578] [0000EC04D328D318401E872A89F9692644A1,1000000000] [/VOs/dteam/tpc-test/root/domatest/file4_ca5c2be1-2c57-4e11-91b7-b6f7be6a2f1a] test:default@osm 2049 0 {10004:"Post-processing failed: File size mismatch (expected=1000000000, actual=0)"}
05.14 06:05:21 [door:xrootd-gsi@dCacheDomain:remove] ["/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=tbeerman/CN=722011/CN=Thomas Beermann":1001:1001:unknown] [0000EC04D328D318401E872A89F9692644A1,0] [/VOs/dteam/tpc-test/root/domatest/file4_ca5c2be1-2c57-4e11-91b7-b6f7be6a2f1a] <Unknown> 0 0 {0:""}
It looks like the problem on prometheus is indeed a regression introduced with https://github.com/dCache/dcache/commit/45e4221b12d47ceab6f6bff8b51b9adc9f69945c. However, it doesn't explain the original issue.
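One plausible shape of such a regression, purely as an illustration (this is not the code touched by that commit), is a cleanup path that frees the file's announced size rather than the space that was actually accounted for the empty, failed replica:

// Illustrative only, not dCache code: freeing more space than was ever
// accounted for a replica (e.g. the announced 1 GB instead of the 0 bytes
// actually written) drives the used-space counter negative.
class AccountingBugSketch {
    private long used;

    long allocate(long bytes) {              // space actually accounted for the replica
        used += bytes;
        return bytes;
    }

    void removeFailedReplica(long announcedSize) {
        used -= announcedSize;               // bug: should free what was allocated instead
        if (used < 0) {
            throw new IllegalArgumentException(
                    "Cannot set used space to a negative value");
        }
    }

    public static void main(String[] args) {
        AccountingBugSketch account = new AccountingBugSketch();
        account.allocate(0);                 // failed upload: replica is empty
        try {
            account.removeFailedReplica(1_000_000_000L);  // frees the expected 1 GB
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());  // mirrors the message seen in the pool log
        }
    }
}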
@onnozweers Do you know which metadata repository was configured on your pools?
pool.plugins.meta=org.dcache.pool.repository.meta.db.BerkeleyDBMetaDataRepository (in dcache.conf and in the layout files)
Thanks.