Open ageorget opened 3 months ago
Hi,
We have commented this issue today in the dCache Tier-1 Meeting with @lemora and @kofemann. On Monday, our CMS representative complained that the SAM tests for Xrootd were failing, experiencing timeouts. But they were not failing all the time, but about 50% of the time. The first thing we found in the logs, is the TLS handshake errors you have commented. But, @XMol and I, after searching a lot for culprits, found a clear correlation with the start of the failing CMS Xrootd SAM tests. This was a sudden spike in the number of TCP connections in CLOSE_WAIT state on the Xrootd doors. After restarting the doors, the hanging TCP sockets were killed and the SAM Xrootd tests recovered. Why did it happen? Maybe a client sending too many transfers, a network glitch ... We do not know and we have been unable to find a clear reason in the logs. A detailed explanation, clearly indicating that sockets in CLOSE_WAIT state and timeouts in incoming connections are correlated, can be found in this article from 2016. So, @ageorget could you please verify that you do not have long lasting TCP sockets in CLOSE_WAIT state on your Xrootd doors?
I forgot to mention: dCache 9.2.22, openjdk 17 and RHEL-8.
Cheers, Samuel
Hi Samuel,
Good to know, we also experienced timeouts with CMS SAM tests for Xrootd since yesterday afternoon and same observation, 50% of SAM tests failed with timeouts
Checking read of SAM dataset test file
root://u04@ccxrdcms.in2p3.fr:1094//pnfs/in2p3.fr/data/cms/disk/data/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/CE860B10-5D76-E711-BCA8-FA163EAA761A.root
[E] xrootd file open error, XRootDStatus.code=306 "[FATAL] Redirect limit has been reached"
https://monit-grafana.cern.ch/goto/UgDTDSrIg?orgId=20
But I don't know if it's related with this SSL/TLS issue because I found these errors in the Xrootd door log since the end of May...
Stephan from CMS just answered that the ETF team may have picked up newer, broken xrootd client version yesterday. They are trying to downgrade it to fix the SAM tests.
Adrien
Hi Adrien,
I am not sure, because we had a Full Site Downtime last week and we reinstalled all dCache servers hosting the doors, and we do not save the door logs, but I think we have also seen these TLS handshake errors in the past on the doors. Do you happen to have historical access to TCP states on the Xrootd doors? Just to see if you also have hanging sockets with CLOSE_WAIT status during the time, the SAM tests were experiencing timeouts. As commented before, we do not know the reason for it, but your comment about a broken xrootd client, could help identify the origin of the problem ;-)
Cheers, Samuel
Do you happen to have historical access to TCP states on the Xrootd doors? Just to see if you also have hanging sockets with CLOSE_WAIT status during the time, the SAM tests were experiencing timeouts.
No I don't see any suspect CLOSE_WAIT TCP connections on the xrootd doors :
Just to add this discussin. We see the same errors from etf-07.cern.ch hosts. We also see spikes of CLOSE_WAIT connections during which xrootd door uses up all available thread workers I am investigating
We upgraded to Alma9. And deployed 9.2.22. We have not seen CLOSE_WAIT connections in large numbers before upgrade.
Another bit - the "no SSL/TLS" errors definitely come from SAM test and do not seem to correlate with CLOSE_WAIT situation. I started to monitor threads used by XrootD and spikes occur about may be once in 2 days whereas these "no SSL/TLS" errors occur more frequent. To me it looks like server "thinks" that client is not using SSL connection. (like say you would use http instead of https in case if WebDAV).
We do not restart the door. It recoveres by itself.
Dmitry
The strange thing is that access log reports the transaction as OK and nothing else:
level=INFO ts=2024-07-31T14:10:40.712+0200 event=org.dcache.xrootd.connection.start session=door:xrootd-ccdcacli501@xrootd-ccdcacli501Domain:AAYeifuYzUA socket.remote=128.142.196.239:15788 socket.local=134.158.209.241:1094 level=INFO ts=2024-07-31T14:10:40.714+0200 event=org.dcache.xrootd.request session=door:xrootd-ccdcacli501@xrootd-ccdcacli501Domain:AAYeifuYzUA request=protocol response=ok level=INFO ts=2024-07-31T14:10:40.719+0200 event=org.dcache.xrootd.connection.end session=door:xrootd-ccdcacli501@xrootd-ccdcacli501Domain:AAYeifuYzUA
Adrien
When connection to xrootd door is made there are few things being exchanged. I think all that means is that protocol call has succeeded. But then the server threw exception causing no proper entry in the log.
This is time distrubution of errors on xrootd door on CMS site:
select date_trunc('hour', datestamp) as time, count(*), sum(case when errorcode <> 0 then 1 else 0 end) as errors from billinginfo where datestamp > current_date and initiator like 'door:Xrootd-%' group by time order by time asc;
time | count | errors
------------------------+-------+--------
2024-07-31 00:00:00-05 | 10045 | 19
2024-07-31 01:00:00-05 | 10782 | 9
2024-07-31 02:00:00-05 | 8972 | 15
2024-07-31 03:00:00-05 | 12413 | 22
2024-07-31 04:00:00-05 | 15987 | 17
2024-07-31 05:00:00-05 | 15684 | 21
2024-07-31 06:00:00-05 | 20845 | 23
2024-07-31 07:00:00-05 | 16651 | 16
2024-07-31 08:00:00-05 | 12087 | 18
2024-07-31 09:00:00-05 | 12716 | 19
2024-07-31 10:00:00-05 | 18714 | 14
2024-07-31 11:00:00-05 | 2398 | 52
(12 rows)
We experienced spike in thread count on Xrootd door at about 2 AM. All errors between 2 AM and 4 AM looked like:
connectiontime | errormessage
----------------+-----------------------------------------------------------------------------------
1572899 | Client disconnected without closing file.
5285789 | Client disconnected without closing file.
2066893 | Client disconnected without closing file.
3363181 | Post-processing failed: File size mismatch (expected=1014285505, actual=92274688)
3913444 | Client disconnected without closing file.
2919113 | Client disconnected without closing file.
2729012 | Client disconnected without closing file.
3629109 | Client disconnected without closing file.
5069413 | Client disconnected without closing file.
300000 | No connection from client after 300 seconds. Giving up.
3937066 | Client disconnected without closing file.
867855 | Client disconnected without closing file.
3048245 | Client disconnected without closing file.
2413614 | Client disconnected without closing file.
6421148 | Client disconnected without closing file.
2246886 | Client disconnected without closing file.
104141 | Client disconnected without closing file.
2622108 | Post-processing failed: File size mismatch (expected=104937297, actual=50331648)
1927925 | Post-processing failed: File size mismatch (expected=1014285505, actual=41943040)
997368 | Client disconnected without closing file.
4115686 | Post-processing failed: File size mismatch (expected=303559167, actual=100663296)
3136051 | Post-processing failed: File size mismatch (expected=3996099675, actual=67108864)
2230348 | Post-processing failed: File size mismatch (expected=129602261, actual=50331648)
5012781 | Post-processing failed: File size mismatch (expected=303513742, actual=117440512)
2040690 | Post-processing failed: File size mismatch (expected=366576405, actual=33554432)
430277 | Client disconnected without closing file.
1143861 | Client disconnected without closing file.
1312073 | Client disconnected without closing file.
2132199 | Client disconnected without closing file.
212419 | Client disconnected without closing file.
291383 | Client disconnected without closing file.
2243652 | Post-processing failed: File size mismatch (expected=3984752381, actual=41943040)
1905640 | Post-processing failed: File size mismatch (expected=366576405, actual=41943040)
2311224 | Post-processing failed: File size mismatch (expected=3996099675, actual=58720256)
1921058 | Post-processing failed: File size mismatch (expected=91600215, actual=50331648)
3827332 | Post-processing failed: File size mismatch (expected=2213161106, actual=75497472)
(36 rows)
Again, this is NOT related to SSL/TLS issue but, I think this is related to CLOSE_WAIT
https://github.com/DmitryLitvintsev/scripts/blob/master/bash/dcache/dcache_thread_count.sh
I use the above script to monitor thread counts and I trigger thread dump of domain if thread count > 900 plus netstat -apn
(this is because default xrootd.limits.threads = 1000
if you have higher setting in your layout/config adjust the threshold
correspondingly)
I would like you to run this script at your site.
Well, the errors stopped Yesterday afternoon as soon as CMS rollbacked their Xrootd client version I think.
time | count | errors
------------------------+-------+--------
2024-07-31 09:00:00+02 | 869 | 208
2024-07-31 10:00:00+02 | 2358 | 253
2024-07-31 11:00:00+02 | 906 | 243
2024-07-31 12:00:00+02 | 2418 | 192
2024-07-31 13:00:00+02 | 1375 | 112
2024-07-31 14:00:00+02 | 1639 | 139
2024-07-31 15:00:00+02 | 716 | 20
2024-07-31 16:00:00+02 | 1913 | 48
2024-07-31 17:00:00+02 | 5279 | 0
2024-07-31 18:00:00+02 | 1410 | 0
2024-07-31 19:00:00+02 | 1296 | 10
2024-07-31 20:00:00+02 | 2070 | 11
2024-07-31 21:00:00+02 | 1607 | 2
2024-07-31 22:00:00+02 | 1655 | 2
2024-07-31 23:00:00+02 | 1738 | 0
2024-08-01 00:00:00+02 | 5221 | 10
2024-08-01 01:00:00+02 | 5946 | 20
2024-08-01 02:00:00+02 | 6517 | 9
2024-08-01 03:00:00+02 | 5867 | 11
2024-08-01 04:00:00+02 | 5677 | 4
2024-08-01 05:00:00+02 | 4668 | 5
2024-08-01 06:00:00+02 | 3168 | 2
2024-08-01 07:00:00+02 | 2844 | 3
2024-08-01 08:00:00+02 | 1642 | 2
2024-08-01 09:00:00+02 | 1309 | 17
2024-08-01 10:00:00+02 | 1189 | 0
2024-08-01 11:00:00+02 | 1306 | 0
2024-08-01 12:00:00+02 | 1752 | 5
2024-08-01 13:00:00+02 | 1547 | 3
2024-08-01 14:00:00+02 | 1213 | 10
Now I only got errors on doors like Server: TLS handshake failed: java.nio.channels.ClosedChannelException.
Same behaviour at KIT. The errors with TLS handshake failed: not an SSL/TLS record
stopped yesterday around 16:20 h (CEST).
@ageorget @samuambroj Thanks for the update. Nonetheless, we (dcache) should find out what's going on there anyway....
The SSL errors looked liked non-ssl connection made to SSL expecting host. I think we need to handle that exception, so we don't generate stack trace. I will look into it.
The issue with multiple CLOSE_WAITs and peek CPU utilization on the xrootd server for CMS AAA is believed to be caused by a bug in OpenSSL 3.0. (This is a c library to extract voms information at voms_api)
We had a similar issue on our site and found that the xrootd configuration after removing the voms extract part was not generating CLOSE_WAIT sockets.
Related openssl issues and PRs were finalized in July. We are currently testing the openssl 3.2-dev branch instead of latest release.
For one week, we have not experienced close_wait and increased CPU utilization in xrootd.
However, we have recently experienced authentication-related errors in dcache and some users have reported TLS-related errors. Therefore, the CLOSE_WAIT issue and the TLS error issue seem to have different causes.
Hi,
For the last days, we observed more and more SSL/TLS errors on the CMS XRootD doors like this without any changes on dCache side (9.2.22 on doors, Java openjdk version "17.0.10") :
The strange thing is that access log reports the transaction as OK and nothing else:
Adrien