Regarding the first item, I scanned the APS logs for requests with status not equal to 200, e.g.
grep proxy-resp /cephfs/testbed/aps-logs/auth-proxy-server.log_x509-proxy-server-*20240627 | egrep -v " 200 GET | 200 POST | 200 PUT |no-auth-cert| 200 HEAD | 200 DELETE " | wc -l
0
The no-auth-cert requests come from CERN IT penetration tests and, as you can see above, I found 0 requests with a status other than 200. The total number of records for XPS (APS running on port 8443 with x509) is 313K. For 20240626 we have 564K entries in the XPS logs and, again, 0 requests with non-200 status. Therefore, I would declare the APS log checking complete.
For completeness, in the APS logs (port 443) we have 165K entries for 20240627 and 67K for 20240626, again with 0 non-200 requests.
Also, thanks to Nikodemas, we now have an initial monitoring dashboard for APS, which you can check out over here. We are adding more content to it, and once complete it will show performance metrics of the APS/XPS servers.
Thanks Valentin and Nikodemas! If you have fully validated any of the items in the check-list, feel free to tick them off.
And if you see anything else missing from the WM perspective, please share it with us so that we can add it to the check-list as well.
Alan, here is a report from all logs, including services and microservices, using today's logs (20240627):
reqmon 200 OK
workqueue ok
t0_reqmon_tasks 200 OK
t0_reqmon 200 OK
reqmon-tasks 200 OK
reqmon 200 OK
reqmgr2ms-ms-unmer-t2t3us 200 OK, but I saw
2024-06-27 00:58:50,587:ERROR:MSUnmerged: Could not fetch statistics from Rucio Consistency Monitor. Skipping the current run.
reqmgr2ms-ms-unmer-t2t3 200 OK, but I saw
2024-06-27 00:58:50,587:ERROR:MSUnmerged: Could not fetch statistics from Rucio Consistency Monitor. Skipping the current run.
reqmgr2ms-20240627-ms-unmer-t1 200 OK
reqmgr2ms-20240627-ms-transferor 200 OK, but I saw
2024-06-27 00:02:19,405:ERROR:PycurlRucio: Failure in getBlocksAndSizeRucio function for container /RelValTTbar_14TeV/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM. Response: {'url': 'http://xxx-rucio.cern.ch/dids/cms/dids/search?type=dataset&long=True&name=/RelValTTbar_14TeV/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM%23%2A', 'data': '', 'headers': 'HTTP/1.1 200 OK\r\nDate: Thu, 27 Jun 2024 00:02:19 GMT\r\nContent-Type: application/x-json-stream\r\nContent-Length: 0\r\nConnection: keep-alive\r\nAccess-Control-Allow-Origin: None\r\nAccess-Control-Allow-Headers: None\r\nAccess-Control-Allow-Methods: *\r\nAccess-Control-Allow-Credentials: true\r\nCache-Control: post-check=0, pre-check=0\r\nPragma: no-cache\r\nX-Rucio-Host: xxx-rucio.cern.ch\r\n\r\n'}
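Note that the Rucio reply in the record above is a 200 OK with an empty body (Content-Length: 0), i.e. the getBlocksAndSizeRucio failure is about missing data rather than an HTTP error. A minimal, hypothetical sketch of the kind of check a caller needs (not the actual WMCore code):
def parse_rucio_reply(status, body):
    # a 200 status with no payload still has to be treated as a failure
    if status == 200 and not body:
        raise ValueError("empty 200 response from Rucio, treating as failure")
    return body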
reqmgr2ms-20240627-ms-rulecleaner 200 OK
reqmgr2ms-20240627-ms-pileup-tasks 200 OK
reqmgr2ms-20240627-ms-pileup 200 OK, but
grep -v "200 OK" /cephfs/testbed/dmwm-logs/reqmgr2ms-20240627-ms-pileup-596fb7bdc7-lgrpd.log | grep -i HTTP
...
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/WMCore/MicroService/MSPileup/MSPileupData.py", line 157, in createPileup
obj = MSPileupObj(pdict, validRSEs=rseList)
File "/usr/local/lib/python3.8/site-packages/WMCore/MicroService/MSPileup/DataStructs/MSPileupObj.py", line 71, in __init__
raise Exception(msg)
Exception: MSPileup input is invalid, expectedRSEs value ['T1_US_FNAL_Disk', 'T2_CH_CERN'] is not in validRSEs ['T2_US_Nebraska', 'T2_US_UCSD', 'T2_US_Wisconsin', 'T2_US_Purdue']
....
[27/Jun/2024:18:50:39] ms-pileup-596fb7bdc7-lgrpd 188.185.123.179:28808 "POST /ms-pileup/data/pileup HTTP/1.1" 400 Bad Request [data: 1630 in 931 out 161494 us ] [auth: ok "Robot t1" "" ] [ref: "https://xxx-testbed.cern.ch" "PycURL/7.45.2 libcurl/7.74.0 OpenSSL/1.1.1n zlib/1.2.11 brotli/1.0.9 libidn2/2.3.0 libpsl/0.21.0 (+libidn2/2.3.0) libssh2/1.9.0 nghttp2/1.43.0 librtmp/2.3" ]
...
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/WMCore/MicroService/MSPileup/MSPileupData.py", line 157, in createPileup
obj = MSPileupObj(pdict, validRSEs=rseList)
File "/usr/local/lib/python3.8/site-packages/WMCore/MicroService/MSPileup/DataStructs/MSPileupObj.py", line 71, in __init__
raise Exception(msg)
Exception: MSPileup input is invalid, expectedRSEs value ['T1_US_FNAL_Disk', 'T2_CH_CERN'] is not in validRSEs ['T2_US_Nebraska', 'T2_US_UCSD', 'T2_US_Wisconsin', 'T2_US_Purdue']
...
[27/Jun/2024:18:52:00] ms-pileup-596fb7bdc7-lgrpd IPAddress:12756 "POST /ms-pileup/data/pileup HTTP/1.1" 400 Bad Request [data: 1622 in 931 out 90621 us ] [auth: ok "Robot t1" "" ] [ref: "https://xxx-testbed.cern.ch" "PycURL/7.45.2 libcurl/7.74.0 OpenSSL/1.1.1n zlib/1.2.11 brotli/1.0.9 libidn2/2.3.0 libpsl/0.21.0 (+libidn2/2.3.0) libssh2/1.9.0 nghttp2/1.43.0 librtmp/2.3" ]
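The 400 Bad Request entries above are expected behavior: the MSPileupObj constructor rejects input whose expectedRSEs are not all contained in the configured validRSEs. An illustrative re-implementation of that check (a simplified model, not the actual WMCore code):
def check_expected_rses(expectedRSEs, validRSEs):
    # every expected RSE must be among the valid ones, else reject the input
    missing = [rse for rse in expectedRSEs if rse not in validRSEs]
    if missing:
        raise Exception(f"MSPileup input is invalid, expectedRSEs value "
                        f"{expectedRSEs} is not in validRSEs {validRSEs}")

# reproduces the error above:
# check_expected_rses(['T1_US_FNAL_Disk', 'T2_CH_CERN'],
#                     ['T2_US_Nebraska', 'T2_US_UCSD', 'T2_US_Wisconsin', 'T2_US_Purdue'])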
reqmgr2ms-20240627-ms-output 200 OK, but there are rucio failures, e.g.
2024-06-27 19:01:26,788:DEBUG:connectionpool: http://xxx-rucio-int.cern.ch:port "GET /rses/?expression=rse_type%3DTAPE%5C%28cms_type%3Dtest%7Crse%3DT1_RU_JINR_Tape%29 HTTP/1.1" 401 106
reqmgr2ms-20240627-ms-monitor 200 OK, but there are rucio failures, e.g.
2024-06-27 18:56:11,471:DEBUG:connectionpool: http://xxx-rucio.cern.ch:80 "GET /rules/ffd9fe4a39ff4b7abe26d8496d66874f HTTP/1.1" 404 116
reqmgr2-20240627-reqmgr2-tasks 200 OK
reqmgr2-20240627-reqmgr2 200 OK, but
[27/Jun/2024:07:57:34] reqmgr2-67b86b4556-gt426 IPAddress:13572 "DELETE /reqmgr2/data/transferinfo/spiga_task_BPH-RunIISummer20UL18GEN-Backfill-00262__v1_T_240622_115214_9893 HTTP/1.1" 411 Length Required [data: 1464 in - out 402 us ] [auth: ok "Robot: WmCore Service Account" "" ] [ref: "https://xxx-testbed.cern.ch" "WMCore.Services.Requests/v002" ]
...
[27/Jun/2024:14:21:58] reqmgr2-67b86b4556-gt426 IPAddress:50420 "GET /reqmgr2/data/wrong_endpoint HTTP/1.1" 404 Not Found [data: 5723 in 723 out 1013 us ] [auth: ok "Robot: CMS Build Bot" "" ] [ref: "https://cmsweb-testbed.cern.ch" "PycURL/7.45.1 libcurl/7.59.0 OpenSSL/1.0.2k zlib/1.2.8" ]
...
[27/Jun/2024:14:21:58] reqmgr2-67b86b4556-gt426 IPAddress:50428 "HEAD /reqmgr2/data/wrong_endpoint HTTP/1.1" 404 Not Found [data: 10123 in 723 out 937 us ] [auth: ok "Robot: CMS Build Bot" "" ] [ref: "https://xxx-testbed.cern.ch" "PycURL/7.45.1 libcurl/7.59.0 OpenSSL/1.0.2k zlib/1.2.8" ]
Bottom line: I found only one "missing" API:
/reqmgr2/data/wrong_endpoint HTTP/1.1" 404 Not Found
and one DELETE API call with a wrong parameter:
DELETE /reqmgr2/data/transferinfo/spiga_task_BPH-RunIISummer20UL18GEN-Backfill-00262__v1_T_240622_115214_9893 HTTP/1.1" 411 Length Required
In this HTTP call the client did not provide a content length, see the 411 Length Required status. It seems the issue is the WM client improperly constructing the HTTP request, which the APS server then rejects based on the HTTP specs.
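For reference, a minimal sketch of how a client can avoid the 411 rejection by always supplying a Content-Length with a body-less DELETE (hypothetical task name, authentication omitted):
import http.client

# passing body=b"" makes http.client emit "Content-Length: 0", which
# satisfies the proxy's HTTP checks for DELETE requests without a body
conn = http.client.HTTPSConnection("cmsweb-testbed.cern.ch", 8443)
conn.request("DELETE", "/reqmgr2/data/transferinfo/some_task_name", body=b"")
resp = conn.getresponse()
print(resp.status, resp.reason)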
Now moving slowly through the component logs. WMAgent logs:
http.client.HTTPException: url=https://xxxx-testbed.cern.ch:port/couchdb/wmstats/_design/WMStats/_update/agentInfo/vocms0262.cern.ch?data_last_update=1699171475&data_error=ok, code=409,
# the error above indicates a conflict; status code 409 means StatusConflict (RFC 9110, section 15.5.10)
http.client.HTTPException: url=https://xxxx-testbed.cern.ch:port/couchdb/wmstats/_design/WMStats/_view/byAgentURL?reduce=false&stale=update_after, code=503, reason=Service Unavailable
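Both records point at transient conditions: 409 is a CouchDB document-update conflict and 503 a temporarily unavailable backend. A hedged sketch of how a caller may distinguish them (illustrative, not the WMAgent code):
import time
import urllib.error
import urllib.request

def fetch_with_retries(url, retries=3, backoff=2.0):
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as exc:
            if exc.code == 503 and attempt < retries - 1:
                # transient: back off and retry the same request
                time.sleep(backoff * (attempt + 1))
                continue
            if exc.code == 409:
                # CouchDB update conflict: the document changed under us,
                # so the caller must refetch the current _rev and retry
                raise RuntimeError("CouchDB conflict, refetch _rev first") from exc
            raise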
- ArchiveDataReporter reports the following error:
url=https://cmsweb-testbed.cern.ch:8443/wmarchive/data/, code=502,
2023-11-08 15:40:46,352:139888411641600:ERROR:ArchiveDataPoller:url=https://xxxx-testbed.cern.ch:port/wmarchive/data/, code=502, reason=Proxy Error, headers={'Date': 'Wed, 08 Nov 2023 14:38:35 GMT', 'Server': 'Apache', 'Content-Length': '556', 'Content-Type': 'text/html; charset=iso-8859-1'}, result=b'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n
\nThe proxy server received an invalid\r\nresponse from an upstream server.
\r\nThe proxy server could not handle the request POST /auth/complete/wmarchive/data/.
\nReason: Error reading from remote server
\nAdditionally, a 100 Continue\nerror was encountered while trying to use an ErrorDocument to handle the request.
\n\n'
From the above error I see that the service used `/auth/complete/wmarchive/data` instead of `/wmarchive/data`. But it also says `Server: Apache`, which I don't know where it comes from. Do we have an additional Apache somewhere?
- DBS3Upload is ok, there are no HTTP exceptions
- ErrorHandler is ok, but shows
2023-11-20 15:45:24,887:139888412690176:WARNING:GenericDataCache:Passive failure while looking data up in the memory cache. Error: url=https://xxxx-testbed.cern.ch:port/reqmgr2/data/wmagentconfig/abc.cern.ch, code=500, reason=Internal Server Error, headers={'Date': 'Tue, 30 Apr 2024 07:23:54 GMT', 'Server': 'CherryPy/18.8.0', 'Content-Type': 'text/html;charset=utf-8', 'X-Rest-Status': '400', 'X-Error-Http': '500', 'X-Error-Id': 'db9df5d78f20f4fca074b802a0afa170', 'X-Error-Detail': 'Server error', 'X-Rest-Time': '20211.458 us', 'Content-Length': '742', 'Vary': 'Accept-Encoding', 'CMS-Server-Time': 'D=79233 t=1714461834294807', 'Connection': 'close'}, result=b'<!DOCTYPE html PUBLIC\n"-//W3C//DTD XHTML 1.0 Transitional//EN"\n"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n
\n \nServer error\n \n
It seems that this error is not related to APS, as I checked it just now and it provides a correct HTTP response. From the error itself I conclude that it was an issue with the backend.
- JobAccountant is ok
- JobArchiver is ok, but shows an error similar to the ErrorHandler one with the backend service:
2024-04-30 09:28:57,974:140171168171776:WARNING:GenericDataCache:Passive failure while looking data up in the memory cache. Error: url=https://xxxx-testbed.cern.ch:port/reqmgr2/data/wmagentconfig/abc.cern.ch, code=500, reason=Internal Server Error, headers={'Date': 'Tue, 30 Apr 2024 07:28:57 GMT', 'Server': 'CherryPy/18.8.0', 'Content-Type': 'text/html;charset=utf-8', 'X-Rest-Status': '400', 'X-Error-Http': '500', 'X-Error-Id': '64891f55ea40c4b3ce52d0225b843c74', 'X-Error-Detail': 'Server error', 'X-Rest-Time': '90744.972 us', 'Content-Length': '742', 'Vary': 'Accept-Encoding', 'CMS-Server-Time': 'D=96524 t=1714462137876927', 'Connection': 'close'}, result=b'<!DOCTYPE html PUBLIC\n"-//W3C//DTD XHTML 1.0 Transitional//EN"\n"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n
\n \nServer error\n \n
- JobCreator is ok
- JobStatusLite is ok
- JobSubmitter shows some errors with the `/reqmgr2/data/request` end-point, but I checked them now and it seems fine (the timestamps of these errors are quite old though):
2024-04-30 09:27:27,856:140171168433920:WARNING:GenericDataCache:Passive failure while looking data up in the memory cache. Error: url=https://xxxx-testbed.cern.ch:port/reqmgr2/data/request?status=aborted&status=aborted-completed&status=force-complete&detail=False, code=500, reason=Internal Server Error, headers={'Date': 'Tue, 30 Apr 2024 07:27:27 GMT', 'Server': 'CherryPy/18.8.0', 'Content-Type': 'text/html;charset=utf-8', 'X-Rest-Status': '400', 'X-Error-Http': '500', 'X-Error-Id': 'fd2a75f531723f03527b3445316cd05d', 'X-Error-Detail': 'Server error', 'X-Rest-Time': '36428.452 us', 'Content-Length': '742', 'Vary': 'Accept-Encoding', 'CMS-Server-Time': 'D=42437 t=1714462047813356', 'Connection': 'close'}, result=b'<!DOCTYPE html PUBLIC\n"-//W3C//DTD XHTML 1.0 Transitional// EN"\n"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n
\n \nServer error\n \n
This has already been privately shared with Valentin, but just to record it here as well. Harvesting jobs uploading root files to cmsweb-testbed are actually failing with:
=> URL: https://cmsweb-testbed.cern.ch/dqm/dev
urllib.error.URLError: <urlopen error [Errno 104] Connection reset by peer>
Meanwhile, other workflows uploading root files to the cmsweb endpoint are successful.
Here is a job log for the failed harvesting job (from vocms0192): https://amaltaro.web.cern.ch/forValentin/wmagentJob.log
Another problem, now involving the DBS server. I experienced a similar (or maybe the same) error yesterday when migrating datasets from production to testbed. Now I found this error in DBS3Upload, during a block injection operation:
2024-06-28 13:27:12,313:139769517713152:ERROR:DBSUploadPoller:Hit a general exception while inserting block /RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-RecoNano_2021_TC_EL8_June2024_Val_Alanv2-v11/DQMIO#a94ce7bb-7337-4512-af1e-9564c4980465. Error: (92, 'HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)')
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/WMComponent/DBS3Buffer/DBSUploadPoller.py", line 94, in uploadWorker
dbsApi.insertBulkBlock(blockDump=block)
File "/usr/local/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 647, in insertBulkBlock
result = self.__callServer("bulkblocks", data=blockDump, callmethod='POST' )
File "/usr/local/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 474, in __callServer
self.http_response = method_func(self.url, method, params, data, request_headers)
File "/usr/local/lib/python3.8/site-packages/RestClient/RestApi.py", line 42, in post
return http_request(self._curl)
File "/usr/local/lib/python3.8/site-packages/RestClient/RequestHandling/HTTPRequest.py", line 56, in __call__
curl_object.perform()
pycurl.error: (92, 'HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)')
It seems this issue is resolved with the recent upgrade of APS, which includes proper handling of gzipped responses from the BE server; see this commit: https://github.com/dmwm/auth-proxy-server/commit/5062c5ac04558a52efa55c9a77d2567200b10d04 The new APS version which does not show this error is 0.2.74.
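In other words, the proxy was not decoding the gzip-encoded payload coming back from the DBS backend. For illustration, a minimal sketch of correct gzip response handling on the client side (hypothetical endpoint):
import gzip
import urllib.request

# request gzip explicitly and decompress the body when the server marks
# it as gzip-encoded; this is the kind of decoding the fix addresses
req = urllib.request.Request("https://cmsweb-testbed.cern.ch/some/endpoint",
                             headers={"Accept-Encoding": "gzip"})
with urllib.request.urlopen(req) as resp:
    body = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)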
Regarding the DQM upload failures: this seems to be incorrect redirect rules. Aroosha changed the DQM BE from vocms0731 to vocms0730 and did not propagate the change to APS. I updated the APS configs with the new BE nodes, and we should verify whether this issue is gone. @amaltaro it would be nice if you could re-run the workflows, as I don't see any updates in wmagentJob.log on vocms0192.
I am going through the component logs in the vocms0261 agent and the only issue I found relates to the DBS3Upload component, which I thought was no longer supposed to be an issue:
2024-07-09 15:24:00,528:140329280993024:INFO:DBSUploadPoller:About to call insert block for: /RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-RecoNano_2021_SC_EL8_June2024_Val_Alanv6-v11/NANOAODSIM#b11adb0d-cb13-4f7c-8b9a-ca8dcc0d1892
2024-07-09 15:24:01,306:140329280993024:ERROR:DBSUploadPoller:Hit a general exception while inserting block /RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-RecoNano_2021_SC_EL8_June2024_Val_Alanv6-v11/GEN-SIM-RECO#125bac2f-5ea7-4dc4-be5d-2f044b4a2508. Error: (92, 'HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)')
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/WMComponent/DBS3Buffer/DBSUploadPoller.py", line 94, in uploadWorker
dbsApi.insertBulkBlock(blockDump=block)
File "/usr/local/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 647, in insertBulkBlock
result = self.__callServer("bulkblocks", data=blockDump, callmethod='POST' )
File "/usr/local/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 474, in __callServer
self.http_response = method_func(self.url, method, params, data, request_headers)
File "/usr/local/lib/python3.8/site-packages/RestClient/RestApi.py", line 42, in post
return http_request(self._curl)
File "/usr/local/lib/python3.8/site-packages/RestClient/RequestHandling/HTTPRequest.py", line 56, in __call__
curl_object.perform()
pycurl.error: (92, 'HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)')
I also saw an error with CouchDB replication yesterday afternoon around 6pm CERN time, but that self-recovered (maybe there was another APS intervention yesterday).
@vkuznet can you please check whether the fix for compressed responses got lost in your pipeline and/or whether you deployed an outdated set of rules?
@vkuznet other than the transient DBS3Upload (DBS Server) issue, I think we covered all the validation items in this ticket. Please let me know if you see anything missing and/or that needs extra attention, otherwise we might as well close this out.
Finally, I identified the issue behind the DBS3Upload mystic message pycurl.error: (92, 'HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)'). Turns out we had four distinct issues, among them:
- httputil: ReverseProxy read error during body copy: gzip: invalid header. This issue I fixed directly in the DBS codebase via the following commit: https://github.com/dmwm/dbs2go/commit/8effd5a6bcb1c5b169348e3ac886891ad3aa1a2a
- a 200 code in APS logs while the BE reported a 400 code. The fix was made in the APS logging module, see the following commit: https://github.com/dmwm/auth-proxy-server/commit/6c486c7aeb299116b02049cb239d6b14105c869c
With these changes in place, I finally tested APS (the new version has been deployed on cmsweb-testbed) against the DBS3Upload code base using the following stand-alone code:
#!/usr/bin/env python
import json
from dbs.apis.dbsClient import DbsApi

# user credentials and target DBS instance
ckey = '/home/valya/.globus/userkey.pem'
cert = '/home/valya/.globus/usercert.pem'
dbsUrl = 'https://cmsweb-testbed.cern.ch:8443/dbs/int/global/DBSWriter'
finput = '/tmp/dbsuploader_block.json'
useGzip = True  # and I tested with False as well

dbsApi = DbsApi(url=dbsUrl, key=ckey, cert=cert, debug=0, useGzip=useGzip)
with open(finput, 'r') as istream:
    data = json.loads(istream.read())
results = dbsApi.insertBulkBlock(blockDump=data)
print(results)
When this code is executed, it now properly reports the 400 status code and prints a proper DBS client exception, e.g.
python ./dbs_test.py
DBS Server error: [{'error': {'reason': 'nil', 'message': 'Block /RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-GenSimFull_TC_EL8_June2024_Val_Alanv7-v11/GEN-SIM#6d85e039-0943-43d2-befc-1f81a4d64400 already exists', 'function': 'dbs.bulkblocks.checkBlockExist', 'code': 128, 'stacktrace': '\ngoroutine 1211 [running]:\ngithub.com/dmwm/dbs2go/dbs.Error({0x0?, 0x0?}, 0x80, {0xc000636aa0, 0x91}, {0xa50b2c, 0x1e})\n\t/go/src/github.com/vkuznet/dbs2go/dbs/errors.go:185 +0x99\ngithub.com/dmwm/dbs2go/dbs.checkBlockExist({0xc00053b380, 0x7c}, {0xc00042d440, 0x40})\n\t/go/src/github.com/vkuznet/dbs2go/dbs/bulkblocks2.go:457 +0x20c\ngithub.com/dmwm/dbs2go/dbs.(*API).InsertBulkBlocksConcurrently(0xc00050e150)\n\t/go/src/github.com/vkuznet/dbs2go/dbs/bulkblocks2.go:526 +0x705\ngithub.com/dmwm/dbs2go/web.DBSPostHandler({0xb2f790, 0xc00011cab0}, 0xc000592c60, {0xa3e07d, 0xa})\n\t/go/src/github.com/vkuznet/dbs2go/web/handlers.go:562 +0x109e\ngithub.com/dmwm/dbs2go/web.BulkBlocksHandler({0xb2f790?, 0xc00011cab0?}, 0xc00068fcb0?)\n\t/go/src/github.com/vkuznet/dbs2go/web/handlers.go:978 +0x3b\nnet/http.HandlerFunc.ServeHTTP(0x0?, {0xb2f790?, 0xc00011cab0?}, 0x11?)\n\t/usr/local/go/src/net/http/server.go:2171 +0x29\ngithub.com/dmwm/dbs2go/web.limitMiddleware.func1({0xb2f790?, 0xc00011cab0?}, 0xc00007aff0?)\n\t/go/src/github.com/vkuznet/'}, 'http': {'method': 'POST', 'code': 400, 'timestamp': '2024-07-12 14:01:47.313558311 +0000 UTC m=+962.019408520', 'path': '/dbs/int/global/DBSWriter/bulkblocks', 'user_agent': 'DBSClient/Unknown/', 'x_forwarded_host': 'cmsweb-testbed.cern.ch', 'x_forwarded_for': '188.184.103.72:49182, 188.184.103.72', 'remote_addr': '10.100.148.128:49984'}, 'exception': 400, 'type': 'HTTPError', 'message': 'DBSError Code:128 Description:Not defined Function:dbs.bulkblocks.checkBlockExist Message:Block /RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-GenSimFull_TC_EL8_June2024_Val_Alanv7-v11/GEN-SIM#6d85e039-0943-43d2-befc-1f81a4d64400 already exists Error: nil'}]
Traceback (most recent call last):
File "/home/valya/venv/lib64/python3.9/site-packages/dbs/apis/dbsClient.py", line 474, in __callServer
self.http_response = method_func(self.url, method, params, data, request_headers)
File "/home/valya/venv/lib64/python3.9/site-packages/RestClient/RestApi.py", line 42, in post
return http_request(self._curl)
File "/home/valya/venv/lib64/python3.9/site-packages/RestClient/RequestHandling/HTTPRequest.py", line 62, in __call__
raise HTTPError(effective_url, http_code, http_response.msg, http_response.raw_header, http_response.body)
RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 400: Bad Request
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/valya/./dbs_test.py", line 25, in <module>
results = dbsApi.insertBulkBlock(blockDump=data)
File "/home/valya/venv/lib64/python3.9/site-packages/dbs/apis/dbsClient.py", line 647, in insertBulkBlock
result = self.__callServer("bulkblocks", data=blockDump, callmethod='POST' )
File "/home/valya/venv/lib64/python3.9/site-packages/dbs/apis/dbsClient.py", line 486, in __callServer
self.__parseForException(data)
File "/home/valya/venv/lib64/python3.9/site-packages/dbs/apis/dbsClient.py", line 524, in __parseForException
raise exp
File "/home/valya/venv/lib64/python3.9/site-packages/dbs/apis/dbsClient.py", line 522, in __parseForException
raise HTTPError(http_error.url, data['exception'], data['message'], http_error.header, http_error.body, serverCode(data))
RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 400: DBSError Code:128 Description:Not defined Function:dbs.bulkblocks.checkBlockExist Message:Block /RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-GenSimFull_TC_EL8_June2024_Val_Alanv7-v11/GEN-SIM#6d85e039-0943-43d2-befc-1f81a4d64400 already exists Error: nil
It can also be tested using a plain curl call, e.g. (note that -d /tmp/dbsuploader_block.json sends the literal string rather than the file contents, hence the "invalid character '/'" parsing error below; -d @/tmp/dbsuploader_block.json would send the file itself):
scurl -v https://cmsweb-testbed:8443/dbs/int/global/DBSWriter/bulkblocks -H "Content-type: application/json" -d /tmp/dbsuploader_block.json
...
< HTTP/2 400
...
[{"error":{"reason":"invalid character '/' looking for beginning of value","message":"","function":"dbs.bulkblocks.InsertBulkBlocksConcurrently","code":122,"stacktrace":"\ngoroutine 1593 [running]:\ngithub.com/dmwm/dbs2go/dbs.Error({0xb2c0c0?, 0xc000146240?}, 0x7a, {0x0, 0x0}, {0xa5c044, 0x2b})\n\t/go/src/github.com/vkuznet/dbs2go/dbs/errors.go:185 +0x99\ngithub.com/dmwm/dbs2go/dbs.(*API).InsertBulkBlocksConcurrently(0xc000446070)\n\t/go/src/github.com/vkuznet/dbs2go/dbs/bulkblocks2.go:497 +0x3c5\ngithub.com/dmwm/dbs2go/web.DBSPostHandler({0xb2f790, 0xc0001461f8}, 0xc000610c60, {0xa3e07d, 0xa})\n\t/go/src/github.com/vkuznet/dbs2go/web/handlers.go:562 +0x109e\ngithub.com/dmwm/dbs2go/web.BulkBlocksHandler({0xb2f790?, 0xc0001461f8?}, 0xc0002384f0?)\n\t/go/src/github.com/vkuznet/dbs2go/web/handlers.go:978 +0x3b\nnet/http.HandlerFunc.ServeHTTP(0x0?, {0xb2f790?, 0xc0001461f8?}, 0x11?)\n\t/usr/local/go/src/net/http/server.go:2171 +0x29\ngithub.com/dmwm/dbs2go/web.limitMiddleware.func1({0xb2f790?, 0xc0001461f8?}, 0xc* Connection #0 to host cmsweb-testbed left intact
00091e4f0?)\n\t/go/src/github.com/vkuznet/dbs2go/web/middlewares.go:110 +0x32\nnet/http.HandlerFunc.ServeHTTP(0xc0005ef0b0?, {0xb2f790?, 0xc0001461f8?}, 0xc0002a8540?)\n\t/usr/local/go/src/net/"},"http":{"method":"POST","code":400,"timestamp":"2024-07-12 14:07:31.158800617 +0000 UTC m=+1291.809024265","path":"/dbs/int/global/DBSWriter/bulkblocks","user_agent":"curl/7.76.1","x_forwarded_host":"cmsweb-testbed.cern.ch","x_forwarded_for":"188.184.103.72:60646, 188.184.103.72","remote_addr":"10.100.10.128:14670"},"exception":400,"type":"HTTPError","message":"DBSError Code:122 Description:DBS unable to parse JSON record Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message: Error: invalid character '/' looking for beginning of value"}]
To summarize, the DBS3Upload error revealed 4 different issues in different places. The underlying problem with DBS3Upload is that the DBS server sometimes reports an error, e.g. when the same block already exists, but this error was neither found in the DBS logs (as they never appeared in k8s) nor properly propagated back to the caller, due to the DBS codebase messing with the HTTP response and headers and to APS not properly reading the payload from the back-end DBS server. The DBS3Upload call itself was "successful" in DBS, meaning it was passed correctly to the DBS backend server and executed by it; we only failed to properly get the response back.
Final NOTE:
I discovered through this debugging process that the Go-based HTTP transport (in my case the one used by the reverse proxy) by default enables gzip encoding via the Accept-Encoding HTTP header, see https://cs.opensource.google/go/go/+/refs/tags/go1.22.5:src/net/http/transport.go;l=2618-2640 This is not the case with Apache, and most likely the rationale behind it is to force back-end servers to supply compressed data payloads. I want to document this feature to avoid further questions about why gzip encoding appears in APS by default while it is not present in the Apache FE.
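To see this behavior, one can put a trivial backend behind a Go reverse proxy and inspect the headers it receives. A minimal diagnostic sketch (the port and response body are arbitrary choices, not APS specifics):
#!/usr/bin/env python
# Print the Accept-Encoding header of each incoming request. Behind a
# Go reverse proxy such as APS, requests typically arrive with
# "Accept-Encoding: gzip" even if the original client never sent it.
from http.server import BaseHTTPRequestHandler, HTTPServer

class EchoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        enc = self.headers.get("Accept-Encoding", "<none>")
        body = f"Accept-Encoding seen by backend: {enc}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 9090), EchoHandler).serve_forever()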
I would like to add additional information about the patched versions:
- DBS server (dbs2go) v00.06.44, which is deployed on cmsweb-testbed
- APS 0.2.76, which is also deployed on cmsweb-testbed
Thank you for these details, Valentin. Once the fixes have been deployed in cmsweb-preprod, I think we should go ahead and run another couple of workflows.
In addition, we should scan WM central services logs once we have a full day of stable APS (no interventions, no updates, etc) in preprod.
Impact of the new feature
WM services in general

Is your feature request related to a problem? Please describe.
In the context of migrating services/nodes from SLC7 to Alma9, we have decided to adopt the APS frontend service instead of upgrading the Apache package.
APS redirect rules for pre-production can be found at: https://gitlab.cern.ch/cmsweb-k8s/services_config/-/blob/preprod/auth-proxy-server/config.json

Describe the solution you'd like
We need a thorough validation of the WM services as we prepare to migrate these settings/server to production as well. A non-exhaustive list of actions to be performed is:

Describe alternatives you've considered
None

Additional context
None