dmwm / WMCore

Core workflow management components for CMS.

Validate WM services with APS frontends in testbed #12026

Open amaltaro opened 2 weeks ago

amaltaro commented 2 weeks ago

Impact of the new feature
WM services in general

Is your feature request related to a problem? Please describe.
In the context of migrating services/nodes from SLC7 to Alma9, we have decided to adopt the APS frontend service instead of upgrading the Apache package.

APS redirect rules for pre-production can be found at: https://gitlab.cern.ch/cmsweb-k8s/services_config/-/blob/preprod/auth-proxy-server/config.json

Describe the solution you'd like
We need a thorough validation of the WM services as we prepare to migrate these settings/servers to production as well. A non-exhaustive list of actions to be performed is:

Describe alternatives you've considered
None

Additional context
None

vkuznet commented 2 weeks ago

Regarding the first item, I scanned the APS logs for requests with a status not equal to 200, e.g.

grep proxy-resp /cephfs/testbed/aps-logs/auth-proxy-server.log_x509-proxy-server-*20240627 | egrep -v " 200 GET | 200 POST | 200 PUT |no-auth-cert| 200 HEAD | 200 DELETE " | wc -l
0

The no-auth-cert requests are from CERN IT penetration tests, and as you can see above, I found 0 requests outside of the 200 status. The total number of records for XPS (APS running on port 8443 with x509) is 313K. For the 20240626 date we have 564K entries in the XPS logs and again 0 requests with non-200 status. Therefore, I would declare the APS log checking complete.

For completeness, in the APS logs (port 443) we have 165K entries for 20240627, 67K for 20240626, and again 0 requests with non-200 status.
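By the way, here is a rough Python equivalent of the grep pipeline above (illustrative only; it assumes the proxy-resp lines carry "<status> <METHOD>" tokens as in the egrep pattern, and takes the log file path as an argument):

#!/usr/bin/env python
# Rough Python equivalent of the grep/egrep pipeline above (illustrative only).
# Counts proxy-resp lines whose status/method token is not "200 <METHOD>",
# skipping the no-auth-cert entries from CERN IT penetration tests.
import re
import sys

OK_PATTERN = re.compile(r" 200 (GET|POST|PUT|HEAD|DELETE) ")

count = 0
with open(sys.argv[1]) as logfile:
    for line in logfile:
        if "proxy-resp" not in line or "no-auth-cert" in line:
            continue
        if not OK_PATTERN.search(line):
            count += 1
print(count)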

vkuznet commented 2 weeks ago

Also, thanks to Nikodemas, we now have an initial monitoring dashboard for APS, which you can check out over here. We are adding more content to it, and once completed it will show performance metrics of the APS/XPS servers.

amaltaro commented 2 weeks ago

Thanks Valentin and Nikodemas! If you have fully validated any of the items in the check-list, feel free to tick them off.

And if you see anything else missing from the WM perspective, please share with us, such that we can add it to the check-list as well.

vkuznet commented 2 weeks ago

Alan, here is a report from all logs, including services and microservices, using today's logs (20240627):

reqmon 200 OK

workqueue ok

t0_reqmon_tasks 200 OK

t0_reqmon 200 OK

reqmon-tasks 200 OK

reqmon 200 OK

reqmgr2ms-ms-unmer-t2t3us 200 OK, but I saw
2024-06-27 00:58:50,587:ERROR:MSUnmerged: Could not fetch statistics from Rucio Consistency Monitor. Skipping the current run.

reqmgr2ms-ms-unmer-t2t3 200 OK, but I saw
2024-06-27 00:58:50,587:ERROR:MSUnmerged: Could not fetch statistics from Rucio Consistency Monitor. Skipping the current run.

reqmgr2ms-20240627-ms-unmer-t1 200 OK

reqmgr2ms-20240627-ms-transferor 200 OK, but I saw
2024-06-27 00:02:19,405:ERROR:PycurlRucio: Failure in getBlocksAndSizeRucio function for container /RelValTTbar_14TeV/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM. Response: {'url': 'http://xxx-rucio.cern.ch/dids/cms/dids/search?type=dataset&long=True&name=/RelValTTbar_14TeV/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM%23%2A', 'data': '', 'headers': 'HTTP/1.1 200 OK\r\nDate: Thu, 27 Jun 2024 00:02:19 GMT\r\nContent-Type: application/x-json-stream\r\nContent-Length: 0\r\nConnection: keep-alive\r\nAccess-Control-Allow-Origin: None\r\nAccess-Control-Allow-Headers: None\r\nAccess-Control-Allow-Methods: *\r\nAccess-Control-Allow-Credentials: true\r\nCache-Control: post-check=0, pre-check=0\r\nPragma: no-cache\r\nX-Rucio-Host: xxx-rucio.cern.ch\r\n\r\n'}

reqmgr2ms-20240627-ms-rulecleaner 200 OK

reqmgr2ms-20240627-ms-pileup-tasks 200 OK

reqmgr2ms-20240627-ms-pileup 200 OK, but
grep -v "200 OK" /cephfs/testbed/dmwm-logs/reqmgr2ms-20240627-ms-pileup-596fb7bdc7-lgrpd.log | grep -i HTTP

...
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/MicroService/MSPileup/MSPileupData.py", line 157, in createPileup
    obj = MSPileupObj(pdict, validRSEs=rseList)
  File "/usr/local/lib/python3.8/site-packages/WMCore/MicroService/MSPileup/DataStructs/MSPileupObj.py", line 71, in __init__
    raise Exception(msg)
Exception: MSPileup input is invalid, expectedRSEs value ['T1_US_FNAL_Disk', 'T2_CH_CERN'] is not in validRSEs ['T2_US_Nebraska', 'T2_US_UCSD', 'T2_US_Wisconsin', 'T2_US_Purdue']
....
[27/Jun/2024:18:50:39] ms-pileup-596fb7bdc7-lgrpd 188.185.123.179:28808 "POST /ms-pileup/data/pileup HTTP/1.1" 400 Bad Request [data: 1630 in 931 out 161494 us ] [auth: ok "Robot t1" "" ] [ref: "https://xxx-testbed.cern.ch" "PycURL/7.45.2 libcurl/7.74.0 OpenSSL/1.1.1n zlib/1.2.11 brotli/1.0.9 libidn2/2.3.0 libpsl/0.21.0 (+libidn2/2.3.0) libssh2/1.9.0 nghttp2/1.43.0 librtmp/2.3" ]
...
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/MicroService/MSPileup/MSPileupData.py", line 157, in createPileup
    obj = MSPileupObj(pdict, validRSEs=rseList)
  File "/usr/local/lib/python3.8/site-packages/WMCore/MicroService/MSPileup/DataStructs/MSPileupObj.py", line 71, in __init__
    raise Exception(msg)
Exception: MSPileup input is invalid, expectedRSEs value ['T1_US_FNAL_Disk', 'T2_CH_CERN'] is not in validRSEs ['T2_US_Nebraska', 'T2_US_UCSD', 'T2_US_Wisconsin', 'T2_US_Purdue']
...
[27/Jun/2024:18:52:00] ms-pileup-596fb7bdc7-lgrpd IPAddress:12756 "POST /ms-pileup/data/pileup HTTP/1.1" 400 Bad Request [data: 1622 in 931 out 90621 us ] [auth: ok "Robot t1" "" ] [ref: "https://xxx-testbed.cern.ch" "PycURL/7.45.2 libcurl/7.74.0 OpenSSL/1.1.1n zlib/1.2.11 brotli/1.0.9 libidn2/2.3.0 libpsl/0.21.0 (+libidn2/2.3.0) libssh2/1.9.0 nghttp2/1.43.0 librtmp/2.3" ]

reqmgr2ms-20240627-ms-output 200 OK, but there are rucio failures, e.g.
2024-06-27 19:01:26,788:DEBUG:connectionpool: http://xxx-rucio-int.cern.ch:port "GET /rses/?expression=rse_type%3DTAPE%5C%28cms_type%3Dtest%7Crse%3DT1_RU_JINR_Tape%29 HTTP/1.1" 401 106

reqmgr2ms-20240627-ms-monitor 200 OK, but there are rucio failures, e.g.
2024-06-27 18:56:11,471:DEBUG:connectionpool: http://xxx-rucio.cern.ch:80 "GET /rules/ffd9fe4a39ff4b7abe26d8496d66874f HTTP/1.1" 404 116

reqmgr2-20240627-reqmgr2-tasks 200 OK

reqmgr2-20240627-reqmgr2 200 OK, but
[27/Jun/2024:07:57:34] reqmgr2-67b86b4556-gt426 IPAddress:13572 "DELETE /reqmgr2/data/transferinfo/spiga_task_BPH-RunIISummer20UL18GEN-Backfill-00262__v1_T_240622_115214_9893 HTTP/1.1" 411 Length Required [data: 1464 in - out 402 us ] [auth: ok "Robot: WmCore Service Account" "" ] [ref: "https://xxx-testbed.cern.ch" "WMCore.Services.Requests/v002" ]
...
[27/Jun/2024:14:21:58] reqmgr2-67b86b4556-gt426 IPAddress:50420 "GET /reqmgr2/data/wrong_endpoint HTTP/1.1" 404 Not Found [data: 5723 in 723 out 1013 us ] [auth: ok "Robot: CMS Build Bot" "" ] [ref: "https://cmsweb-testbed.cern.ch" "PycURL/7.45.1 libcurl/7.59.0 OpenSSL/1.0.2k zlib/1.2.8" ]
...
[27/Jun/2024:14:21:58] reqmgr2-67b86b4556-gt426 IPAddress:50428 "HEAD /reqmgr2/data/wrong_endpoint HTTP/1.1" 404 Not Found [data: 10123 in 723 out 937 us ] [auth: ok "Robot: CMS Build Bot" "" ] [ref: "https://xxx-testbed.cern.ch" "PycURL/7.45.1 libcurl/7.59.0 OpenSSL/1.0.2k zlib/1.2.8" ]
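Coming back to the ms-pileup 400 Bad Request entries above: the exception comes from MSPileup input validation, which is essentially a membership check of the requested RSEs against the configured valid list. A minimal sketch of that check (an illustration, not the actual MSPileupObj code):

#!/usr/bin/env python
# Minimal sketch of the RSE validation behind the ms-pileup exception above;
# this is an illustration, not the actual MSPileupObj implementation.
def validateRSEs(expectedRSEs, validRSEs):
    """Require every expected RSE to be a member of the valid RSE list."""
    missing = [rse for rse in expectedRSEs if rse not in validRSEs]
    if missing:
        msg = ("MSPileup input is invalid, expectedRSEs value %s "
               "is not in validRSEs %s" % (expectedRSEs, validRSEs))
        raise Exception(msg)

# Reproduces the failure seen in the ms-pileup log above:
try:
    validateRSEs(['T1_US_FNAL_Disk', 'T2_CH_CERN'],
                 ['T2_US_Nebraska', 'T2_US_UCSD', 'T2_US_Wisconsin', 'T2_US_Purdue'])
except Exception as exc:
    print(exc)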

Bottom line: I found only one "missing" API:

/reqmgr2/data/wrong_endpoint HTTP/1.1" 404 Not Found

and one DELETE API call with a wrong parameter:

DELETE /reqmgr2/data/transferinfo/spiga_task_BPH-RunIISummer20UL18GEN-Backfill-00262__v1_T_240622_115214_9893 HTTP/1.1" 411 Length Required

In this HTTP call the client did not provide a content length, hence the 411 Length Required. It seems the issue is in the WM client, which improperly constructs the HTTP request, and the APS server rejects it per the HTTP specification.
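For illustration, here is a minimal pycurl sketch (hypothetical endpoint path and certificate locations, not the actual WMCore.Services.Requests code) of how a body-less DELETE can declare an explicit Content-Length to satisfy such a proxy:

#!/usr/bin/env python
# Minimal pycurl sketch (not the actual WMCore client code) showing how to
# avoid "411 Length Required" on a body-less DELETE: declare a zero-length
# body explicitly. The URL and certificate paths are placeholders.
from io import BytesIO
import pycurl

buf = BytesIO()
curl = pycurl.Curl()
curl.setopt(curl.URL, "https://cmsweb-testbed.cern.ch/reqmgr2/data/transferinfo/SOME_TASK")
curl.setopt(curl.CUSTOMREQUEST, "DELETE")
# Without an explicit Content-Length some strict proxies reject a body-less
# request per the HTTP spec; advertising a zero-length body satisfies them.
curl.setopt(curl.HTTPHEADER, ["Content-Length: 0"])
curl.setopt(curl.SSLCERT, "/path/to/usercert.pem")
curl.setopt(curl.SSLKEY, "/path/to/userkey.pem")
curl.setopt(curl.WRITEDATA, buf)
curl.perform()
print(curl.getinfo(curl.RESPONSE_CODE), buf.getvalue()[:200])
curl.close()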

vkuznet commented 2 weeks ago

Now moving slowly through the component logs. WMAgent logs:

http.client.HTTPException: url=https://xxxx-testbed.cern.ch:port/couchdb/wmstats/_design/WMStats/_view/byAgentURL?reduce=false&stale=update_after, code=503, reason=Service Unavailable

I checked this URL just now and it works; the error predates the fix of the byAgentURL redirect rule in APS.

- ArchiveDataReporter reports the following error:

url=https://cmsweb-testbed.cern.ch:8443/wmarchive/data/, code=502,

the actual error looks like this:

2023-11-08 15:40:46,352:139888411641600:ERROR:ArchiveDataPoller:url=https://xxxx-testbed.cern.ch:port/wmarchive/data/, code=502, reason=Proxy Error, headers={'Date': 'Wed, 08 Nov 2023 14:38:35 GMT', 'Server': 'Apache', 'Content-Length': '556', 'Content-Type': 'text/html; charset=iso-8859-1'}, result=(HTML error page):

502 Proxy Error
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request POST /auth/complete/wmarchive/data/.
Reason: Error reading from remote server
Additionally, a 100 Continue error was encountered while trying to use an ErrorDocument to handle the request.

From the above error I see that the service used `/auth/complete/wmarchive/data` instead of `/wmarchive/data`. But it also says Server: Apache, and I don't know where that comes from. Do we have an additional Apache somewhere?
- DBS3Upload is ok, there are no HTTP exceptions
- ErrorHandler is ok, but shows

2023-11-20 15:45:24,887:139888412690176:WARNING:GenericDataCache:Passive failure while looking data up in the memory cache. Error: url=https://xxxx-testbed.cern.ch:port/reqmgr2/data/wmagentconfig/abc.cern.ch, code=500, reason=Internal Server Error, headers={'Date': 'Tue, 30 Apr 2024 07:23:54 GMT', 'Server': 'CherryPy/18.8.0', 'Content-Type': 'text/html;charset=utf-8', 'X-Rest-Status': '400', 'X-Error-Http': '500', 'X-Error-Id': 'db9df5d78f20f4fca074b802a0afa170', 'X-Error-Detail': 'Server error', 'X-Rest-Time': '20211.458 us', 'Content-Length': '742', 'Vary': 'Accept-Encoding', 'CMS-Server-Time': 'D=79233 t=1714461834294807', 'Connection': 'close'}, result=(HTML error page):

500 Internal Server Error
Server error
Powered by CherryPy 18.8.0

It seems that this error is not related to APS, as I checked it just now and it provides a correct HTTP response. From the error itself I conclude that it was an issue with the backend.
- JobAccountant is ok
- JobArchiver is ok, but shows an error with the backend service similar to the ErrorHandler one:

2024-04-30 09:28:57,974:140171168171776:WARNING:GenericDataCache:Passive failure while looking data up in the memory cache. Error: url=https://xxxx-testbed.cern.ch:port/reqmgr2/data/wmagentconfig/abc.cern.ch, code=500, reason=Internal Server Error, headers={'Date': 'Tue, 30 Apr 2024 07:28:57 GMT', 'Server': 'CherryPy/18.8.0', 'Content-Type': 'text/html;charset=utf-8', 'X-Rest-Status': '400', 'X-Error-Http': '500', 'X-Error-Id': '64891f55ea40c4b3ce52d0225b843c74', 'X-Error-Detail': 'Server error', 'X-Rest-Time': '90744.972 us', 'Content-Length': '742', 'Vary': 'Accept-Encoding', 'CMS-Server-Time': 'D=96524 t=1714462137876927', 'Connection': 'close'}, result=(HTML error page):

500 Internal Server Error
Server error
Powered by CherryPy 18.8.0

- JobCreator is ok
- JobStatusLite is ok
- JobSubmitter shows some errors with the `/reqmgr2/data/request` endpoint, but I checked it now and it seems fine (the timestamps of these errors are quite old, though):

2024-04-30 09:27:27,856:140171168433920:WARNING:GenericDataCache:Passive failure while looking data up in the memory cache. Error: url=https://xxxx-testbed.cern.ch:port/reqmgr2/data/request?status=aborted&status=aborted-completed&status=force-complete&detail=False, code=500, reason=Internal Server Error, headers={'Date': 'Tue, 30 Apr 2024 07:27:27 GMT', 'Server': 'CherryPy/18.8.0', 'Content-Type': 'text/html;charset=utf-8', 'X-Rest-Status': '400', 'X-Error-Http': '500', 'X-Error-Id': 'fd2a75f531723f03527b3445316cd05d', 'X-Error-Detail': 'Server error', 'X-Rest-Time': '36428.452 us', 'Content-Length': '742', 'Vary': 'Accept-Encoding', 'CMS-Server-Time': 'D=42437 t=1714462047813356', 'Connection': 'close'}, result=(HTML error page):

500 Internal Server Error
Server error
Powered by CherryPy 18.8.0

amaltaro commented 2 weeks ago

This has already been privately shared with Valentin, but just to record it here as well. Harvesting jobs uploading root files to cmsweb-testbed are actually failing with:

 => URL: https://cmsweb-testbed.cern.ch/dqm/dev
urllib.error.URLError: <urlopen error [Errno 104] Connection reset by peer>

Meanwhile, other workflows uploading root files to the cmsweb endpoint are successful.

Here is a job log for the failed harvesting job (from vocms0192): https://amaltaro.web.cern.ch/forValentin/wmagentJob.log

amaltaro commented 2 weeks ago

Another problem, now involving the DBS server. I experienced a similar (or maybe the same) error yesterday when migrating datasets from production to testbed. Now I found this error in DBS3Upload, during a block injection operation:

2024-06-28 13:27:12,313:139769517713152:ERROR:DBSUploadPoller:Hit a general exception while inserting block /RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-RecoNano_2021_TC_EL8_June2024_Val_Alanv2-v11/DQMIO#a94ce7bb-7337-4512-af1e-9564c4980465. Error: (92, 'HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMComponent/DBS3Buffer/DBSUploadPoller.py", line 94, in uploadWorker
    dbsApi.insertBulkBlock(blockDump=block)
  File "/usr/local/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 647, in insertBulkBlock
    result =  self.__callServer("bulkblocks", data=blockDump, callmethod='POST' )
  File "/usr/local/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 474, in __callServer
    self.http_response = method_func(self.url, method, params, data, request_headers)
  File "/usr/local/lib/python3.8/site-packages/RestClient/RestApi.py", line 42, in post
    return http_request(self._curl)
  File "/usr/local/lib/python3.8/site-packages/RestClient/RequestHandling/HTTPRequest.py", line 56, in __call__
    curl_object.perform()
pycurl.error: (92, 'HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)')
vkuznet commented 2 weeks ago

> Another problem, now involving the DBS server. [...] Now I found this error in DBS3Upload, during a block injection operation: Error: (92, 'HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)')

It seems this issue is resolved with the recent upgrade of APS, which includes proper handling of gzipped responses from the BE server; see this commit: https://github.com/dmwm/auth-proxy-server/commit/5062c5ac04558a52efa55c9a77d2567200b10d04 The new APS version which does not show this error is 0.2.74.

vkuznet commented 2 weeks ago

> This has already been privately shared with Valentin, but just to record it here as well. Harvesting jobs uploading root files to cmsweb-testbed are actually failing with: urllib.error.URLError: <urlopen error [Errno 104] Connection reset by peer> [...]

This seems to be caused by incorrect redirect rules. Aroosha changed the DQM BE from vocms0731 to vocms0730 and did not propagate the change to APS. I updated the APS configs with the new BE nodes, and we should verify whether this issue is gone. @amaltaro it would be nice if you could re-run the workflows, as I don't see any updates to wmagentJob.log on vocms0192.

amaltaro commented 5 days ago

I am going through the component logs on the vocms0261 agent, and the only issue I found relates to the DBS3Upload component, which I thought was no longer supposed to be an issue:

2024-07-09 15:24:00,528:140329280993024:INFO:DBSUploadPoller:About to call insert block for: /RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-RecoNano_2021_SC_EL8_June2024_Val_Alanv6-v11/NANOAODSIM#b11adb0d-cb13-4f7c-8b9a-ca8dcc0d1892
2024-07-09 15:24:01,306:140329280993024:ERROR:DBSUploadPoller:Hit a general exception while inserting block /RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-RecoNano_2021_SC_EL8_June2024_Val_Alanv6-v11/GEN-SIM-RECO#125bac2f-5ea7-4dc4-be5d-2f044b4a2508. Error: (92, 'HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMComponent/DBS3Buffer/DBSUploadPoller.py", line 94, in uploadWorker
    dbsApi.insertBulkBlock(blockDump=block)
  File "/usr/local/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 647, in insertBulkBlock
    result =  self.__callServer("bulkblocks", data=blockDump, callmethod='POST' )
  File "/usr/local/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 474, in __callServer
    self.http_response = method_func(self.url, method, params, data, request_headers)
  File "/usr/local/lib/python3.8/site-packages/RestClient/RestApi.py", line 42, in post
    return http_request(self._curl)
  File "/usr/local/lib/python3.8/site-packages/RestClient/RequestHandling/HTTPRequest.py", line 56, in __call__
    curl_object.perform()
pycurl.error: (92, 'HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)')

I also saw an error with CouchDB replication yesterday afternoon around 6pm CERN time, but that self-recovered (maybe there was another APS intervention yesterday).

@vkuznet can you please check whether the fix for compressed responses got lost in your pipeline and/or whether you deployed an outdated set of rules?

amaltaro commented 5 days ago

@vkuznet other than the transient DBS3Upload (DBS server) issue, I think we have covered all the validation items in this ticket. Please let me know if you see anything missing and/or needing extra attention; otherwise, we might as well close this out.

vkuznet commented 3 days ago

I finally identified the issue behind the cryptic DBS3Upload message pycurl.error: (92, 'HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)'). It turns out we had four distinct issues:

  1. On the k8s infrastructure, DBS logs were somehow no longer being written, due to a permission issue on the mounted cephfs area. We debugged this with Aroosha and came up with the solution provided here: https://github.com/dmwm/CMSKubernetes/pull/1522
  2. An issue in the DBS codebase, which improperly wrote the HTTP reply when a DBS request failed: the incoming HTTP request asks for gzip encoding, but the DBS server wrote the error JSON back as plain data, which caused another mysterious message in the APS logs: httputil: ReverseProxy read error during body copy: gzip: invalid header. I fixed this directly in the DBS codebase via the following commit: https://github.com/dmwm/dbs2go/commit/8effd5a6bcb1c5b169348e3ac886891ad3aa1a2a (a minimal illustration of this encoding mismatch follows after this list)
  3. APS put the wrong HTTP status code in its log: it recorded its own status code instead of the one returned by the back-end server. This led to 200 codes in the APS logs while the BE reported a 400 code. The fix was made in the APS logging module, see the following commit: https://github.com/dmwm/auth-proxy-server/commit/6c486c7aeb299116b02049cb239d6b14105c869c
  4. An APS code issue with reading the HTTP response payload and handling it properly in the reverse proxy module, i.e. if the HTTP request asked for gzip data and the BE server fails, it should return gzip-encoded JSON and APS should read it with a gzip reader. This was fixed in the following commits:
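To illustrate items 2 and 4 above, here is a minimal Python sketch (the actual APS and DBS code are Go; this is only an illustration) of why the encoding mismatch breaks the reader: the body must be gunzipped if and only if the Content-Encoding header says so.

#!/usr/bin/env python
# Minimal illustration of the gzip encoding mismatch described above
# (not the APS/DBS code, which is written in Go).
import gzip

def read_body(raw_bytes, headers):
    """Decode an HTTP response body according to its Content-Encoding header."""
    if headers.get("Content-Encoding", "").lower() == "gzip":
        # Raises gzip.BadGzipFile ("Not a gzipped file"), the Python analogue
        # of Go's "gzip: invalid header", if the payload is actually plain.
        return gzip.decompress(raw_bytes)
    return raw_bytes

plain_error = b'{"error": "block already exists"}'
print(read_body(plain_error, {}))  # correct: plain body, no gzip header
try:
    # broken case (what DBS did): plain body while the headers promise gzip
    read_body(plain_error, {"Content-Encoding": "gzip"})
except gzip.BadGzipFile as exc:
    print("decode failure:", exc)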

With these changes in place, I finally tested APS (the new version has been deployed on cmsweb-testbed) against the DBS3Upload code base using the following stand-alone code:

#!/usr/bin/env python

import json
from dbs.apis.dbsClient import DbsApi

ckey='/home/valya/.globus/userkey.pem'
cert='/home/valya/.globus/usercert.pem'

dbsUrl = 'https://cmsweb-testbed.cern.ch:8443/dbs/int/global/DBSWriter'
finput = '/tmp/dbsuploader_block.json'
useGzip=True # and I tested with False as well
dbsApi = DbsApi(url=dbsUrl, key=ckey, cert=cert, debug=0, useGzip=useGzip)
with open(finput, 'r') as istream:
    data = json.loads(istream.read())
    results = dbsApi.insertBulkBlock(blockDump=data)
    print(results)

When this code is executed, it now properly reports the 400 status code and prints a proper DBS client exception, e.g.

python ./dbs_test.py
DBS Server error: [{'error': {'reason': 'nil', 'message': 'Block /RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-GenSimFull_TC_EL8_June2024_Val_Alanv7-v11/GEN-SIM#6d85e039-0943-43d2-befc-1f81a4d64400 already exists', 'function': 'dbs.bulkblocks.checkBlockExist', 'code': 128, 'stacktrace': '\ngoroutine 1211 [running]:\ngithub.com/dmwm/dbs2go/dbs.Error({0x0?, 0x0?}, 0x80, {0xc000636aa0, 0x91}, {0xa50b2c, 0x1e})\n\t/go/src/github.com/vkuznet/dbs2go/dbs/errors.go:185 +0x99\ngithub.com/dmwm/dbs2go/dbs.checkBlockExist({0xc00053b380, 0x7c}, {0xc00042d440, 0x40})\n\t/go/src/github.com/vkuznet/dbs2go/dbs/bulkblocks2.go:457 +0x20c\ngithub.com/dmwm/dbs2go/dbs.(*API).InsertBulkBlocksConcurrently(0xc00050e150)\n\t/go/src/github.com/vkuznet/dbs2go/dbs/bulkblocks2.go:526 +0x705\ngithub.com/dmwm/dbs2go/web.DBSPostHandler({0xb2f790, 0xc00011cab0}, 0xc000592c60, {0xa3e07d, 0xa})\n\t/go/src/github.com/vkuznet/dbs2go/web/handlers.go:562 +0x109e\ngithub.com/dmwm/dbs2go/web.BulkBlocksHandler({0xb2f790?, 0xc00011cab0?}, 0xc00068fcb0?)\n\t/go/src/github.com/vkuznet/dbs2go/web/handlers.go:978 +0x3b\nnet/http.HandlerFunc.ServeHTTP(0x0?, {0xb2f790?, 0xc00011cab0?}, 0x11?)\n\t/usr/local/go/src/net/http/server.go:2171 +0x29\ngithub.com/dmwm/dbs2go/web.limitMiddleware.func1({0xb2f790?, 0xc00011cab0?}, 0xc00007aff0?)\n\t/go/src/github.com/vkuznet/'}, 'http': {'method': 'POST', 'code': 400, 'timestamp': '2024-07-12 14:01:47.313558311 +0000 UTC m=+962.019408520', 'path': '/dbs/int/global/DBSWriter/bulkblocks', 'user_agent': 'DBSClient/Unknown/', 'x_forwarded_host': 'cmsweb-testbed.cern.ch', 'x_forwarded_for': '188.184.103.72:49182, 188.184.103.72', 'remote_addr': '10.100.148.128:49984'}, 'exception': 400, 'type': 'HTTPError', 'message': 'DBSError Code:128 Description:Not defined Function:dbs.bulkblocks.checkBlockExist Message:Block /RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-GenSimFull_TC_EL8_June2024_Val_Alanv7-v11/GEN-SIM#6d85e039-0943-43d2-befc-1f81a4d64400 already exists Error: nil'}]
Traceback (most recent call last):
  File "/home/valya/venv/lib64/python3.9/site-packages/dbs/apis/dbsClient.py", line 474, in __callServer
    self.http_response = method_func(self.url, method, params, data, request_headers)
  File "/home/valya/venv/lib64/python3.9/site-packages/RestClient/RestApi.py", line 42, in post
    return http_request(self._curl)
  File "/home/valya/venv/lib64/python3.9/site-packages/RestClient/RequestHandling/HTTPRequest.py", line 62, in __call__
    raise HTTPError(effective_url, http_code, http_response.msg, http_response.raw_header, http_response.body)
RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 400: Bad Request

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/valya/./dbs_test.py", line 25, in <module>
    results = dbsApi.insertBulkBlock(blockDump=data)
  File "/home/valya/venv/lib64/python3.9/site-packages/dbs/apis/dbsClient.py", line 647, in insertBulkBlock
    result =  self.__callServer("bulkblocks", data=blockDump, callmethod='POST' )
  File "/home/valya/venv/lib64/python3.9/site-packages/dbs/apis/dbsClient.py", line 486, in __callServer
    self.__parseForException(data)
  File "/home/valya/venv/lib64/python3.9/site-packages/dbs/apis/dbsClient.py", line 524, in __parseForException
    raise exp
  File "/home/valya/venv/lib64/python3.9/site-packages/dbs/apis/dbsClient.py", line 522, in __parseForException
    raise HTTPError(http_error.url, data['exception'], data['message'], http_error.header, http_error.body, serverCode(data))
RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 400: DBSError Code:128 Description:Not defined Function:dbs.bulkblocks.checkBlockExist Message:Block /RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-GenSimFull_TC_EL8_June2024_Val_Alanv7-v11/GEN-SIM#6d85e039-0943-43d2-befc-1f81a4d64400 already exists Error: nil

It can also be tested with a plain curl call, e.g. (note that without an `@` prefix, `-d /tmp/dbsuploader_block.json` makes curl send the literal path string rather than the file contents, which is why the server reports a JSON parse error below; either way, the 400 status and the DBS error are now propagated back correctly):

scurl -v https://cmsweb-testbed:8443/dbs/int/global/DBSWriter/bulkblocks -H "Content-type: application/json" -d /tmp/dbsuploader_block.json
...
< HTTP/2 400
...
[{"error":{"reason":"invalid character '/' looking for beginning of value","message":"","function":"dbs.bulkblocks.InsertBulkBlocksConcurrently","code":122,"stacktrace":"\ngoroutine 1593 [running]:\ngithub.com/dmwm/dbs2go/dbs.Error({0xb2c0c0?, 0xc000146240?}, 0x7a, {0x0, 0x0}, {0xa5c044, 0x2b})\n\t/go/src/github.com/vkuznet/dbs2go/dbs/errors.go:185 +0x99\ngithub.com/dmwm/dbs2go/dbs.(*API).InsertBulkBlocksConcurrently(0xc000446070)\n\t/go/src/github.com/vkuznet/dbs2go/dbs/bulkblocks2.go:497 +0x3c5\ngithub.com/dmwm/dbs2go/web.DBSPostHandler({0xb2f790, 0xc0001461f8}, 0xc000610c60, {0xa3e07d, 0xa})\n\t/go/src/github.com/vkuznet/dbs2go/web/handlers.go:562 +0x109e\ngithub.com/dmwm/dbs2go/web.BulkBlocksHandler({0xb2f790?, 0xc0001461f8?}, 0xc0002384f0?)\n\t/go/src/github.com/vkuznet/dbs2go/web/handlers.go:978 +0x3b\nnet/http.HandlerFunc.ServeHTTP(0x0?, {0xb2f790?, 0xc0001461f8?}, 0x11?)\n\t/usr/local/go/src/net/http/server.go:2171 +0x29\ngithub.com/dmwm/dbs2go/web.limitMiddleware.func1({0xb2f790?, 0xc0001461f8?}, 0xc* Connection #0 to host cmsweb-testbed left intact
00091e4f0?)\n\t/go/src/github.com/vkuznet/dbs2go/web/middlewares.go:110 +0x32\nnet/http.HandlerFunc.ServeHTTP(0xc0005ef0b0?, {0xb2f790?, 0xc0001461f8?}, 0xc0002a8540?)\n\t/usr/local/go/src/net/"},"http":{"method":"POST","code":400,"timestamp":"2024-07-12 14:07:31.158800617 +0000 UTC m=+1291.809024265","path":"/dbs/int/global/DBSWriter/bulkblocks","user_agent":"curl/7.76.1","x_forwarded_host":"cmsweb-testbed.cern.ch","x_forwarded_for":"188.184.103.72:60646, 188.184.103.72","remote_addr":"10.100.10.128:14670"},"exception":400,"type":"HTTPError","message":"DBSError Code:122 Description:DBS unable to parse JSON record Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message: Error: invalid character '/' looking for beginning of value"}]

To summarize, the DBS3Upload error revealed 4 different issues in different places. The underlying problem is that sometimes the DBS server reports an error, e.g. when the same block already exists, but this error was neither found in the DBS logs (as they never appeared on k8s) nor properly propagated back to the caller, due to the DBS codebase messing up the HTTP response and headers and to APS not properly reading the payload from the back-end DBS server. The DBS3Upload call itself was "successful" as far as DBS is concerned, meaning it was passed correctly to the DBS backend server and executed by it; we only failed to properly get the response back.

Final NOTE: through this debugging process I discovered that the Go HTTP transport (which a Go-based reverse proxy such as APS uses for its requests to back-end servers) by default adds gzip to the Accept-Encoding HTTP header, see https://cs.opensource.google/go/go/+/refs/tags/go1.22.5:src/net/http/transport.go;l=2618-2640 This is not the case with Apache, and most likely the rationale is to encourage back-end servers to supply compressed payloads. I want to document this behavior here to avoid further questions about why gzip encoding appears with APS by default while it is not present with the Apache FE.
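If one wants to verify this behavior, a tiny debug backend that echoes the Accept-Encoding header it receives can be placed behind APS and behind the Apache FE for comparison. A minimal sketch (the port and binding are arbitrary choices, not part of any deployment):

#!/usr/bin/env python
# Tiny debug backend: echoes the Accept-Encoding header each proxy forwards.
# Put it behind APS and behind the Apache FE and compare the outputs.
from http.server import BaseHTTPRequestHandler, HTTPServer

class EchoAcceptEncoding(BaseHTTPRequestHandler):
    def do_GET(self):
        body = ("Accept-Encoding: %s\n" % self.headers.get("Accept-Encoding")).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), EchoAcceptEncoding).serve_forever()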

vkuznet commented 3 days ago

I would like to add some information about the patched versions:

amaltaro commented 3 days ago

Thank you for these details, Valentin. Once the fixes have been deployed on cmsweb-preprod, I think we should go ahead and run another couple of workflows.

In addition, we should scan the WM central services logs once we have a full day of stable APS (no interventions, no updates, etc.) in preprod.