dmwm / CRABClient

runrange
14 stars 36 forks source link

CRABRestInterface parses curl stdout and stderr at same time #5159

Open belforte opened 2 years ago

belforte commented 2 years ago

in CRABRestInerface the output of curl is piped to stderr, this makes the code wrongly assume that the call to serve failed if the output contains a string reporting an HTTP error, like when task submission failed due to an HTTP error in TaskWorker. In following example task submission failed with the "typical" proxy error 502

tm_task_failure | Problem handling 220317_170502:cmsbot_crab_outLFNDirBase because of DBSReaderError Message: Can't contact DBS at https://cmsweb-prod.cern.ch/dbs/prod/global/DBSReader, got errors Instantiating DBS3Reader failed with HTTP Error 502: URL=https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/serverinfo Code=502 Message=Proxy Error Header=HTTP/1.1 502 Proxy Error Date: Thu, 17 Mar 2022 17:14:10 GMT Server: Apache Content-Length: 497 Content-Type: text/html; charset=iso-8859-1 Body= Proxy Error The proxy server received an invalid response from an upstream server. The proxy server could not handle the request GET /auth/complete/dbs/prod/global/DBSReader/serverinfo.

but crab status just prints:

belforte@lxplus785/bot> crab status 
Fatal error trying to connect to https://cmsweb-test1.cern.ch:8443/crabserver/dev/task?subresource=search&workflow=220317_170502%3Acmsbot_crab_outLFNDirBase using subresource=search&workflow=220317_170502%3Acmsbot_crab_outLFNDirBase.
The server answered with an error.
belforte@lxplus785/bot> 

problematic code is (I believe) https://github.com/dmwm/CRABClient/blob/37caaf2fbfdec47cecc7c9a5820f14e3252ca11e/src/python/CRABClient/CrabRestInterface.py#L200-L208 combined with https://github.com/dmwm/CRABClient/blob/37caaf2fbfdec47cecc7c9a5820f14e3252ca11e/src/python/CRABClient/CrabRestInterface.py#L58-L79

I am surprised that this did not appear earlier. Maybe it is another side effect of new message handling in REST (the b64 saga) ? the message is 11.8K characters, maybe earlier it was truncated and this made it formatted in such a way not to tool parseResponseHeader ?

In any case we should not mix HTTP Header and Body in this way, need to understand why we have |tee /dev/stderr in https://github.com/dmwm/CRABClient/blob/37caaf2fbfdec47cecc7c9a5820f14e3252ca11e/src/python/CRABClient/CrabRestInterface.py#L200

belforte commented 2 years ago

the |tee /dev/stderr was added in https://github.com/ddaina/CRABClient/commit/80688c5b2a433047b57bc9fda19a3ab38639dafc in order to solve https://github.com/dmwm/CRABClient/issues/5067

It was a tricky point. And I am afraid we need to revisit. @mapellidario do you feel like helping ?

belforte commented 2 years ago

rats !! this happens also on tasks on production server using production client like https://cmsweb.cern.ch/crabserver/ui/task/220317_212302%3Amstojano_crab_K892_pT30 where I get after crab remake :

belforte@lxplus785/bug> crab --version
CRAB client v3.220210
belforte@lxplus785/bug> crab status crab_K892_pT30/
Fatal error trying to connect to https://cmsweb.cern.ch:8443/crabserver/prod/task?subresource=search&workflow=220317_212302%3Amstojano_crab_K892_pT30 using subresource=search&workflow=220317_212302%3Amstojano_crab_K892_pT30.
The server answered with an error.
belforte@lxplus785/bug> 

increasing priority :-(

belforte commented 2 years ago

I am of course pretty sure that in the past we had "sane outputs from crab status" also when DBS was returning HTTP 502. So is this maybe really due to change in TW ? Surely changing TW is a much faster fix than deploying new cllient.

belforte commented 2 years ago

hmm... the cmsbot example above was from crab-dev-tw01 where message is truncated to 7500 char. Also the truncate message is in any case added to the message, would not override the reporting of HTTP 502 at the beginning which is the problem here. I am puzzled.

belforte commented 2 years ago

if I remove the |tee /dev/stderr I get:

belforte@lxplus785/bot> crab status 
CRAB project directory:     /afs/cern.ch/work/b/belforte/CRAB3/TC3/dbg/bug/crab_K892_pT30
Task name:          220317_212302:mstojano_crab_K892_pT30
Grid scheduler - Task Worker:   N/A yet - crab-prod-tw01
Status on the CRAB server:  SUBMITFAILED
Task URL to use for HELP:   https://cmsweb.cern.ch/crabserver/ui/task/220317_212302%3Amstojano_crab_K892_pT30
Dashboard monitoring URL:   https://monit-grafana.cern.ch/d/cmsTMDetail/cms-task-monitoring-task-view?orgId=11&var-user=mstojano&var-task=220317_212302%3Amstojano_crab_K892_pT30&from=1647548582000&to=now
Failure message from server:    Problem handling 220317_212302:mstojano_crab_K892_pT30 because of DBSReaderError
                Message: Can't contact DBS at https://cmsweb-prod.cern.ch/dbs/prod/global/DBSReader, got errors Instantiating DBS3Reader failed with HTTP Error 502: 
                URL=https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/serverinfo
                Code=502
                Message=Proxy Error
                Header=HTTP/1.1 502 Proxy Error
                Date: Thu, 17 Mar 2022 21:23:35 GMT
                Server: Apache
                Content-Length: 497
                Content-Type: text/html; charset=iso-8859-1

                Body=<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
                <html><head>
                <title>502 Proxy Error</title>
                </head><body>
                <h1>Proxy Error</h1>
                <p>The proxy server received an invalid
                response from an upstream server.<br />
                The proxy server could not handle the request <em><a href="/auth/complete/dbs/prod/global/DBSReader/serverinfo">GET&nbsp;/auth/complete/dbs/prod/global/DBSReader/serverinfo</a></em>.<p>
                Reason: <strong>DNS lookup failure for: cmsweb-k8s-prodsrv.cern.ch</strong></p></p>
                </body></html>

                    ClassName : None
                    ModuleName : WMCore.Services.DBS.DBSErrors
                    MethodName : __init__
                    ClassInstance : None
                    FileName : /data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/crabtaskworker/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/lib/python3.8/site-packages/WMCore/Services/DBS/DBSErrors.py
                    LineNumber : 38
                    ErrorNr : 1002
                 failure, traceback follows
                Traceback (most recent call last):
                  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.7/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 445, in __callServer
                    self.http_response = method_func(self.url, method, params, data, request_headers)
                  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/py3-dbs3-pycurl/3.17.7-comp/lib/python3.8/site-packages/RestClient/RestApi.py", line 36, in get
                    return http_request(self._curl)
                  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/py3-dbs3-pycurl/3.17.7-comp/lib/python3.8/site-packages/RestClient/RequestHandling/HTTPRequest.py", line 62, in __call__
                    raise HTTPError(effective_url, http_code, http_response.msg, http_response.raw_header, http_response.body)
                RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 502: Proxy Error

                During handling of the above exception, another exception occurred:

                Traceback (most recent call last):
                  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/crabtaskworker/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/lib/python3.8/site-packages/WMCore/Services/DBS/DBSReader.py", line 19, in DBSReader
                    dbs.dbs.serverinfo()
                  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.7/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 1838, in serverinfo
                    return self.__callServer("serverinfo")
                  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.7/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 457, in __callServer
                    self.__parseForException(data)
                  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.7/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 483, in __parseForException
                    raise http_error
                RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 502: 
                URL=https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/serverinfo
                Code=502
                Message=Proxy Error
                Header=HTTP/1.1 502 Proxy Error
                Date: Thu, 17 Mar 2022 21:23:35 GMT
                Server: Apache
                Content-Length: 497
                Content-Type: text/html; charset=iso-8859-1

                Body=<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
                <html><head>
                <title>502 Proxy Error</title>
                </head><body>
                <h1>Proxy Error</h1>
                <p>The proxy server received an invalid
                response from an upstream server.<br />
                The proxy server could not handle the request <em><a href="/auth/complete/dbs/prod/global/DBSReader/serverinfo">GET&nbsp;/auth/complete/dbs/prod/global/DBSReader/serverinfo</a></em>.<p>
                Reason: <strong>DNS lookup failure for: cmsweb-k8s-prodsrv.cern.ch</strong></p></p>
                </body></html>

                During handling of the above exception, another exception occurred:

                Traceback (most recent call last):
                  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/crabtaskworker/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/lib/python3.8/site-packages/TaskWorker/Actions/Handler.py", line 80, in executeAction
                    output = work.execute(nextinput, task=self._task, tempDir=self.tempDir)
                  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/crabtaskworker/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/lib/python3.8/site-packages/TaskWorker/Actions/DBSDataDiscovery.py", line 243, in execute
                    result = self.executeInternal(*args, **kwargs)
                  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/crabtaskworker/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/lib/python3.8/site-packages/TaskWorker/Actions/DBSDataDiscovery.py", line 261, in executeInternal
                    self.dbs = DBSReader(dbsurl)
                  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/crabtaskworker/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/lib/python3.8/site-packages/WMCore/Services/DBS/DBSReader.py", line 23, in DBSReader
                    raise DBSReaderError("Can't contact DBS at %s, got errors %s" % (endpoint, msg))
                WMCore.Services.DBS.DBSErrors.DBSReaderError: DBSReaderError
                Message: Can't contact DBS at https://cmsweb-prod.cern.ch/dbs/prod/global/DBSReader, got errors Instantiating DBS3Reader failed with HTTP Error 502: 
                URL=https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/serverinfo
                Code=502
                Message=Proxy Error
                Header=HTTP/1.1 502 Proxy Error
                Date: Thu, 17 Mar 2022 21:23:35 GMT
                Server: Apache
                Content-Length: 497
                Content-Type: text/html; charset=iso-8859-1

                Body=<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
                <html><head>
                <title>502 Proxy Error</title>
                </head><body>
                <h1>Proxy Error</h1>
                <p>The proxy server received an invalid
                response from an upstream server.<br />
                The proxy server could not handle the request <em><a href="/auth/complete/dbs/prod/global/DBSReader/serverinfo">GET&nbsp;/auth/complete/dbs/prod/global/DBSReader/serverinfo</a></em>.<p>
                Reason: <strong>DNS lookup failure for: cmsweb-k8s-prodsrv.cern.ch</strong></p></p>
                </body></html>

                    ClassName : None
                    ModuleName : WMCore.Services.DBS.DBSErrors
                    MethodName : __init__
                    ClassInstance : None
                    FileName : /data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/crabtaskworker/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/lib/python3.8/site-packages/WMCore/Services/DBS/DBSErrors.py
                    LineNumber : 38
                    ErrorNr : 1002

Log file is /afs/cern.ch/work/b/belforte/CRAB3/TC3/dbg/bug/crab_K892_pT30/crab.log
belforte@lxplus785/bot> 
belforte commented 2 years ago

It really looks like TW used to upload a much more terse message :-( But at least in this way it prints the much needed: Status on the CRAB server: SUBMITFAILED

belforte commented 2 years ago

FWIW, the message in https://github.com/dmwm/CRABClient/issues/5159#issuecomment-1071740843 is "only" 7003 char long (from the DB, i.e. measured at https://github.com/dmwm/CRABClient/blob/37caaf2fbfdec47cecc7c9a5820f14e3252ca11e/src/python/CRABClient/Commands/status.py#L416 )

belforte commented 2 years ago

a quick test indicates that truncating message at 1K char is plenty. FYI @mapellidario I suggest to cut at 1K also in the famous PR for b64 https://github.com/dmwm/CRABServer/pull/7106

belforte commented 2 years ago

hmm... maybe WMCore changed and now reports the full http details instead of only one line like Message: Can't contact DBS at https://cmsweb-prod.cern.ch/dbs/prod/global/DBSReader, got errors Instantiating DBS3Reader failed with HTTP Error 502 ?

belforte commented 2 years ago

Dario and myself debugged this. The change is not due to WMCore, but to new DBSClient which prints a new exception message which (besides being too long https://github.com/dmwm/DBSClient/issues/68 ) contains the text from the http header Header=HTTP/1.1 402 or whatever exit code. This is eventually parsed by crab client and triggers the failure.

belforte commented 2 years ago

We have deployed a patched TaskWorker which does not upload such messages anymore. It replaces HTTP/1.1 with HTTP 1.1 which is safe for crab status. So this is not critical anymore. It may still be good to avoid parsing curl stdout for HTTP code, but we need to test well any change to CRABRestInterface, since all in all Daina may have had good reasons to use this https://github.com/dmwm/CRABClient/blob/37caaf2fbfdec47cecc7c9a5820f14e3252ca11e/src/python/CRABClient/CrabRestInterface.py#L200

belforte commented 2 years ago

I am inclined to do the following:

@mapellidario I welcome your opinion