dmwm / CRAB2

CRAB2
2 stars 11 forks source link

GLiteStatusQuery fails when aborted reason contains double quote #716

Closed ericvaandering closed 11 years ago

ericvaandering commented 11 years ago

Original Savannah ticket 83874 reported by belforte on Fri Jul 1 03:24:14 2011.

if some of the args returned by LB query (e.g. Reason) contains double quote (e.g. messages like

) GLitesStatusQuery.py returns invalid json.

This is a tested patch that fixes this: RCS file: /cvs_server/repositories/CMSSW/COMP/PRODCOMMON/src/python/ProdCommon/BossLite/Scheduler/GLiteStatusQuery.py,v retrieving revision 1.16 diff -r1.16 GLiteStatusQuery.py 58c58,60

< tmp = tmp.replace('\'','"')

> tmp = tmp.replace('\'','\v') > tmp = tmp.replace('"','\'') > tmp = tmp.replace('\v','"') belforte@lxplus306/Scheduler>

Proper solution requires more work ad discussed in mail

But I do not like that those format problems are addressed via myJSONEncoder() which apparently may not be called. If this code is needed, should not rely on WMCore.Wrappers to not have it before using it. Why is this try needed ? Since in the end it looks we always use myJSONEncoder and it works, I suggest to get rid of this asap.

try : from WMCore.Wrappers import JsonWrapper as json except: json = myJSONEncoder()

ericvaandering commented 11 years ago

Comment by mcinquil on Thu Jul 7 04:02:11 2011

I cannot replicate. For the moment I am forcing to put the reason that gave the failure into GLiteStatusQuery (since to get that reason from real I would have to reprodice the cream failure).

[lxplus421] ~/JDL/test $ python2.4 Python 2.4.3 (#1, May 6 2011, 09:25:14) [GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from ProdCommon.BossLite.Common.System import executeCommand >>> outJson, ret = executeCommand('python2.4 /afs/cern.ch/user/m/mcinquil/JDL/test/GLiteStatusQuery.py --jobId=https://wms202.cern.ch:9000/DFhEdRMheAXWCGYOEzIJ-A') >>> print outJson {"https://wms202.cern.ch:9000/DFhEdRMheAXWCGYOEzIJ-A&quot;: {"status": "SW", "scheduledAtSite": null, "service": "https://wms202.cern.ch:7443/glite_wms_wmproxy_server&quot;, "statusScheduler": "Waiting", "destination": "", "statusReason": "Transfer to CREAM failed due to exception: CREAM Register returned error "MethodName=[jobRegister] Timestamp=[Wed 29 Jun 2011 17:06:33] ErrorCode=[0] Description=[system error] FaultCause=[org.glite.ce.creamapi.jobmanagement.JobManagementException: org.glite.ce.common.db.DatabaseException: Rollback executed due to: Lock wait timeout exceeded; try restarting transaction]", "schedulerParentId": null, "schedulerId": "https://wms202.cern.ch:9000/DFhEdRMheAXWCGYOEzIJ-A&quot;, "lbTimestamp": "1310024582", "startTime": null, "stopTime": null}} >>> from WMCore.Wrappers import JsonWrapper as json >>> json.loads(outJson) {u'https://wms202.cern.ch:9000/DFhEdRMheAXWCGYOEzIJ-A': {u'status': u'SW', u'scheduledAtSite': None, u'service': u'https://wms202.cern.ch:7443/glite_wms_wmproxy_server', u'statusScheduler': u'Waiting', u'destination': u'', u'statusReason': u'Transfer to CREAM failed due to exception: CREAM Register returned error "MethodName=[jobRegister] Timestamp=[Wed 29 Jun 2011 17:06:33] ErrorCode=[0] Description=[system error] FaultCause=[org.glite.ce.creamapi.jobmanagement.JobManagementException: org.glite.ce.common.db.DatabaseException: Rollback executed due to: Lock wait timeout exceeded; try restarting transaction]', u'schedulerParentId': None, u'schedulerId': u'https://wms202.cern.ch:9000/DFhEdRMheAXWCGYOEzIJ-A', u'lbTimestamp': u'1310024582', u'startTime': None, u'stopTime': None}}

In the end: myJSONEncoder is never used (I have no idea why it should there a part of the comment in the code 'load ad-hoc JSON encoder if there are some problems with WMCore').

Can you give us the complete logging info of the job that gave the failure? One dangerous thing I can see it the string composition begin done into getAbortReason to get the statusReason.

ericvaandering commented 11 years ago

Comment by belforte on Thu Jul 7 04:29:27 2011

I will investigate. At first sight sounds like in your test this works from WMCore.Wrappers import JsonWrapper as json while in the CrabServer instance I was using it was falling back to myJSONEncoder()

Clearly there are (or at least there have been) situations where myJSONEncoder was being used, given that already fixes like the one for "don't" were put there, and tested to work. And this may boil down to be a problem with dependencies and such. Getting better dependencies is surely better then patching code, though it may be a bit more work

It is strange that in your outjson of the two " in the CREAM report one is escaped with " and one is vanished. Now I try to find the details

stefano

ericvaandering commented 11 years ago

Comment by mcinquil on Thu Jul 7 04:35:11 2011

Can you import correctly in the current server release WMCore.Wrappers.JsonWrapper package:

from WMCore.Wrappers import JsonWrapper as json

?

And if it works, which json package do you get with:

_module = "json" try: import cjson _module = "cjson" except: pass try: import json except: import simplejson as json

print _module, json

?

ericvaandering commented 11 years ago

Comment by belforte on Thu Jul 7 05:00:55 2011

seems it all works...

[crab@vocms22 CRABSERVER_1_1_3_old]$ source BeforeStart.sh [crab@vocms22 CRABSERVER_1_1_3_old]$ python Python 2.6.4 (r264:75706, Mar 31 2010, 15:05:02) [GCC 4.3.4] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from WMCore.Wrappers import JsonWrapper as json

no error and then: >>> print _module, json json <module 'json' from '/home/crab/CRABSERVER_1_1_3/sw_area/slc5_amd64_gcc434/external/python/2.6.4-cmp4/lib/python2.6/json/init.pyc'>

same no error on vocm21 (production crab server with 113) and

>>> print _module, json json <module 'json' from '/home/crab/sw_area/slc5_amd64_gcc434/external/python/2.6.4-cmp4/lib/python2.6/json/init.pyc'>

on vocms21 Now I am quite puzzled.

stefano

ericvaandering commented 11 years ago

Comment by belforte on Thu Jul 7 07:20:43 2011

as far as reproducing, the job in the first report is still available, so I can reproduce. I can give you a proxy to do it. But since info will be cleared from LB sometime soon anyhow, I attach the output of glite-wms-job-status and loggingInfo

(file #20293, file #20294)

ericvaandering commented 11 years ago

Comment by mcinquil on Thu Jul 7 07:26:50 2011

Can you give us the exact error you see? (I do not think I have seen it not in the e-mail not in this ticket) Do you get the error even executing this for that job?

from ProdCommon.BossLite.Common.System import executeCommand outJson, ret = executeCommand('python2.4 /afs/cern.ch/user/m/mcinquil/JDL/test/GLiteStatusQuery.py --jobId=https://wms202.cern.ch:9000/DFhEdRMheAXWCGYOEzIJ-A') print outJson from WMCore.Wrappers import JsonWrapper as json a = json.loads(outJson)

print a

If in this case works, while through SchedulerGLite gives an error, it means that "hackTheEnv" removes something and then we need further investigation because it shouldn't remove it (but just revert the PYTHONPATH removing duplicates).

ericvaandering commented 11 years ago

Comment by belforte on Thu Jul 7 07:41:24 2011

here it is a close to what SchedulerGlite.py does as I think I can make it w/i trying too hard. Of course needs my proxy to work, but I gave you one, this is only for recording of the error:

On a fresh lxplus shell, do:

belforte@lxplus401/~> source /afs/cern.ch/cms/LCG/LCG-2/UI/cms_ui_env.sh belforte@lxplus401/~> voms-proxy-init --voms cms Enter GRID pass phrase: Your identity: /C=IT/O=INFN/OU=Personal Certificate/L=Trieste/CN=Stefano Belforte Creating temporary proxy .......................................... Done Contacting lcg-voms.cern.ch:15002 [/DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch] "cms" Done Creating proxy ......................................................................... Done Your proxy is valid until Fri Jul 8 02:37:01 2011 belforte@lxplus401/~> python /afs/cern.ch/cms/ccs/wm/scripts/Crab/CRAB_2_7_8_patch1/external/ProdCommon/BossLite/Scheduler/GLiteStatusQuery.py --jobId https://wms221.cern.ch:9000/Dzj32jdr-iISabAISEjPuQ {"https://wms221.cern.ch:9000/Dzj32jdr-iISabAISEjPuQ&quot;: {"status": "A", "scheduledAtSite": null, "service": "https://wms221.cern.ch:7443/glite_wms_wmproxy_server&quot;, "statusScheduler": "Aborted", "destination": "node74.datagrid.cea.fr:8443/cream-pbs-cms", "statusReason": "Transfer to CREAM failed due to exception: CREAM Register returned error "MethodName=[jobRegister] Timestamp=[Wed 29 Jun 2011 17:06:33] ErrorCode=[0] Description=[system error] FaultCause=[org.glite.ce.creamapi.jobmanagement.JobManagementException: org.glite.ce.common.db.DatabaseException: Rollback executed due to: Lock wait timeout exceeded; try restarting transaction]".", "schedulerParentId": null, "schedulerId": "https://wms221.cern.ch:9000/Dzj32jdr-iISabAISEjPuQ&quot;, "lbTimestamp": "1309359995", "startTime": null, "stopTime": "1309359995"}} belforte@lxplus401/~>

the output is not json due to the extra "'s

ericvaandering commented 11 years ago

Comment by mcinquil on Thu Jul 7 08:12:57 2011

Still I would like to get rid od myJSONEnconder. To me the solution below works with python2.4 after even sourced only the glite ui.

RCS file: /cvs_server/repositories/CMSSW/COMP/PRODCOMMON/src/python/ProdCommon/BossLite/Scheduler/GLiteStatusQuery.py,v retrieving revision 1.16 diff -r1.16 GLiteStatusQuery.py 385c385

< try :

> try : 387,388c387,393 < except:

< json = myJSONEncoder()

> except Exception, ex: > try: > import json > except ImportError: > import simplejson as json > #json = myJSONEncoder()

ericvaandering commented 11 years ago

Comment by belforte on Mon Jul 11 05:01:08 2011

checkpoint, after some tests and investigations with Mattia:

Problem is confirmed and reproducible.

Getting rid of myJSONEncoder is a good goal, only one obstacle: import simplejson works for client since client tarball distributes it (it is not part of SL5 distribution of python2), but it is not available currently on crab server.

Mattia and Daniele should decide whether to add simplejson to CS (and how) or go for the dirty patch via myJSONEncoder

ericvaandering commented 11 years ago

Comment by mcinquil on Thu Jul 28 05:39:15 2011

I have found a solution almost general without the need to change rpm distribution (the solution is transparent if simplejson starts to be installed correctly), but keeping the myJSONEncoder (+1 if we get rid of it, but it doesn't seems to be possible without deploying simplejson).

Stefano, feel free to test this code: +verbatim+ class MyJSONEncoder: """ easy class able to transform a string representation of a python dictionary in a valid JSON output recognizable by simplejson """ def init(self): pass

def dumps(self, myString):
    &quot;&quot;&quot;
    the same interface as simplejson ...
    &quot;&quot;&quot;

    finalstring = {}
    for schedid in myString:
        finalstring[ schedid ] = {}
        for k in myString[schedid]:
            finalstring[schedid][str(k).replace('&quot;', '').replace(&quot;'&quot;,&quot;&quot;)] = str(myString[schedid][k]).replace('&quot;', '').replace(&quot;'&quot;,&quot;&quot;)
    tmp = str(finalstring)
    tmp = tmp.replace('\'','&quot;')
    tmp = tmp.replace('None','null')

    return tmp

-verbatim-

ericvaandering commented 11 years ago

Comment by belforte on Wed Aug 24 10:30:18 2011

OK this works in our case after changing MyJSONEncoder to myJSONEncoder :-) I have decided I will not try to outsmart you by writing something that parses dictionary rather then strings.

Please commit. Stefano

ericvaandering commented 11 years ago

Comment by mcinquil on Thu Aug 25 05:24:55 2011

Just committed. Consider that if you add a valid json package to the server setup then this is going to be used instead of the OurJSONEncoder.py

ps: I am leaving assigned to you so you can eventually cross-check and track it; I mark it has fixed but I am not closing it.

ericvaandering commented 11 years ago

Comment by belforte on Tue Oct 4 05:52:38 2011

This fix is not compatible with CrabServer, as reported offline the first time the job status is queried, TaskTracking component fails with an odd error and wipes clean the DataBase entry for the jobs. A different solution needs to be pursued.

Below is the example of calling the Status Query twice in a row in the CrabServer environment:

[crab@vocms20 JobTracking]$ python $PRODAGENT_ROOT/lib/JobTracking/QueryStatus.py 2 1:1 SchedulerGLite /tmp/del_proxies/e9abcec6847377280a53bb8dd6dc381b7aa5ab6b checking status of task 2 .... PARENT: https://lb006.cnaf.infn.it:9000/TBIlLqmtvAe4Ewo-TugHxQ /home/crab/CRABSERVER_1_1_4/sw_area/new/PRODCOMMON/lib/ProdCommon/Database/SafeSession.py:161: Warning: Data truncated for column 'start_time' at row 1 self.cursor.execute(query) /home/crab/CRABSERVER_1_1_4/sw_area/new/PRODCOMMON/lib/ProdCommon/Database/SafeSession.py:161: Warning: Data truncated for column 'stop_time' at row 1 self.cursor.execute(query) 1 https://lb006.cnaf.infn.it:9000/pOwrUn8Roktl6YrSamY-0A Scheduled SS unavailable 2011-10-01 20:31:00 [crab@vocms20 JobTracking]$

now issue same command again:

[crab@vocms20 JobTracking]$ python $PRODAGENT_ROOT/lib/JobTracking/QueryStatus.py 2 1:1 SchedulerGLite /tmp/del_proxies/e9abcec6847377280a53bb8dd6dc381b7aa5ab6b checking status of task 2 .... PARENT: null Error in 2 status query Log Event type : error Command : None description : error executing GLiteStatusQuery message : Error during API calls.

['skipping null : Unable to retrieve the status for: null\nJobId: bad argument (null)'] warnings : ["Errors: Error during API calls.\n\n['skipping null : Unable to retrieve the status for: null\nJobId: bad argument (null)']"]

[crab@vocms20 JobTracking]$

ericvaandering commented 11 years ago

Comment by belforte on Wed Oct 5 05:08:20 2011

I am proposing here a different solution, that should prevent the problem with crab server. Since issue was coming from CREAM CE inserting messages with quotes " in the abortReason, I am leaving everything as it is in myJSONEncoder (for bad/ugly that it may be), i.e. reverting GLiteStatusQuery.py to rev. 1.16 and on that simply sanitize directly the abort reason in the point where it is fetched from LB and currently implicitely converted in string. Rather I am putting explicit type conversion and stripping all single and double quotes from the reason text, so that in the end is a very normal string. I verified that this works, even when iterated, for a job which had the "quotes in abort reason" issue. Have not put this on Bari server for more extended test.

Here's the fix detail: belforte@lxplus315/Scheduler> cvs diff -r 1.16 GLiteStatusQuery.py

Index: GLiteStatusQuery.py

RCS file: /cvs_server/repositories/CMSSW/COMP/PRODCOMMON/src/python/ProdCommon/BossLite/Scheduler/GLiteStatusQuery.py,v retrieving revision 1.16 diff -r1.16 GLiteStatusQuery.py 56d55 <
361c360,364

< reason = reason + oneReason + "."

> # make type conversion to string explicit > # and remove all " and ' from oneReason > oneReasonString=str(oneReason).replace('"','') > oneReasonString=oneReasonString.replace('\'','') > reason = reason + oneReasonString + "." belforte@lxplus315/Scheduler>

I am waiting for comments from developers.

ericvaandering commented 11 years ago

Comment by mcinquil on Wed Oct 5 05:11:10 2011

Stefano, as long as this works and you are happy with it this is fine to me.

ericvaandering commented 11 years ago

Comment by belforte on Wed Oct 5 05:27:07 2011

thanks for the trust Mattia, but I asked because there is always the chance that I miss something that may be obvious to you even before I run the tests. As you know no test can fully certify all changes. By the way. Any possibility that the thing that was confusing CrabServer may also impact CrabClient DB ? Should we delay relasing 279 until this fix of my is fully tested ? stefano

ericvaandering commented 11 years ago

Comment by mcinquil on Wed Oct 5 05:40:07 2011

I am not sure if the reason of the previous failures was 100% understood. But, as long as the new fix works, I think that if the proper tests are done the fix can go in (both: standalone and server). Then proper validations have to be done and no hand patching of the server should be performed. I am not sure on next release status, so I will leave someone else to comment on that.

ericvaandering commented 11 years ago

Comment by belforte on Wed Oct 5 06:29:27 2011

committed, will test more on Bari server and should go in client 279pre5 for testsing there /cvs_server/repositories/CMSSW/COMP/PRODCOMMON/src/python/ProdCommon/BossLite/Scheduler/GLiteStatusQuery.py,v <-- GLiteStatusQuery.py new revision: 1.18; previous revision: 1.17

ericvaandering commented 11 years ago

Comment by belforte on Fri Oct 7 05:01:43 2011

20+ taks all OK on Bari server, including a couple of users who submitted even it it is no in the list. Tested also on client 279pre4.

ericvaandering commented 11 years ago

Comment by belforte on Tue Nov 1 18:22:49 2011

in production with latest ProdCommon tag as client 279

ericvaandering commented 11 years ago

Closed by belforte on Tue Nov 1 18:22:49 2011