dmwm / CRABServer


automatic splitting fails in latest master after #8507 #8509

Closed belforte closed 1 week ago

belforte commented 1 week ago

this task: https://cmsweb-test2.cern.ch/crabserver/ui/task/240618_133627%3Acmsbot_crab_20240618_153626 PreDag fails to start the processing DAG https://cmsweb.cern.ch:8443/scheddmon/059/cmsbot/240618_133627:cmsbot_crab_20240618_153626/AutomaticSplitting/DagLog0.txt

The relevant error line is:

ERROR: unable to read submit append file (subdag.ad)
belforte commented 1 week ago

This change got lost when merging my modernize_HTC branch into master:

@@ -333,7 +333,7 @@ class PreDAG():
         """ Submit a subdag
         """
         subprocess.check_call(['condor_submit_dag', '-DoRecov', '-AutoRescue', '0', '-MaxPre', '20', '-MaxIdle', str(maxidle),
-                               '-MaxPost', str(maxpost), '-insert_sub_file', 'subdag.ad',
+                               '-MaxPost', str(maxpost), '-insert_sub_file', 'subdag.jdl',
                                '-append', '+Environment = strcat(Environment," _CONDOR_DAGMAN_LOG={0}/{1}.dagman.out")'.format(os.getcwd(), subdag),
                                '-append', '+TaskType = "{0}"'.format(stage.upper()), subdag])
belforte commented 1 week ago

DAG submission is now OK. But the correct proxy is not passed to the FTS_Transfers script: the directory name is inserted twice! Things were OK for non-automatic splitting, so I must be missing some other change previously done in my development branch.

2024-06-18 19:29:02,232: using user's proxy from /data/srv/glidecondor/condor_local/spool/6222/0/cluster9636222.proc0.subproc0//data/srv/glidecondor/condor_local/spool/6222/0/cluster9636222.proc0.subproc0/c6ea75e904ebb26531217d954bd562c96db20c11
2024-06-18 19:29:02,232: error during main loop
Traceback (most recent call last):
  File "/data/srv/glidecondor/condor_local/spool/6222/0/cluster9636222.proc0.subproc0/task_process/FTS_Transfers.py", line 715, in <module>
    algorithm()
  File "/data/srv/glidecondor/condor_local/spool/6222/0/cluster9636222.proc0.subproc0/task_process/FTS_Transfers.py", line 665, in algorithm
    ftsContext = fts3.Context(FTS_ENDPOINT, proxy, proxy, verify=True)
  File "/usr/lib/python3.9/site-packages/fts3/rest/client/context.py", line 182, in __init__
    self._set_x509(ucert, ukey)
  File "/usr/lib/python3.9/site-packages/fts3/rest/client/context.py", line 95, in _set_x509
    raise FileNotFoundError(name + " not found!")
FileNotFoundError: Certificate not found!
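The doubled directory in the log above is the classic symptom of prepending the spool directory to a value that is already an absolute path. A minimal illustration (paths are invented to mirror the log):

```python
import os

# Hypothetical values mirroring the log above.
spool_dir = "/data/srv/glidecondor/condor_local/spool/6222/0/cluster9636222.proc0.subproc0"
proxy_from_json = spool_dir + "/c6ea75e904ebb26531217d954bd562c96db20c11"

# Plain string concatenation prepends the spool directory even though the
# value is already absolute, producing the doubled ".../subproc0//data/..."
# seen in the FTS_Transfers log:
broken = spool_dir + "/" + proxy_from_json

# os.path.join instead discards the earlier components when the second
# argument is absolute, so the resulting path stays valid:
joined = os.path.join(spool_dir, proxy_from_json)
# joined == proxy_from_json
```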
belforte commented 1 week ago

indeed

[crabtw@vocms059 task_process]$ cat RestInfoForFileTransfers.json |jq
{
  "host": "cmsweb-test2.cern.ch:8443",
  "dbInstance": "dev",
  "proxyfile": "/data/srv/glidecondor/condor_local/spool/6222/0/cluster9636222.proc0.subproc0/c6ea75e904ebb26531217d954bd562c96db20c11"
}
[crabtw@vocms059 task_process]$ 

But the code expects a file name without the path: https://github.com/dmwm/CRABServer/blob/3f893cf77152dc1bf9ad309bc1cad2411def88e7/scripts/task_process/FTS_Transfers.py#L73-L75
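A defensive variant on the reading side (not the fix that was adopted, just a sketch with a hypothetical helper) would be to strip any leading directory when loading the file:

```python
import json
import os

def load_proxy_name(path="task_process/RestInfoForFileTransfers.json"):
    """Return only the bare proxy file name, even if an older PostJob
    wrote a full path into the JSON (hypothetical helper)."""
    with open(path, encoding="utf-8") as fp:
        rest_info = json.load(fp)
    # os.path.basename is a no-op on a bare file name, so this is safe
    # for both the old and the new format of the JSON.
    return os.path.basename(rest_info["proxyfile"])
```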

belforte commented 1 week ago

And here is the change I was missing. I do not understand how it is possible that the other tests in #8508 were successful [1]:

LapSB:CRABServer$ git diff modernize_HTC -- src/python/TaskWorker/Actions/PostJob.py
diff --git a/src/python/TaskWorker/Actions/PostJob.py b/src/python/TaskWorker/Actions/PostJob.py
index 5cc59b96..17c235ed 100644
--- a/src/python/TaskWorker/Actions/PostJob.py
+++ b/src/python/TaskWorker/Actions/PostJob.py
@@ -995,7 +995,7 @@ class ASOServerJob():
             #if not os.path.exists('task_process/rest_filetransfers.txt'):
                 restInfo = {'host':self.rest_host,
                             'dbInstance': self.db_instance,
-                            'proxyfile': os.path.basename(self.proxy)}
+                            'proxyfile': self.proxy}
                 with open('task_process/RestInfoForFileTransfers.json', 'w') as fp:
                     json.dump(restInfo, fp)
         else:
@@ -1014,7 +1014,7 @@ class ASOServerJob():
             #if not os.path.exists('task_process/rest_filetransfers.txt'):
                 restInfo = {'host':self.rest_host,
                             'dbInstance': self.db_instance,
-                            'proxyfile': os.path.basename(self.proxy)}
+                            'proxyfile': self.proxy}
                 with open('task_process/RestInfoForFileTransfers.json','w') as fp:
                     json.dump(restInfo, fp)
         return returnMsg
LapSB:CRABServer$ 

[1] I looked at one of those tasks, 240618_133632:cmsbot_crab_20240618_153631, and

[crabtw@vocms059 SPOOL_DIR]$ cat task_process/RestInfoForFileTransfers.json |jq
{
  "host": "cmsweb-test2.cern.ch:8443",
  "dbInstance": "dev",
  "proxyfile": "e080907cad69528d423bbe562e2f8f9873b2c933"
}
[crabtw@vocms059 SPOOL_DIR]$ 

but the PostJob code was missing the os.path.basename. I guess that normal jobs use the proxy file name from the environment of dag_bootstrap.sh:

[crabtw@vocms059 SPOOL_DIR]$ grep  X509 dag_bootstrap.out
X509_USER_PROXY=e080907cad69528d423bbe562e2f8f9873b2c933
[crabtw@vocms059 SPOOL_DIR]$ 

while in the case of automatic splitting I find

[crabtw@vocms059 SPOOL_DIR]$ grep  X509 dag_bootstrap.out 
X509_USER_PROXY=/data/srv/glidecondor/condor_local/spool/6222/0/cluster9636222.proc0.subproc0/c6ea75e904ebb26531217d954bd562c96db20c11
[crabtw@vocms059 SPOOL_DIR]$ 

I think that somehow PreDag takes the proxy from the ads of completed probe jobs

finished_jobs/job.0-1.0:x509userproxy = "/data/srv/glidecondor/condor_local/spool/6222/0/cluster9636222.proc0.subproc0/c6ea75e904ebb26531217d954bd562c96db20c11"
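If so, the value arrives as a full spool path by construction. A sketch of pulling it out of such an ad dump (hypothetical helper; how PreDag actually obtains the proxy is an assumption, not shown here):

```python
import re

def proxy_from_job_ad(ad_text):
    """Return the x509userproxy value found in a classad text dump,
    or None if the attribute is absent (illustrative only)."""
    match = re.search(r'x509userproxy\s*=\s*"([^"]+)"', ad_text)
    return match.group(1) if match else None

# Example with an invented spool path like the one in the probe job ad:
ad = 'x509userproxy = "/spool/6222/0/cluster9636222.proc0.subproc0/c6ea75e9"'
```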

But this is quite intricate. In any case, the solution I had already found, removing any possible path in PostJob before writing the file passed to FTS_Transfers (or Rucio_Transfers), is just fine.
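In PostJob terms, that amounts to restoring the os.path.basename call shown removed in the diff above. A self-contained sketch of the write side (function name and defaults are illustrative):

```python
import json
import os

def write_rest_info(rest_host, db_instance, proxy,
                    path="task_process/RestInfoForFileTransfers.json"):
    """Write the REST info file, always stripping any directory from the
    proxy so FTS_Transfers/Rucio_Transfers never build a doubled path."""
    rest_info = {"host": rest_host,
                 "dbInstance": db_instance,
                 # basename is a no-op on a bare file name, so this is
                 # safe whether the scheduler handed us a path or not
                 "proxyfile": os.path.basename(proxy)}
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(rest_info, fp)
```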