columnflow / columnflow

Backend for columnar, fully orchestrated HEP analyses with pure Python, law and order.
https://columnflow.readthedocs.io
BSD 3-Clause "New" or "Revised" License

HTCondor not working on lxplus #554

Open pkausw opened 2 weeks ago

pkausw commented 2 weeks ago

As pointed out by @gsaha009 in issue #552, HTCondor is currently not working on lxplus. Since this issue is unrelated to the original topic of #552, I'm moving the information from the other thread here.

Here is the error trace that @gsaha009 has provided:

```text
INFO: luigi-interface - [pid 2029110] Worker Worker(salt=5874309394, workers=1, host=lxplus935.cern.ch, username=gsaha, pid=2029110) running   cf.CalibrateEvents(effective_workflow=htcondor, branch=-1, analysis=httcp.config.analysis_httcp.analysis_httcp, version=Oct23_v5, config=run3_2022_preEE_nano_cp_tau_v12_limited, shift=nominal, local_shift=nominal, dataset=h_ggf_tautau_uncorrelated_filter, calibrator=main, workflow=htcondor)
going to submit 1 htcondor job(s), run3_2022_preEE_nano_cp_tau_v12_limited, h_ggf_tautau_uncorrelated_filter
ERROR: luigi-interface - [pid 2029110] Worker Worker(salt=5874309394, workers=1, host=lxplus935.cern.ch, username=gsaha, pid=2029110) failed    cf.CalibrateEvents(effective_workflow=htcondor, branch=-1, analysis=httcp.config.analysis_httcp.analysis_httcp, version=Oct23_v5, config=run3_2022_preEE_nano_cp_tau_v12_limited, shift=nominal, local_shift=nominal, dataset=h_ggf_tautau_uncorrelated_filter, calibrator=main, workflow=htcondor)
Traceback (most recent call last):
  File "/eos/user/g/gsaha/CPinHToTauTauOutput/software/venvs/cf_dev_9b04c75c/lib/python3.9/site-packages/luigi/worker.py", line 210, in run
    new_deps = self._run_get_new_deps()
  File "/eos/user/g/gsaha/CPinHToTauTauOutput/software/venvs/cf_dev_9b04c75c/lib/python3.9/site-packages/luigi/worker.py", line 138, in _run_get_new_deps
    task_gen = self.task.run()
  File "/afs/cern.ch/work/g/gsaha/public/IPHC/Work/ColumnFlowAnalyses/CPinHToTauTau/modules/columnflow/modules/law/law/workflow/remote.py", line 628, in run
    return self._run_impl()
  File "/afs/cern.ch/work/g/gsaha/public/IPHC/Work/ColumnFlowAnalyses/CPinHToTauTau/modules/columnflow/modules/law/law/workflow/remote.py", line 700, in _run_impl
    self.submit()
  File "/afs/cern.ch/work/g/gsaha/public/IPHC/Work/ColumnFlowAnalyses/CPinHToTauTau/modules/columnflow/modules/law/law/workflow/remote.py", line 882, in submit
    job_ids, submission_data = self._submit_group(submit_jobs)
  File "/afs/cern.ch/work/g/gsaha/public/IPHC/Work/ColumnFlowAnalyses/CPinHToTauTau/modules/columnflow/modules/law/law/contrib/htcondor/workflow.py", line 190, in _submit_group
    c, p = job_id.split(".")
AttributeError: 'Exception' object has no attribute 'split'
INFO: luigi-interface - Informed scheduler that task   cf.CalibrateEvents_httcp_config_ana__1__f25b583b5f   has status   FAILED
INFO: luigi-interface - Worker Worker(salt=5874309394, workers=1, host=lxplus935.cern.ch, username=gsaha, pid=2029110) was stopped. Shutting down Keep-Alive thread
INFO: luigi-interface -
===== Luigi Execution Summary =====
```

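For context, the final frame hints at the failure mode: the submission backend hands back per-job results that can be either an HTCondor job id string (`"cluster.proc"`) or an `Exception` instance describing a failed submission, and unpacking the id with `split(".")` then blows up on the `Exception` object instead of surfacing the real submission error. A minimal sketch of that pattern (not the actual law code; `parse_job_ids` is a hypothetical helper):

```python
def parse_job_ids(results):
    """results: list of "cluster.proc" strings or Exception instances
    returned per job by a (hypothetical) submission backend."""
    job_ids = []
    for job_id in results:
        # the guard that the failing code path was effectively missing:
        # an Exception result must be reported, not string-parsed
        if isinstance(job_id, Exception):
            raise RuntimeError(f"job submission failed: {job_id}")
        c, p = job_id.split(".")  # HTCondor cluster id and process id
        job_ids.append((int(c), int(p)))
    return job_ids
```

Without the `isinstance` check, the underlying condor submission error is masked by the misleading `AttributeError` seen above.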
gsaha009 commented 2 weeks ago

Thanks @pkausw for keeping this at high priority. I just wanted to add that I have managed to bypass this error with this strategy. I have checked that the jobs run both locally and on HTCondor. The only caveat is that the large files (parquet and pickle) are saved on EOS as expected, but all other outputs still end up on AFS.

pkausw commented 2 days ago

Hi @gsaha009, @riga pushed a fix for this issue in law: https://github.com/riga/law/commit/3a8a1651ee20d3c54a8e3611baddf01bb0cfb840. Can you update to the latest master version of cf (commit 273b0ba) and try again?
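For reference, the update could look roughly like this. This is only a sketch: it assumes columnflow is checked out as a git repository with law vendored as a submodule under `modules/law`, which matches the paths in the traceback above; adjust paths to your setup.

```shell
# sketch only; repo layout and remote name are assumptions
cd columnflow
git fetch origin
git checkout 273b0ba                      # cf commit mentioned above
git submodule update --init --recursive   # pulls in the pinned law fix
```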