dmwm / CRABServer

refactor crab resubmit #6270

Open belforte opened 3 years ago

belforte commented 3 years ago

avoid editing the dagman log and use the new Dagman features developed by Mark C. Once we have agreed on the semantics and the new Dagman is available ...

Will be a cleaner and definitive solution to https://github.com/dmwm/CRABServer/issues/5876

belforte commented 3 years ago

Raise priority, time to start looking at this seriously

belforte commented 3 years ago

for convenience, I paste here the conclusions from #5876. After discussion with HTCondor developers we came to these conclusions:

  1. the problem was not there in 2014 when Brian initially coded this
  2. the problem came when condor got smarter about writing logs and stopped locking them by default, opening the way for our log editing procedure (which attempts to use the condor file locking API) to overwrite a log file with an "old" version where some events are missing
  3. condor can be configured to revert to the old behavior by setting ENABLE_USERLOG_LOCKING=True in its configuration, so that we do not have the problem
  4. we made that change in all CRAB schedds and did not find any sign of increased load or slower operations, so we can run in that way for a while. Ref. https://cms-logbook.cern.ch/elog/Analysis+Operations/3282
  5. editing logs is bad anyhow and it has been agreed to enhance DAGMAN functionality to allow CRAB to do resubmits w/o tampering with files it should not tamper with. Discussion on this has started with HTCondor DAGMAN expert Mark Coatsworth: https://docs.google.com/document/d/1vgJApmjkH9brYhQZbdRooGnj2mqSrme7BWVxzReZ0oM/edit
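
To make point 2 concrete, here is a minimal sketch of the lost-update pattern, assuming an editor that reads the whole log, changes it in memory, and writes it back while the logger keeps appending without a lock. The file name, the event strings, and all function names are hypothetical and are not the real HTCondor event log format or the CRAB code.

```python
import os
import tempfile


def append_event(path, event):
    """Writer side: append one event line, as a job logger would."""
    with open(path, "a") as f:
        f.write(event + "\n")


def read_events(path):
    """Return all event lines currently in the log."""
    with open(path) as f:
        return f.read().splitlines()


def rewrite_log(path, events):
    """Editor side: replace the whole file with an edited copy."""
    with open(path, "w") as f:
        f.writelines(e + "\n" for e in events)


def demo_lost_event():
    """Interleave an unlocked append with a read-edit-write cycle."""
    with tempfile.TemporaryDirectory() as workdir:
        log = os.path.join(workdir, "nodes.log")  # hypothetical file name
        append_event(log, "node1 submitted")
        append_event(log, "node1 failed")

        # The editor takes a snapshot of the log ...
        snapshot = read_events(log)
        # ... and the writer, not blocked by any lock, appends a new event.
        append_event(log, "node2 terminated")

        # The editor drops the failure event and writes back its stale copy.
        edited = [e for e in snapshot if e != "node1 failed"]
        rewrite_log(log, edited)
        return read_events(log)


if __name__ == "__main__":
    print(demo_lost_event())
```

In this toy run the "node2 terminated" event is gone from the final file, because it was appended after the snapshot was taken. If both sides took the same lock around their file access, the append would have to wait until the edited copy was written, which is the behavior point 3 restores.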

belforte commented 3 years ago

will look at this after transition to py3

belforte commented 3 years ago

Update from Mark C.


On 22/02/2021 18:40, Mark Coatsworth wrote:
> Hi Stefano, long overdue update on this work (replacing the old CMS
> CRAB log editing mechanism).
> 
> I had a couple false starts but finally implemented the
> DAGMAN_PUT_FAILED_JOBS_ON_HOLD mechanic that we discussed. It's fairly
> simple: when this option is set to True, DAGMan will put failed jobs
> on hold instead of aborting the dag. This gives CRAB the opportunity
> to fix the problem and continue processing the dag (hence, no need to
> edit the log to re-run failed nodes).
> 
> The new feature will ship in the 9.1 release of Condor later this spring.
> 
> Please keep me posted where things are at on your end. I know this
> will involve some changes in the CRAB code, and likely some tweaks on
> the DAGMan side also. I'd be happy to help with this when the time
> comes,
> 
> Mark

belforte commented 3 years ago

@dciangot since you expressed interest in this, I am adding you to the assignees so you can keep track

belforte commented 2 years ago

note also this recent thread in the htcondor forum: https://lists.cs.wisc.edu/archive/htcondor-users/2022-February/msg00015.shtml

belforte commented 1 year ago

with reference to the last line in https://github.com/dmwm/CRABServer/issues/6270#issuecomment-760194316 : 2 years later, it is better to make a copy of the old Google Doc in condor space, just in case.

original: https://docs.google.com/document/d/1vgJApmjkH9brYhQZbdRooGnj2mqSrme7BWVxzReZ0oM/edit

copy in my drive: https://docs.google.com/document/d/1qsil0UGewazg96cA-KP1QVOwkIcjSI6-ceT-qMsOYa0/edit?usp=sharing

belforte commented 4 months ago

this should be more straightforward now that we have decided not to allow resubmission of successful jobs https://github.com/dmwm/CRABClient/issues/5285