Closed belforte closed 5 months ago
Best IMHO is to have only one script and only one test: when we remove SPOOL_DIR, remove WEB_DIR too.
This is somewhat urgent, since we are close to critical on vocms0195 and not doing well overall:
vocms0106 : /dev/vdb 1008G 691G 267G 73% /home/grid
vocms0107 : /dev/vdb 1008G 623G 335G 66% /home/grid
vocms0119 : /dev/vdb 1008G 562G 395G 59% /home/grid
vocms0120 : /dev/vdb 1008G 693G 265G 73% /home/grid
vocms0121 : /dev/vdb 1008G 654G 304G 69% /home/grid
vocms0122 : /dev/vdb 1008G 602G 356G 63% /home/grid
vocms0137 : /dev/vdb 1008G 650G 308G 68% /home/grid
vocms0144 : /dev/vdb 1008G 640G 318G 67% /home/grid
vocms0155 : /dev/vdb 1008G 665G 292G 70% /home/grid
vocms0194 : /dev/vdb 1008G 647G 310G 68% /home/grid
vocms0195 : /dev/vdb 1008G 768G 190G 81% /home/grid
vocms0196 : /dev/vdb 1008G 658G 300G 69% /home/grid
vocms0197 : /dev/vdb 1008G 694G 263G 73% /home/grid
vocms0198 : /dev/vdb 1008G 698G 260G 73% /home/grid
vocms0199 : /dev/vdb 1008G 669G 288G 70% /home/grid
hmm... I got it all wrong !!!

The way things work is:

- `cleanup_home_grid.sh`: its purpose is unclear; possibly it was put there to catch anything that slipped through, and that's why it uses a safe 60d threshold.
- `clean_condor_spool_dir.sh`: similarly unclear. It removes SPOOL_DIR very soon (two weeks after the task completed); maybe it was put/kept there to take care of early cleanup of HammerCloud, which is fine, but should not be mixed in.

So I'd rather say that we add the `CRAB_TaskEndTime` classAd when we submit them, and make no changes to `JobCleanup.py`. We can do some emergency cleanup by hand in the meanwhile.

OTOH I am puzzled that we do not see problems because of the early disappearance of SPOOL_DIR. Something more to understand?
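A minimal sketch of how a cleanup pass could use such a classAd. All names and grace periods here are illustrative assumptions, not the actual `JobCleanup.py` API or configuration:

```python
import time

# Illustrative sketch only: assumes each task carries a CRAB_TaskEndTime
# classAd (epoch seconds) stamped at submission/completion time.
# Grace periods below are made-up numbers for the example.
SECONDS_PER_DAY = 86400

def is_expired(task_end_time, grace_days, now=None):
    """True once `grace_days` have elapsed since the task ended."""
    now = time.time() if now is None else now
    return now - task_end_time > grace_days * SECONDS_PER_DAY

# A task that ended 20 days ago: a 14d spool grace has expired,
# a longer hypothetical 60d grace has not.
end = time.time() - 20 * SECONDS_PER_DAY
print(is_expired(end, 14))  # True
print(is_expired(end, 60))  # False
```

With one predicate like this, SPOOL_DIR and WEB_DIR could share a single script and a single test, differing only in the grace period.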
In a way I am remembering, re-discovering, re-proposing the "old" plan from https://github.com/dmwm/CRABServer/issues/4681#issuecomment-302336451 !!
What I do not see is how `clean_condor_spool_dir.sh` came about.

I was REALLY wrong. There is no crontab running the obsolete `clean_condor_spool_dir.sh`, which was superseded by `JobCleanup.py`.
Yet I found WEB_DIRs with logs in them and symlinks pointing to removed spool directories!
And `/var/log/crab/JobCleanup.log` only has lines about the spool dir being removed, not the home dir.
OK, finally I found it, in https://gitlab.cern.ch/ai/it-puppet-hostgroup-vocmsglidein/-/blob/qa/code/files/crabschedd/JobCleanup.py?ref_type=heads#L56:

```python
homeDir="/home/grid/%s/%s"%(crabDag.task["Owner"], crabDag.task["Crab_ReqName"])
```
Now all tasks run with `Owner = crabtw` !! So that variable always points to non-existing directory names, and all WEB_DIRs are only caught by the fall-back `cleanup_home_grid.sh` after 60 days.

Need to change it to:

```python
homeDir="/home/grid/%s/%s"%(crabDag.task["CRAB_UserHN"], crabDag.task["Crab_ReqName"])
```
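To make the failure mode concrete, here is a toy reproduction. The task dictionary is fabricated for illustration; only the attribute names come from the code above:

```python
# Toy reproduction of the bug: every task now runs with Owner = "crabtw",
# so a path built from Owner never matches the real per-user directory.
task = {
    "Owner": "crabtw",        # the same for all tasks nowadays
    "CRAB_UserHN": "jdoe",    # hypothetical CMS username
    "Crab_ReqName": "240101_120000:jdoe_crab_example",
}

broken = "/home/grid/%s/%s" % (task["Owner"], task["Crab_ReqName"])
fixed = "/home/grid/%s/%s" % (task["CRAB_UserHN"], task["Crab_ReqName"])

print(broken)  # /home/grid/crabtw/... : a directory that does not exist
print(fixed)   # /home/grid/jdoe/...   : the actual per-user WEB_DIR location
```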
I'd better close this and open an ad-hoc issue w/o all the noise.
WEB_DIR: in https://gitlab.cern.ch/ai/it-puppet-hostgroup-vocmsglidein/-/blob/qa/code/files/crabschedd/cleanup_home_grid.sh?ref_type=heads we remove directories last changed 60 days ago or more.

SPOOL_DIR: in https://gitlab.cern.ch/ai/it-puppet-hostgroup-vocmsglidein/-/blob/qa/code/files/crabschedd/clean_condor_spool_dir.sh?ref_type=heads we remove it 14 days after the task is over.

By the way, we do not clean up HC after 7 days, since it now runs with owner `crabtw` like everything else; we need to filter on e.g. `crab_userhn` or (better) on `CRAB_UserRole = "production"`.
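A possible filter, sketched over plain dictionaries standing in for the schedd classAds. The ad contents are invented; only the attribute names (`Owner`, `CRAB_UserRole`) come from this thread:

```python
# Sketch: since Owner is "crabtw" for everything, identify HammerCloud
# tasks via CRAB_UserRole instead. Ads are plain dicts here for
# illustration; real code would read the HTCondor job/DAG classAds.
def is_hammercloud(ad):
    return ad.get("CRAB_UserRole") == "production"

ads = [
    {"Owner": "crabtw", "CRAB_UserRole": "production"},  # HC task
    {"Owner": "crabtw", "CRAB_UserRole": ""},            # ordinary user task
]
hc_tasks = [ad for ad in ads if is_hammercloud(ad)]
print(len(hc_tasks))  # 1
```

The same predicate could then drive the early 7-day HC cleanup without touching ordinary user tasks.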