DUNE / dist-comp

Action items for DUNE distributed computing, and common scripts that are used.

Some sites not even running the daily test job #31

Closed StevenCTimm closed 1 year ago

StevenCTimm commented 1 year ago

The Workflow Allocator is sending out one job per day to each site, regardless of whether the site is active or not. A number of US sites are not running this test job at the moment. This needs investigating.

Andrew-McNab-UK commented 1 year ago

The Job Factory in the Workflow System tries to send out one job per day even if there are no matches and the site is not enabled for Workflow jobs, to allow this kind of monitoring. The site info/status page is here: https://wfs.dune.hep.ac.uk/dashboard/?method=list-sites

"Last job" on that page is the last time a job started at the site, which is detected by the job calling back to the Workflow Allocator to get some work to do.

It may just be that the Generic Jobs sent out by the factory with jobsub_submit are asking for something that will never match, but it could also be sites with problems.
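To make the "Last job" check concrete, here is a minimal sketch of how the missing sites could be flagged once the last job start times are in hand. It assumes the times have already been copied from the site status page into a plain dictionary. The function name `stale_sites`, the two-day threshold, and the site name `XX_Example` are all hypothetical, and nothing here reflects how the dashboard itself exposes its data.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def stale_sites(
    last_job: Dict[str, Optional[datetime]],
    now: datetime,
    max_age: timedelta = timedelta(days=2),  # assumed threshold, not from the Workflow System
) -> List[str]:
    """Return sorted site names whose last job start is missing or older than max_age."""
    return sorted(
        site
        for site, started in last_job.items()
        if started is None or now - started > max_age
    )


if __name__ == "__main__":
    # Hypothetical sample data; None means no job has ever been seen starting.
    now = datetime(2024, 1, 10, tzinfo=timezone.utc)
    sample = {
        "US_Caltech": None,
        "XX_Example": now - timedelta(hours=6),
    }
    print(stale_sites(sample, now))
```

Since one test job per day is attempted at every site, any site this returns has missed at least one daily job, either because the Generic Job never matched or because the site has a problem.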

StevenCTimm commented 1 year ago

If I could get a jobsub job id for the test job that is submitted, it would be easier to debug this while it is actually going. Is there a particular time of day that all the test jobs go out?

Andrew-McNab-UK commented 1 year ago

If you click on the name of the site to get the site-specific page, there are links to lists of jobs at the site in the different states. Submitted = Idle (usually). Those are the jobs that never called back to the Workflow Allocator to look for a stage to work on.
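As a rough illustration of reading those job lists, the sketch below counts, per site, the jobs still in the submitted state, meaning the ones that never called back. The record layout (dictionaries with `site` and `state` keys), the lowercase state labels, and the function name `never_called_back` are assumptions made for the example, not the Workflow System's actual format.

```python
from collections import Counter
from typing import Iterable, Mapping


def never_called_back(jobs: Iterable[Mapping[str, str]]) -> Counter:
    """Count jobs per site that are still in the (assumed) 'submitted' state."""
    return Counter(job["site"] for job in jobs if job["state"] == "submitted")


if __name__ == "__main__":
    # Hypothetical job records, e.g. transcribed by hand from the site pages.
    jobs = [
        {"site": "US_Caltech", "state": "submitted"},
        {"site": "US_Caltech", "state": "submitted"},
        {"site": "XX_Example", "state": "finished"},
    ]
    for site, count in never_called_back(jobs).most_common():
        print(site, count)
```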

StevenCTimm commented 1 year ago

Glideins are getting through to Caltech (a.k.a. US_Caltech) and calling back to the pool, so it is not a factory/frontend issue. The glidein exited before the job could match, so I didn't figure out why on this trial.

StevenCTimm commented 1 year ago

Scratch that: some of these are "glide resource" classads that are coming from the frontend. There is no record of a successful glidein coming from Caltech recently.

All of these can be looked at in

https://landscape.fnal.gov/monitor/d/000000118/hepcloud-glideins?orgId=1&refresh=5m&var-cluster=cms-t1&var-site=All&var-entry=All&from=now-2d&to=now-5m

to see what the frontend has delivered and what we were able to use. Of course, you have to map into factory entry space.

Andrew-McNab-UK commented 1 year ago

There was a campaign last year to get more of these enabled. Let's close and reopen as we spot individual missing sites.