dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

MSTransferor: Being able to ban one or more sites from input staging #10477

Open haozturk opened 3 years ago

haozturk commented 3 years ago

Impact of the new feature
MSTransferor; this affects production workflows whose input is staged to a site that has storage or network issues.

Is your feature request related to a problem? Please describe.
Yes. Recently we have seen an issue with the T2_ES_IFCA site, which is explained here [1]. Although the input files exist and are accessible over xrootd, production workflows cannot read them due to this issue. Normally we should be able to fix such issues in a reasonable amount of time, but this problem with IFCA has remained unsolved for almost 2 months, so it is delaying a bunch of workflows.

[1] https://ggus.eu/?mode=ticket_info&ticket_id=151314

Describe the solution you'd like
When we see a long-standing storage or network issue at a site that prevents production jobs from reading their input, we should be able to stop input staging to that site until the issue is fixed, to avoid further failures.
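To make the request concrete, here is a minimal sketch of the kind of filtering I have in mind. It is not MSTransferor code: the function name `filter_staging_sites`, the `banned` set, and the candidate list are all hypothetical and only illustrate dropping banned sites before a staging destination is chosen.

```python
from typing import Iterable, List, Set


def filter_staging_sites(candidate_sites: Iterable[str], banned_sites: Set[str]) -> List[str]:
    """Return the candidate sites in their original order, minus any banned site.

    Hypothetical helper for illustration; not part of WMCore.
    """
    return [site for site in candidate_sites if site not in banned_sites]


# Hypothetical operator-maintained ban list; T2_ES_IFCA is the site from this issue.
banned = {"T2_ES_IFCA"}
# Hypothetical candidate destinations, for illustration only.
candidates = ["T1_US_FNAL", "T2_ES_IFCA", "T2_CH_CERN"]

allowed = filter_staging_sites(candidates, banned)
print(allowed)
```

Removing a site from the ban list would make it eligible for input staging again once the issue is fixed.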

Describe alternatives you've considered
None

Additional context
It's possible that this feature request is not the best solution to such issues. If you have any other suggestions, please feel free to raise them.

klannon commented 3 years ago

Why is a site whose network or storage is inaccessible not put into downtime? Aren't network and storage required for a site to be considered available for production?

nsmith- commented 3 years ago

I agree, we should discuss this issue with site support: if the site cannot reliably read its files locally it should be in downtime. Downtimes also provide additional motivation for the site to repair because they are tracked and reported to funding agencies.

amaltaro commented 3 years ago

Replicating comments made by Hasan in the Q3 document plan: we had better put this on hold and have further discussions (see also the comments from Nick and Kevin above).