Open haozturk opened 3 years ago
Why is a site whose network or storage is inaccessible not put into downtime? Aren't network and storage required for a site to be considered available for production?
I agree, we should discuss this issue with site support: if the site cannot reliably read its files locally it should be in downtime. Downtimes also provide additional motivation for the site to repair because they are tracked and reported to funding agencies.
Replicating comments made by Hasan in the Q3 document plan: we had better put this on hold and discuss it further (see also the updates from Nick/Kevin above).
Impact of the new feature
Production workflows whose input is staged to a site that has storage/network issues.
Is your feature request related to a problem? Please describe.
Yes. Recently we have seen an issue with the T2_ES_IFCA site, which is explained in [1]. Although the input files exist and are accessible over xrootd, production workflows cannot read them due to this issue. Normally we should be able to fix such issues in a reasonable amount of time, but this problem with IFCA has remained unsolved for almost 2 months, so it is delaying a number of workflows.
[1] https://ggus.eu/?mode=ticket_info&ticket_id=151314
Describe the solution you'd like
If we see a long-standing storage/network issue at a site that prevents production jobs from reading their input, we should be able to stop input staging to that site until the issue is fixed, to avoid further failures.
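To make the request concrete, here is a minimal sketch of what "stop input staging to this site" could look like. It is only an illustration: `StagingPolicy`, its methods, and the site name `T2_XX_Example` are hypothetical and are not part of any existing tool. The only assumption is that the staging component picks a destination from a list of candidate site names.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StagingPolicy:
    """Hypothetical holder for sites that must not receive staged input."""

    # Maps a blocked site name to the reason it was blocked (e.g. a ticket URL).
    blocked: dict[str, str] = field(default_factory=dict)

    def block(self, site: str, reason: str) -> None:
        """Stop staging input to `site` until it is unblocked."""
        self.blocked[site] = reason

    def unblock(self, site: str) -> None:
        """Resume staging to `site`; a no-op if it was not blocked."""
        self.blocked.pop(site, None)

    def eligible_sites(self, candidates: list[str]) -> list[str]:
        """Return the candidates that may still receive staged input, in order."""
        return [site for site in candidates if site not in self.blocked]


policy = StagingPolicy()
policy.block("T2_ES_IFCA", "https://ggus.eu/?mode=ticket_info&ticket_id=151314")

# "T2_XX_Example" is a made-up site name used only for illustration.
targets = policy.eligible_sites(["T2_ES_IFCA", "T2_XX_Example"])
```

Recording the reason next to the site name would keep the block tied to the open ticket, so the site can be unblocked once the ticket is resolved.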
Describe alternatives you've considered
None
Additional context
It's possible that this feature request is not the best solution to such issues. If you have any other suggestions, please feel free to raise them.