Open vlimant opened 7 years ago
I thought Unified checks this. Anyway, we will try to add this feature by next release
Yes, Unified checks for PhEDEx/DBS inconsistencies, and sometimes there is something that needs to be taken care of with the transfer team. The point is that if such an inconsistency can occur systematically, one has to wait n (= how many?) hours after a request reaches "completed" status before checking and acting on the inconsistency. In short, there is no way to know whether it is just a delay or whether files are missing/invalidated in the wild.
https://its.cern.ch/jira/browse/CMSCOMPPR-1361 for a use-case where having the synchronisation is mandatory to make sense of the "completed" status
@vlimant - is this still high-priority? It has lingered for an awfully long time.
I'll dump whatever I have to do in October and get this one fixed. Or I close it and we say we can't fix it and we need to live with this forever.
which is it going to be ?
Sorry this was my ticket. I will take care of this.
Linked issue #9543 here, which reports the same problem of workflows sitting in completed status for a long time (hours? days? weeks?) while still waiting for data to be injected into DBS (or into DBS and PhEDEx).
If you do not fix #9543 now, workflows will stay in limbo for a long time before getting to completed, and things will look bad all the same overall.
@amaltaro I'm impressed that this issue from 2017 (!) appears in our workplan. Given that this refers to PhEDEx and might include references to other outdated concepts, perhaps you could spend 60 seconds writing a brief updated description at the top (with a new title too, perhaps?) so that a modern audience appreciates the intentions here?
@klannon you are right, apologies for not getting to this earlier. I have just refactored the original issue description.
I wanted to note though that the P&R team does not consider this issue important for Q4; they are actually interested in https://github.com/dmwm/WMCore/issues/11729 (and, of course, in no longer having file mismatches in WM, which is a very generic problem). That said, I am removing it from the 2023/Q4 board.
Fair enough. Thanks!
Impact of the bug: WMAgent
Describe the bug: Depending on how loaded an agent is, it can take up to a couple of days to inject data into DBS and/or Rucio. This is especially confusing for workflows that have recently been moved to completed status, as the data may be fully injected a few minutes after the transition, or it may take hours or even days.
How to reproduce it: Not clear what exactly triggers it.
Expected behavior: A good data injection commitment would be to have dedicated handling of completed workflows (or of workflows where all the agent subscriptions have been marked as completed), such that DBS3Uploader and RucioInjector expedite their data injection ahead of anything else already available in the database.
Once those data have been properly injected, the components can proceed with their normal operations.
It is clear that the asynchrony would still be there, but provided that everything is stable and functional, it should take less than 2 hours to get all the data available in Rucio and DBS.
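As a rough illustration of the prioritisation described above, here is a minimal sketch of how pending blocks could be ordered so that those belonging to completed workflows are injected first. The names `PendingBlock` and `order_for_injection` are hypothetical and do not correspond to the actual DBS3Uploader or RucioInjector code; the sketch also assumes the set of completed workflows can be obtained from the local database.

```python
from dataclasses import dataclass


@dataclass
class PendingBlock:
    """Hypothetical stand-in for a block awaiting injection."""
    name: str
    workflow: str
    created_at: int  # epoch seconds when the block was created


def order_for_injection(blocks, completed_workflows):
    """Return blocks with completed workflows first, oldest first within each group.

    `completed_workflows` is assumed to be a set of workflow names whose
    subscriptions have all been marked as completed.
    """
    return sorted(
        blocks,
        # False sorts before True, so blocks of completed workflows come first.
        key=lambda b: (b.workflow not in completed_workflows, b.created_at),
    )
```

Once the expedited blocks have been injected, the remaining ones simply follow in their usual age order, so normal operations resume without any extra step.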
Additional context and error message: An alternative would be not to mark workflow subscriptions as done unless the relevant output data has been successfully injected into DBS and Rucio. However, this would make it challenging to identify which workflows need expedited data injection, given that there is no communication between the components other than through object state stored in the local relational database.
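The alternative could be sketched as a simple gate on the subscription state. This is only an illustration under stated assumptions: `can_mark_subscription_done` is a hypothetical helper, and the two sets of injected files are assumed to be derivable from the object state in the local relational database.

```python
def can_mark_subscription_done(output_files, injected_in_dbs, injected_in_rucio):
    """Return True only if every output file is present in both DBS and Rucio.

    All three arguments are assumed to be collections of logical file names.
    Note that a subscription with no output files passes trivially.
    """
    dbs = set(injected_in_dbs)
    rucio = set(injected_in_rucio)
    return all(lfn in dbs and lfn in rucio for lfn in output_files)
```

With such a gate, a workflow would not reach completed before its data is visible in both systems, but the agent would then lack the "completed" signal that the expedited injection relies on.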