dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

DBS/Rucio data injection synchronization with workflow completion #8148

Open vlimant opened 7 years ago

vlimant commented 7 years ago

Impact of the bug: WMAgent

Describe the bug: Depending on how loaded an agent is, it can take up to a couple of days to inject data into DBS and/or Rucio. This is especially confusing for workflows that have recently moved to completed status: data may be fully injected a few minutes after the transition, or it may take hours or even days.

How to reproduce it: Not clear what exactly triggers it.

Expected behavior: A good data injection commitment would be dedicated handling of completed workflows (or workflows where all the agent subscriptions have been marked as completed), such that DBS3Uploader and RucioInjector expedite the data injection for those workflows ahead of anything else already available in the database.

Once those data have been properly injected, the components can proceed with their normal operations.

The injection would clearly still be asynchronous, but provided that everything is stable and functional, it should take less than 2 hours to get all the data available in Rucio and DBS.
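To illustrate the prioritization idea, here is a minimal sketch of ordering pending blocks so that those belonging to completed workflows are injected first. It is not WMCore code: `PendingBlock`, `order_for_injection`, and the `completed_workflows` set are hypothetical names, and the real components would read this state from the agent's relational database.

```python
from dataclasses import dataclass


@dataclass
class PendingBlock:
    # Hypothetical record of a block awaiting injection (not a WMCore class).
    name: str
    workflow: str
    created: int  # creation time, seconds since epoch


def order_for_injection(blocks, completed_workflows):
    """Put blocks of completed workflows first, oldest first within each group."""
    # False sorts before True, so blocks of completed workflows come first.
    return sorted(
        blocks,
        key=lambda b: (b.workflow not in completed_workflows, b.created),
    )


blocks = [
    PendingBlock("block_a", "wf_running", 100),
    PendingBlock("block_b", "wf_done", 300),
    PendingBlock("block_c", "wf_done", 200),
]
ordered = order_for_injection(blocks, {"wf_done"})
print([b.name for b in ordered])  # ['block_c', 'block_b', 'block_a']
```

Once the blocks of completed workflows are exhausted, the same ordering falls back to plain oldest-first, which matches the "proceed with normal operations" step.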

Additional context and error message: An alternative would be not to mark workflow subscriptions as done unless the relevant output data has been successfully injected into DBS and Rucio. However, this would make it challenging to identify which workflows need expedited data injection, given that there is no communication between the components other than through object state stored in the local relational database.
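The alternative could be sketched as a simple gate, assuming the agent can list a subscription's output blocks and the blocks already known to DBS and Rucio. The function name and its arguments are hypothetical, for illustration only; they are not part of WMCore.

```python
def can_mark_subscription_done(output_blocks, in_dbs, in_rucio):
    """Return True only when every output block is present in both DBS and Rucio."""
    # in_dbs / in_rucio are assumed to be sets of block names already injected.
    return all(block in in_dbs and block in in_rucio for block in output_blocks)


outputs = ["block_b", "block_c"]
print(can_mark_subscription_done(outputs, {"block_b", "block_c"}, {"block_b"}))
print(can_mark_subscription_done(outputs, {"block_b", "block_c"}, {"block_b", "block_c"}))
```

The first call returns False because `block_c` has not reached Rucio yet; the second returns True. Note that with this gate the subscription never reaches "done" early, which is exactly why it can no longer serve as the signal for which workflows to expedite.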

ticoann commented 7 years ago

I thought Unified checks this. Anyway, we will try to add this feature by the next release.

vlimant commented 7 years ago

Yes, Unified checks for PhEDEx/DBS inconsistency, and sometimes there is something that needs to be taken care of with the transfer team. The point is that if such an inconsistency can occur systematically, one has to wait n (= how many?) hours after a request reaches "completed" status before checking and acting on the inconsistency. In short, there is no way to know whether it is just a delay or files actually missing/invalidated in the wild.

vlimant commented 7 years ago

https://its.cern.ch/jira/browse/CMSCOMPPR-1361 for a use-case where having the synchronisation is mandatory to make sense of the "completed" status

bbockelm commented 6 years ago

@vlimant - is this still high-priority? It has lingered for an awfully long time.

amaltaro commented 6 years ago

I'll dump whatever I have to do in October and get this one fixed. Or I close it and we say we can't fix it and we need to live with this forever.

vlimant commented 5 years ago

Which is it going to be?

ticoann commented 5 years ago

Sorry this was my ticket. I will take care of this.

amaltaro commented 4 years ago

Linked issue #9543 here, which reports the same problem of workflows sitting in completed status for a long time (hours? days? weeks?) while still waiting for data to get injected into DBS (or into DBS and PhEDEx).

vlimant commented 4 years ago

#9543 is about an additional issue in dbsuploader that is worth fixing on its own, rather than trying to get a (longer-term) synchronization of the status with file injection.

If you do not fix #9543 now, workflows will stay in limbo for a long time before getting to completed, and things will look bad all the same overall.

klannon commented 1 year ago

@amaltaro I'm impressed that this issue from 2017 (!) appears in our workplan. Given that this refers to PhEDEx and might include references to other outdated concepts, perhaps you could spend 60 seconds writing a brief updated description at the top (with a new title too, perhaps?) so that a modern audience appreciates the intentions here?

amaltaro commented 1 year ago

@klannon you are right, apologies for not getting to this earlier. I have just refactored the original issue description.

I wanted to note though that the P&R team does not consider this issue important for Q4; they are actually interested in https://github.com/dmwm/WMCore/issues/11729 (and of course, in no longer having file mismatches in WM, which is a very generic problem). That said, I am removing it from the 2023/Q4 board.

klannon commented 1 year ago

Fair enough. Thanks!