dmwm / WMCore

Core workflow management components for CMS.

Implement the T0 block deletion logic in RucioInjector #9623

Closed amaltaro closed 4 years ago

amaltaro commented 4 years ago

Impact of the new feature: WMAgent

Is your feature request related to a problem? Please describe.
As we start moving to Rucio, we need to re-evaluate how the usual PhEDEx block deletion was performed by the T0 agent (even though any other agent could use that code), and implement the proper logic with Rucio rules.

Describe the solution you'd like
In general, we need to implement the block deletion logic in the new RucioInjector component, from this PR: https://github.com/dmwm/WMCore/pull/9617

That block deletion is currently used only by the T0-WMAgent, so we need to make sure to support their use case before T0 can switch to Rucio. A simple and possible solution for this use case would be:

Describe alternatives you've considered
I think it would still be useful to write down in plain English what the current logic in place for PhEDEx is, such that we don't miss any important details here.

Additional context
Note it's very similar to this issue: https://github.com/dmwm/WMCore/issues/8134. Here, however, I'd like to focus on the WMAgent side only, while issue #8134 is meant to address how to wisely use the Rucio rules/deletion features in the WM system.

hufnagel commented 4 years ago

Since I wrote that code, here is the current logic.

First, this is a per-subscription setting; you configure whether you want it when you set up the workflows and their subscriptions. By default it's off; the only place we enable it is in the Tier0:

https://github.com/dmwm/T0/blob/master/src/python/T0/RunConfig/RunConfigAPI.py#L356

Second, if this is enabled for the subscription, it works at the block level. It finds all blocks that

A1) have cleanup enabled for the subscription
A2) the PhEDEx subscription is made
A3) are closed
A4) have not been cleaned up yet
A5) and all workflows for this block have completed

For the actual rules see

https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/RucioInjector/Database/MySQL/GetDeletableBlocks.py#L27
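To make the selection above concrete, here is a rough Python equivalent of conditions A1-A5; the authoritative logic is the SQL in the DAO linked above, and the `blocks` structure here is purely hypothetical:

    # Illustrative sketch only; the real selection is the SQL in GetDeletableBlocks.py.
    # `blocks` is a hypothetical list of dicts summarizing agent-side block state.
    def selectDeletableBlocks(blocks):
        deletable = []
        for block in blocks:
            if not block["deleteWhenDone"]:        # A1: cleanup enabled for the subscription
                continue
            if not block["subscriptionMade"]:      # A2: the PhEDEx subscription is made
                continue
            if block["status"] != "Closed":        # A3: block is closed
                continue
            if block["deleted"]:                   # A4: not cleaned up yet
                continue
            if any(state != "completed" for state in block["workflowStates"]):
                continue                           # A5: all workflows for this block have completed
            deletable.append(block)
        return deletable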

Once we have that list of blocks, we get every active subscription for these blocks (not within the agent, but from PhEDEx). We need this block/subscription mapping later.

Then it loops over the blocks individually and checks for each block

B1) if the block is on tape, skip deletion and mark the block deleted (should never happen)
B2) if the block has an active PhEDEx subscription to its location, skip deletion and mark the block deleted
B3) if neither B1 nor B2 applies, get all current replicas for the block from PhEDEx
B3.1) if the block is not present at the location we want to clean it up from, just mark it as deleted and move on
B3.2) if the block is at all the locations it is subscribed to in the agent, trigger deletion in PhEDEx and mark the block deleted if that delete call succeeds
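As a sketch, that per-block decision could be written like this; the helper functions are hypothetical stand-ins for the PhEDEx data-service lookups the component actually performs:

    # Hypothetical helpers: isOnTape(), activeSubscriptionSites(), blockReplicaSites()
    # and subscribedSites() stand in for the real PhEDEx data-service calls.
    def decideBlockAction(block):
        if isOnTape(block):                                       # B1
            return "mark-deleted"
        if block["location"] in activeSubscriptionSites(block):   # B2
            return "mark-deleted"
        replicas = blockReplicaSites(block)                       # B3
        if block["location"] not in replicas:                     # B3.1
            return "mark-deleted"
        if set(subscribedSites(block)).issubset(replicas):        # B3.2
            return "delete-in-phedex"
        return "retry-next-cycle"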

With Rucio, and with us doing this for all data the agent produces, not all of these checks are still relevant.

To first order, I would say you still need A2, A3, A4 and A5, and all the B checks collapse into just removing the single rule that kept a block at the location where it was produced.
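A minimal sketch of that collapsed logic with the Rucio client, assuming the agent created exactly one block-level rule under its own account at the production RSE (the account, scope and RSE names are placeholders):

    from rucio.client import Client

    client = Client()

    def releaseBlock(blockName, productionRSE, agentAccount="wma_prod"):
        """Remove the single rule that kept the block at the site where it was produced."""
        for rule in client.list_did_rules(scope="cms", name=blockName):
            if rule["account"] == agentAccount and rule["rse_expression"] == productionRSE:
                client.delete_replication_rule(rule["id"])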

nsmith- commented 4 years ago

A few questions, not T0-specific:

The non-T0 way of handling this is to have dynamo greedily delete any block injected by WMAgent that unified or WMCore made the "DataOps" subscription for, once a final subscription has been generated. I think going forward we should really make sure the same logic is used, since as far as I know the goals are the same.

amaltaro commented 4 years ago

Nick, the non-T0 way to delete data relies on dynamo: basically, anything that is not locked by central production (or any other group) can go away.

About the diagram, in short (for central production - non-T0 agent):

the default configuration sets blocks to remain in the open status for ~18h, but this is configurable on a per-workflow basis.

nsmith- commented 4 years ago

Ok, follow-up:

I am leaning toward always making a fixed-lifetime, block-level rule for each block. If it is truly open for a short time (~18h) then this is fine; if not, then we could make the rule at the end (knowing that no harm will come to injected files until a rule is created and subsequently removed). The expectation is that some other component, somewhere, has already created the destination ("NonCustodialSite") rule, and that will protect files that are there or destined to go there (any replica that is the last remaining source of a transfer request in Rucio is protected by default, even if the block-level rule expires).
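For illustration, a fixed-lifetime block-level rule as described above could look roughly like this (the DID, RSE and lifetime values are placeholders, not a proposed configuration):

    from rucio.client import Client

    client = Client()

    # Pin the block at its production site for a fixed period; once the lifetime
    # expires, Rucio drops the rule and the replicas become eligible for cleanup.
    client.add_replication_rule(
        dids=[{"scope": "cms", "name": "/Primary/Processed-v1/AOD#<block-uuid>"}],
        copies=1,
        rse_expression="T2_CH_CERN",      # where the block was produced (placeholder)
        lifetime=7 * 24 * 3600,           # seconds; placeholder value
        comment="WMAgent keep-at-source rule",
    )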

nsmith- commented 4 years ago

To be clear, the non-T0 way must change with the rucio transition, and as far as I can tell we can do the same in both cases

amaltaro commented 4 years ago

Nick, can you please enlighten me on the difference between making a block rule for a recently created block (thus open and still getting files attached to it) versus a rule for a block that has already been closed (thus no new files getting attached to it)?

About the lifetime of an open block, we might have some special cases for central production, e.g.:

In addition to that, it looks like T0 has different configurations, like an hour for Express workflows and 24h for all the rest (if I spotted all the places): https://github.com/dmwm/T0/blob/master/src/python/T0/RunConfig/Tier0Config.py#L844

hufnagel commented 4 years ago

I think you need to make Rucio rules (both the final rule and the local keep-in-place safety rule) for a block once the block is opened and the first file is in it. The reason is that you want transfers to start with the first file in the block, not when the block is closed.

We have this with PhEDEx right now, and for a reason: 18h (or 24h for T0) is just too much time to wait to start transferring, both for data safety and for disk space reasons.

hufnagel commented 4 years ago

@amaltaro , yes, T0 customizes this due to latency requirements for Express and some otherwise very small block sizes for bulk (some streams/datasets are really low rate).

nsmith- commented 4 years ago

Alan, for the block-level rule I think it's best to create it at the end, because the intention of the rule is essentially only to ensure eventual cleanup. If we create the rule at the beginning and it expires before all replicas are injected and attached to the block, then those last few file replicas will never have had a rule on them and will live at their source site forever.

Dirk, a transfer can start the moment the DID replica is injected and a rule matches it. There is no need to wait for a block to close before Rucio can start transferring files to satisfy the rule. In particular, if the final output destination container rule is made at the beginning of the workflow, transfers will happen to satisfy it as each file is injected. To note, of course this is not synchronous with the WMCore call to the Rucio APIs. We might want to consider labeling rules from T0 with a special "activity" field so that we can run separate transfer supervision daemons (rucio-conveyor-*) dedicated to this activity to ensure low latency. @ericvaandering FYI
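As an illustration of that activity labelling, the container-level destination rule could be created roughly like this (the DID, RSE and activity label are placeholders, not an agreed configuration):

    from rucio.client import Client

    client = Client()

    # Final-destination rule made at the start of the workflow; transfers are then
    # triggered as each file is injected and attached. The "activity" label would let
    # dedicated conveyor daemons supervise T0 traffic separately.
    client.add_replication_rule(
        dids=[{"scope": "cms", "name": "/Primary/Tier0-v1/RAW"}],
        copies=1,
        rse_expression="T1_US_FNAL_Tape",   # placeholder destination
        activity="T0 Export",               # placeholder activity label
        comment="final output destination rule",
    )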

I put together this example code that I've been using to test and verify my statements: https://gist.github.com/nsmith-/3074cadda71a91ae60ebe16039cfa1dc (Alan, you've already seen it, but we might consider continuing the discussion here).

amaltaro commented 4 years ago

@hufnagel Dirk, on the T0 side: don't we always have a "long term" dataset (Rucio container) subscription ensuring that the dataset is replicated (and at full) somewhere? If so, then I believe T0 could make use of block-level rules with an expiration time.

@nsmith- Nick, as we have been discussing in that private thread, for central production the block-level rule cannot have an expiration time, because workflows can remain in the system for many months. TaskChain workflows require the intermediate blocks to be available at their origin location, so we need to ensure that those blocks remain there until the workflow is completed within the agent working on them (regardless of whether the whole container is also available at full in a different location).

hufnagel commented 4 years ago

If you make or have the final destination rule in place when the block is opened with its first file, and you only make the rule to keep the block on the local disk (much) later, how do you guarantee that the block, or the files in the block, aren't cleaned up, assuming the transfers happen fast?

nsmith- commented 4 years ago

Any file replica that has not had a rule removed will not be cleaned up. In your case, the file replica has no rule keeping it at the local disk (only a rule to trigger a transfer to some remote destination) so after the transfer completes the source file will remain indefinitely until one or more rules are created that match that replica, and then are all subsequently removed.

hufnagel commented 4 years ago

@amaltaro , to answer your question, yes and no.

The PhEDEx subscriptions are set up per workflow (so run/stream or run/dataset) and they are set up for datasets, not blocks. Tier0 configs usually stay the same for many runs, so yes, usually you have dataset subscriptions in place that already cover the data you produce.

But no, in general (and for every new run after a Tier0 config change that creates new dataset names), you only make the PhEDEx dataset subscriptions when the first file for a new block in a new dataset is produced and processed by the agent.

In principle I guess it's the same for normal WMAgent workflows, with the additional complication that you have multiple agents producing data for the same dataset (so they all make identical subscriptions for the same dataset). Tier0 only has one agent, so that part is simpler there.

This is the current behavior with PhEDEx though; it doesn't have to be that way for Rucio. It depends on whether we make individual Rucio rules for every piece of output or we rely on Rucio subscriptions with general policies that make the rules behind the scenes once new data is made known to Rucio. How are we planning to do this with the NanoAOD? And is this specific to NanoAOD or do we want to follow the same model for all data we produce?

hufnagel commented 4 years ago

@nsmith- , relying on the default cleanup rules could still be dangerous in edge cases, couldn't it?

I mean, the cleanup can happen if any rule has been removed, not just the rule keeping the data at the location you produced it at, right? I have a hard time seeing, timing-wise, how you could make a rule transferring it somewhere else and remove that rule before we close the block, but I would feel safer if we make the pin rule asap (so when we have the first piece of data).

nsmith- commented 4 years ago

the cleanup can happen if any rule has been removed

The cleanup can happen after the last rule has been removed. One could do something like this to prematurely delete the file:

  1. Container rule created for (C, T1_XX_Output)
  2. Replica (/store/blah.root, T2_XX_Input) injected, lock count: 0
  3. DID /store/blah.root attached to container C, rule detects match, (/store/blah.root, T1_XX_Output) transfer initiated, (/store/blah.root, T2_XX_Input) has transfer last-copy protection
  4. Rule created matching replica (/store/blah.root, T2_XX_Input), lock count: 1
  5. Rule expires matching replica (/store/blah.root, T2_XX_Input), lock count: 0, tombstone set
  6. (/store/blah.root, T1_XX_Output) transfer completes, (/store/blah.root, T2_XX_Input) loses transfer last-copy protection
  7. Reaper deletes replica (/store/blah.root, T2_XX_Input)
  8. Rule created matching block that replica (/store/blah.root, T2_XX_Input) was attached to, but replica is gone (this would actually trigger a transfer from T1_XX_Output back)

The question is what kind of rule would be created as in step 4. It would be extraneous to the normal operation, and would have to be timed just right. I guess if steps 4-5 were inserted instead between steps 2 and 3 (which should happen very close in time), then the original copy would be in danger. So yes, to cover all our bases, a block rule should come when the block DID is constructed.

hufnagel commented 4 years ago

Yeah, I was thinking about something like that. And yes, I realize the chances of this happening are very, very small (the timing would have to be exactly right). But we'll be going through this loop a gazillion times and our users sometimes do things we don't expect, so...

nsmith- commented 4 years ago

you have multiple agents producing data for the same dataset (so they all make identical subscriptions for the same dataset)

This was my point in #9621, and I think, barring performance issues, we can just let each agent attempt to make the dataset rule and, if it fails because the rule already exists, just ignore the exception. This is assuming that all agents agree on the output site, which I think is always the case.
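A sketch of that "create it and ignore the failure if it already exists" pattern, assuming the standard Rucio client and placeholder names:

    from rucio.client import Client
    from rucio.common.exception import DuplicateRule

    client = Client()

    try:
        client.add_replication_rule(
            dids=[{"scope": "cms", "name": "/Primary/Processed-v1/AOD"}],  # placeholder dataset
            copies=1,
            rse_expression="T1_DE_KIT_Disk",                               # placeholder output site
        )
    except DuplicateRule:
        pass  # another agent already made an identical rule; nothing to do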

It depends on whether we make individual Rucio rules for every piece of output or we rely on Rucio subscriptions with general policies that make the rules behind the scenes once new data is made known to Rucio. How are we planning to do this with the NanoAOD? And is this specific to NanoAOD or do we want to follow the same model for all data we produce?

This is an open topic, how we want to do this in general. For the end-of-the-month NanoAOD plan, @ericvaandering and I think it's best to stick with Rucio subscriptions. It will be a good test to see if these work well for us, and if so we might consider using them in general. But I think it would still be wise to support specifying the output rules in the workflow, in case subscriptions are not sufficiently fine-tunable (which I suspect will be the case). I mentioned this in https://github.com/dmwm/WMCore/issues/9638#issuecomment-614303404 and Alan spun it into #9639.

germanfgv commented 4 years ago

After discussion with Dima and @amaltaro, we decided to start implementation of block deletion logic in RucioInjector. The following is the pseudo-code that I will implement:

deleteBlocks()
    get block dictionary using GetDeletableBlocks()
    get dataset name from block

    check which rules are blocking the dataset
    if single rule then
        make sure we have multiple RSEs

    now check if block is available at those RSEs
    if block is available at at least 2 of the desired RSEs (non-parking data) then
        find ID of rule created by WMAgent
        delete rule
    else
        wait for next cycle
    return

Please let me know if you have any comments
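For reference, here is a rough Python sketch of the pseudo-code above, assuming the standard Rucio client; getDeletableBlocks() is a hypothetical wrapper around the DAO mentioned earlier, and the account name and two-copy threshold are placeholders, so this is not the final implementation:

    from rucio.client import Client

    client = Client()

    def deleteBlocks(agentAccount="wma_prod", minCopies=2):
        for blockName in getDeletableBlocks():              # hypothetical DAO wrapper
            datasetName = blockName.split("#")[0]

            # rules currently protecting the dataset (container)
            datasetRules = list(client.list_did_rules(scope="cms", name=datasetName))
            if sum(rule["copies"] for rule in datasetRules) < minCopies:
                continue                                     # not enough copies requested; wait

            # check at how many RSEs the block is fully available
            fullReplicas = [rep for rep in client.list_dataset_replicas(scope="cms", name=blockName)
                            if rep["available_length"] == rep["length"]]
            if len(fullReplicas) < minCopies:
                continue                                     # wait for the next component cycle

            # find and delete the rule created by the agent itself
            for rule in client.list_did_rules(scope="cms", name=blockName):
                if rule["account"] == agentAccount:
                    client.delete_replication_rule(rule["id"])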

drkovalskyi commented 4 years ago

We have outputs that should not require two copies. How does the agent know which block needs to have two copies and which doesn't? Where do we store this information?

hufnagel commented 4 years ago

The current logic looks at all subscriptions made by the Tier0 agent for a block (really, dataset) and checks that there is a replica at each of these sites. There is no explicit mapping of data type to number of copies; the system detects that automatically, depending on how the transfers were configured. With Rucio you would simply have additional rules for a piece of data; no different.

Btw, I thought this "deletion" would be a generic piece of code in the agent? We simply remove the rule the agent set to keep a piece of data it produced on disk at the production site. Isn't that how all agents will work?

amaltaro commented 4 years ago

For central production, I think we could rely on finished subscriptions to trigger a block rule deletion. However, this won't work for T0, because T0 needs to be aware of the other workflows that might need to read that data, right?

From our discussion this morning, Dima said that having 2 rules (read: 2 copies) of the data isn't enough to let a block-level rule get deleted. The system needs to also ensure that those 2 rules have been completely satisfied (meaning, blocks are locked and available at the expected destination).

klannon commented 4 years ago

I'm a little confused by the terminology being used here. I thought our agreement in previous meetings was that we trust Rucio. This means that the WM system doesn't have to verify the status of any rules. So, I would have argued that things should be phrased as follows:

From our discussion this morning, Dima said that having 2 rules (read: 2 copies) of the data isn't enough to let a block-level rule get deleted. The system needs to also ensure that those 2 rules have been completely satisfied (meaning, blocks are locked and available at the expected destination).

becomes

"The system needs to ensure that there is a rule requesting two replicas elsewhere. Then it can remove the rule protecting the original block."

WM code should never check whether Rucio has acted. It should just express its desires via rules and assume Rucio will act. Rucio is responsible for not deleting the original block until the rule requesting two replicas elsewhere has been satisfied. Am I misunderstanding? I'll mention @nsmith- here since he was the one advocating for this frame of mind.

nsmith- commented 4 years ago

Indeed, Rucio should in principle never delete the last replica of data if it is needed to complete a transfer, even if the rule protecting it is removed. I think the concern was that for T0, where it really is paramount that the RAW gets safely to two tape sites, we cannot afford to trust Rucio on its own and should write some "double-check" logic. If this double-check code is not too complicated then maybe we can implement it and remove it sometime later if we gain confidence. I'm mostly neutral on this, but specifically for RAW. In any other case we should just delete the rule.

amaltaro commented 4 years ago

If the RucioInjector just has to check whether there are 2 container-level rules locking that data elsewhere (or 1 rule with copies=2), this would certainly be the easiest and most performant path. If we have to implement the extra location/availability logic for RAW, then of course it doesn't make much difference (in terms of development effort) whether it's only used for one datatier or for all of them.
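For illustration, that easier check (folding in the requirement mentioned earlier that the rules are already satisfied) could look roughly like this; the scope, account name and hard-coded threshold are placeholders:

    from rucio.client import Client

    client = Client()

    def canDropBlockRule(containerName, blockName, agentAccount="wma_prod"):
        # require at least 2 copies requested by container-level rules that Rucio
        # reports as satisfied (state OK) before removing the agent's block rule
        containerRules = client.list_did_rules(scope="cms", name=containerName)
        satisfiedCopies = sum(r["copies"] for r in containerRules if r["state"] == "OK")
        if satisfiedCopies < 2:
            return False
        for rule in client.list_did_rules(scope="cms", name=blockName):
            if rule["account"] == agentAccount:
                client.delete_replication_rule(rule["id"])
        return True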

This decision is not up to me though, I'm still trying to understand the exact requirements and the best way to implement it.

drkovalskyi commented 4 years ago

I agree with Nick. Given that at the moment Rucio cannot protect two copies, let's proceed with the plan that was outlined by German and clarified by Dirk, i.e. the block deletion code will check all subscriptions/rules made by Tier0 and see if the blocks have reached their destination. @germanfgv could you start working on it, and we can follow up on the details when you have a pull request ready to be reviewed.

hufnagel commented 4 years ago

For central production, I think we could rely on finished subscriptions to trigger a block rule deletion. However, this won't work for T0, because T0 needs to be aware of the other workflows that might need to read that data, right?

Tier0 workflows are self-contained; there are no "other" workflows that use the same data.

That's even true for Repacking and PromptReco, which are separate for monitoring purposes and by workflow name, but internally in WMBS are really just one unit. If you trace through WMBS by input fileset, workflow/subscription and output fileset, Repack and PromptReco form one large tree diagram.

One caveat: the PromptReco parts of the tree are only attached to the Repack parts when PromptReco is released. To protect the Repack workflow and any of its output data from being cleaned up before PromptReco release, the Repack workflows are not set as "injected" until after PromptReco release.

For a generic agent deletion logic, Tier0 Repack/PromptReco shouldn't look any different to a standard TaskChain. The one complication is the caveat I mentioned, protecting Repack output before PromptReco is released.