dmwm / CMSRucio

Audit and introduce the rucio decommissioner daemon #585

Open dynamic-entropy opened 1 year ago

dynamic-entropy commented 1 year ago

https://github.com/rucio/rucio/issues/6321 aims to introduce a Rucio daemon that can be used for RSE migrations and decommissioning. It is being worked on in PR https://github.com/rucio/rucio/pull/6066. We need to assess how to use this daemon for operations. A couple of broad questions we may ask are

amanrique1 commented 11 months ago

I like this idea, it can be useful. I see some modifications that we could do. Let's start with our current process:

  1. Protect files without protected replicas outside or move existing rules
  2. Loadtest to false
  3. Empty the site
    1. Update rule: locked false and cancel requests if state not OK
    2. Delete rule: purge replicas true
  4. Delete protocols
  5. Add mock protocol
  6. Set rse_attribute - update_from_json: False
  7. Set rse_attribute - reaper: True
  8. Update rse - {'availability_delete':True}
  9. Set rse_attribute - greedyDeletion: True
  10. When the site is empty, client.delete_rse('site')
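
A minimal sketch of how part of this checklist could be scripted. It covers only steps 2, 3 and 6 to 9; protecting files, cancelling requests, the protocol changes and the final deletion are left out. Assumptions: `client` behaves like a Rucio Python client exposing `add_rse_attribute`, `update_rse`, `update_replication_rule` and `delete_replication_rule`; the attribute name `loadtest` and the helper names `build_manual_plan` and `apply_plan` are hypothetical; whether an existing attribute has to be removed before it is set again is not handled here.

```python
def build_manual_plan(rse, rule_ids):
    """Translate steps 2, 3 and 6-9 of the checklist into (method, args, kwargs) calls."""
    # Step 2: switch the load test off (attribute name is an assumption).
    plan = [("add_rse_attribute", (rse, "loadtest", False), {})]
    # Step 3: unlock each rule, then delete it and purge its replicas.
    for rule_id in rule_ids:
        plan.append(("update_replication_rule", (rule_id, {"locked": False}), {}))
        plan.append(("delete_replication_rule", (rule_id,), {"purge_replicas": True}))
    plan += [
        ("add_rse_attribute", (rse, "update_from_json", False), {}),  # step 6
        ("add_rse_attribute", (rse, "reaper", True), {}),             # step 7
        ("update_rse", (rse, {"availability_delete": True}), {}),     # step 8
        ("add_rse_attribute", (rse, "greedyDeletion", True), {}),     # step 9
    ]
    return plan


def apply_plan(client, plan, dry_run=True):
    """Print every planned call; run it on the client only when dry_run is False."""
    for method, args, kwargs in plan:
        print(f"{method}{args} {kwargs}")
        if not dry_run:
            getattr(client, method)(*args, **kwargs)
```

Keeping the plan as plain data makes it possible to review the calls for a site in a dry run before anything is changed.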

This is how it is handled on the generic daemon:

  1. 2 attr options: delete or move.
  2. No load test handling
  3. Lifetime 0 to rules with RSE_EXPRESSION = RSE
    1. What about locked?????? This is a global rucio error because lifetime could be changed to 0 even if locked=True
    2. What to do with the rules with generic RSE_EXPRESSIONS (not the site name)
    3. What about split_container = true??????
  4. Not done
  5. Not done
  6. Not done
  7. Not done
  8. Done
  9. Done
  10. Not done, an important part when reaching this point is to have the site with 0 rucio available space.
    1. Note: I remember a weird error when decommissioning a site and collection_replicas table didn't show the correct information

Some things can be easily added into the code for the missing params setup (protocol, reaper, loadtest, and update_from_json)

On the other hand, we have some decisions to make related to what I said about step 3.

Finally, we should add an extra step to do the RSE deletion and handle the exceptions that may arise in that step.
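
A rough sketch of what that extra deletion step might look like, not the daemon's actual code. Assumptions: `client` exposes `get_rse_usage` and `delete_rse` like the Rucio Python client, each usage record is a dict with `source` and `used` keys, and the helper name `delete_rse_when_empty` is hypothetical. Exceptions are caught broadly here only to keep the example short.

```python
def delete_rse_when_empty(client, rse):
    """Delete the RSE only if Rucio's own accounting reports zero used bytes.

    Returns a (deleted, reason) tuple instead of raising.
    """
    try:
        usage = list(client.get_rse_usage(rse))
    except Exception as exc:
        return False, f"could not read usage: {exc}"

    # Only trust the record written by Rucio itself (source == "rucio").
    rucio_records = [u for u in usage if u.get("source") == "rucio"]
    if not rucio_records:
        return False, "no rucio usage record found"

    used = sum(u.get("used") or 0 for u in rucio_records)
    if used > 0:
        return False, f"{used} bytes still accounted by rucio"

    try:
        client.delete_rse(rse)
    except Exception as exc:
        return False, f"delete_rse failed: {exc}"
    return True, "deleted"
```

Returning a reason rather than raising would let a daemon log why a site is still pending and retry it on the next cycle.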

dynamic-entropy commented 11 months ago

Hello @amanrique1

These are quite interesting observations. Given your summary of the current implementation, I think we can go ahead and talk to the person working on this, asking whether their instance really does not need these steps and where experiment-specific settings will sit in the daemon workflow. Do you think we should arrange a meeting to explain our intent to help with the development and testing of the daemon?

Cheers

amanrique1 commented 11 months ago

Hi @dynamic-entropy

There is a folder named Profiles, where I guess each experiment should add its specific settings and implementation (I did my analysis based on the generic profile).

I like that meeting idea because it is not clear to me what the CMS custom features added to rucio are.

dynamic-entropy commented 11 months ago

Hello Andres, I had a chat with Dimitrios, and he suggested we hold off till the PR is ready to merge. We can already start working on the profile if you want.

For the third point,

  1. I am told that it takes care of rules with distributed (multi-RSE) expressions as well. If that is the case, why do you say it does not?
  2. Locked rules are designed to be handled manually, that is why they are locked.
  3. "What about split_container = true??????" What about them?

amanrique1 commented 11 months ago

Hi Rahul, my bad. I checked again and there are no problems with the generic RSE_EXPRESSIONS: it gets the rules based on the locks table and not based on the rules table.

When split_container=true there would be problems because the daemon is going to delete the rule, and the rule is not only protecting replicas at the site to decommission. In the move rule scenario, the rules are going to move to a specific RSE all the containers and not only the datasets that are currently there.

Additionally to that, in the move scenario, they won't be able to be cleaned. The loadtest won’t be moved due to an ALREADY existing exception.

I think these are not really a big deal, but we should take them into account when using the daemon.
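
To make the two points above concrete, here is a toy model rather than Rucio's real schema: each lock is reduced to a `(rule_id, rse)` pair, and the site names and helper names are hypothetical. It illustrates that selecting rules through their locks picks up rules with generic RSE expressions, and that some of those rules also hold locks at other sites, which is where deleting or moving the whole rule has side effects.

```python
from collections import defaultdict


def rules_with_locks_on(locks, rse):
    """Select rule ids through the lock rows, regardless of the rule's RSE expression."""
    return {rule_id for rule_id, lock_rse in locks if lock_rse == rse}


def rules_also_protecting_elsewhere(locks, rse):
    """Rules that hold locks on `rse` and on at least one other RSE."""
    rses_by_rule = defaultdict(set)
    for rule_id, lock_rse in locks:
        rses_by_rule[rule_id].add(lock_rse)
    return {
        rule_id
        for rule_id, rses in rses_by_rule.items()
        if rse in rses and len(rses) > 1
    }


# Toy data: rule-B stands for a rule whose datasets are spread over two sites.
locks = [
    ("rule-A", "T2_OLD"),
    ("rule-B", "T2_OLD"),
    ("rule-B", "T2_OTHER"),
    ("rule-C", "T2_OTHER"),
]

affected = rules_with_locks_on(locks, "T2_OLD")
shared = rules_also_protecting_elsewhere(locks, "T2_OLD")
print(sorted(affected), sorted(shared))
```

In this toy data, rule-B is the kind of rule that would need attention before the daemon is run against T2_OLD.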

dynamic-entropy commented 11 months ago

"In the move rule scenario, the rules are going to move to a specific RSE all the containers and not only the datasets that are currently there."

But if the destination is also a generic RSE expression, that will not cause any extra transfers.

"Additionally to that, in the move scenario, they won't be able to be cleaned. The loadtest won’t be moved due to an ALREADY existing exception."

Loadtests are CMS specific and we should deal with them in the profile part of the procedure.

So, if you want, you can already start looking into what all is possible in the profile and already start drafting one for us.

dynamic-entropy commented 7 months ago

Hello @amanrique1, the RSE Decommissioner daemon is now finalised and will be available to us when we upgrade to the next release. You can start looking at coding the CMS profile, I guess, based on the priorities of other issues in your bucket. Cheers