Bug: UPDATED_COL_REP table grows too big

haozturk commented 2 hours ago

Bug Description

The problem is described by Panos in this ticket: https://its.cern.ch/jira/browse/CMSDM-210. I'm creating this issue so that we can include it in our Q4 planning and work out a solution.

Reproduction Steps

No response

Expected Behavior

No response

Possible Solution

Firstly, do we really need this table and the COLLECTION_REPLICAS table? Is list-dataset-replicas currently using this table or the replicas table? If the former, I remember ATLAS mentioning at the Rucio workshop that they run a patch which makes this method use the expensive --deep flag by default, and that they haven't observed any problems. If that's the case, I think we can consider this option too.

If we eventually decide that we need this table long term, then we need to come up with a way to handle it. I heard Yuyi has done some work to partition it, which was not deployed in production [1].

If we're going to get rid of it eventually, then we need a procedure to handle this table in the meantime. If we make --deep the default, I reckon we can create an SQL procedure that wipes out UPDATED_COL_REP and COLLECTION_REPLICAS regularly. Otherwise, we should run another procedure that wipes out the UPDATED_COL_REP table and refills COLLECTION_REPLICAS from the replicas table.
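To make the second option concrete, here is a minimal sketch of the wipe-and-refill logic. It is written in Python against an in-memory SQLite database purely for illustration; the production database is not SQLite, and the table layout below (columns `dataset`, `file`, `rse`, `bytes`, `n_files`) and the RSE names are hypothetical simplifications, not the real Rucio schema. A real version would most likely be a stored procedure run by a scheduled database job.

```python
import sqlite3

# Toy schema for illustration only. The real Rucio tables have many more
# columns and different keys; these column names are assumptions.
TOY_SCHEMA = """
CREATE TABLE replicas (dataset TEXT, file TEXT, rse TEXT, bytes INTEGER);
CREATE TABLE collection_replicas (dataset TEXT, rse TEXT, n_files INTEGER, bytes INTEGER);
CREATE TABLE updated_col_rep (dataset TEXT, rse TEXT);
"""


def wipe_and_refill(conn):
    """Empty updated_col_rep and rebuild collection_replicas from replicas."""
    # "with conn" wraps the statements in one transaction: it commits if all
    # of them succeed and rolls back if any of them raises.
    with conn:
        conn.execute("DELETE FROM updated_col_rep")
        conn.execute("DELETE FROM collection_replicas")
        conn.execute(
            "INSERT INTO collection_replicas (dataset, rse, n_files, bytes) "
            "SELECT dataset, rse, COUNT(*), SUM(bytes) "
            "FROM replicas GROUP BY dataset, rse"
        )


def demo():
    """Build a small toy database, run the procedure, return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(TOY_SCHEMA)
    conn.executemany(
        "INSERT INTO replicas VALUES (?, ?, ?, ?)",
        [
            ("ds1", "f1", "RSE_A", 10),
            ("ds1", "f2", "RSE_A", 20),
            ("ds1", "f1", "RSE_B", 10),
        ],
    )
    # Simulate a backlog of pending updates and one stale summary row.
    conn.executemany(
        "INSERT INTO updated_col_rep VALUES (?, ?)", [("ds1", "RSE_A")] * 3
    )
    conn.execute("INSERT INTO collection_replicas VALUES ('stale', 'RSE_X', 1, 1)")
    conn.commit()
    wipe_and_refill(conn)
    return conn
```

Doing the delete and the refill in one transaction means readers never see an empty COLLECTION_REPLICAS in between, at the cost of a longer transaction on a large replicas table.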

If eventually rucio decides to drop this table, then we would get rid of this problem completely.

@ericvaandering FYI

[1] https://github.com/yuyiguo/rucio/pull/7/files#diff-6db4929cf5c1d099d8d38edb8fc68e9a4cb70a3fa466b61c238b6f54f6eeefc9

ericvaandering commented 2 hours ago

The solution here is to get the "always deep" patch, apply it, and get rid of the table and the jobs that produce it.

haozturk commented 2 hours ago

Okay, we can try it out next week?

I don't know if we can just delete the tables after this. If Rucio tries to update tables that no longer exist, it would crash, and I don't know how it handles such exceptions. That's why I suggested a cron job that wipes out these tables regularly.

ericvaandering commented 1 hour ago

I did it now.

The job COLL_REPL_UPDATED_JOB_CMS runs COLL_REPLICAS_UPDATE_ALL.

That job was stopped and disabled. Rucio was patched to always use --deep (very simple patch).

I did not delete the table. From what I know, Rucio itself only tries to read this table, not update it.
