dmwm / PHEDEX

CMS data-placement suite

Deadlock between FileDeleteTMDB and central agents #940

Open ericvaandering opened 10 years ago

Original Savannah ticket 102420 reported by wildish on Fri Aug 30 05:13:42 2013.

Hi Tony,

recently, the operators have reported ORA deadlock errors when executing global invalidations with FileDeleteTMDB, for example:

+verbatim+
DBD::Oracle::st execute failed: ORA-00060: deadlock detected while waiting for resource (DBD ERROR: OCIStmtExecute) [for Statement "delete from t_dps_block b where b.name = :block and not exists (select 1 from t_dps_file where inblock = b.id)" with ParamValues: :block='/PyquenEvtGen_jpsiMuMu_JPsiPt1215/HiFall11-STARTHI44_V12-v1/GEN-SIM#b8e760ae-ea6d-11e2-b05d-003048f0e38c'] at /uscms/home/cmsdatatransfers/phedex/4_1_2/sw/slc5_amd64_gcc461/cms/PHEDEX-micro/PHEDEX_4_1_2/perl_lib/PHEDEX/Core/DB.pm line 322, <IFILE> line 18.
-verbatim-

As a result, the blocks that they were trying to invalidate were left in an inconsistent state in TMDB (the block in t_dps_block has 0 files, but its block replicas in t_dps_block_replica are non-empty), which generates alerts in the BlockMonitor and InvariantMonitor central agents. To fix this, the blocks have to be invalidated again.
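The inconsistent state described above (a block with 0 files that still has block replicas) can be looked for with a query along these lines. This is only a minimal sketch against an in-memory SQLite database, not Oracle and not the real TMDB schema: `t_dps_block.id`, `t_dps_block.name` and `t_dps_file.inblock` are taken from the statement in the error message, while the column `t_dps_block_replica.block`, the `node` column, the helper names and the toy rows are assumptions made up for illustration.

```python
import sqlite3

# Toy stand-in for TMDB with only the columns needed for the check.
# t_dps_block_replica.block and .node are ASSUMED column names.
SCHEMA = """
create table t_dps_block (id integer primary key, name text);
create table t_dps_file (id integer primary key, inblock integer);
create table t_dps_block_replica (block integer, node integer);
"""

# Blocks that have no files left but still have block replicas.
ORPHAN_QUERY = """
select b.name
  from t_dps_block b
 where not exists (select 1 from t_dps_file f where f.inblock = b.id)
   and exists (select 1 from t_dps_block_replica br where br.block = b.id)
 order by b.name
"""


def find_inconsistent_blocks(conn):
    """Return the names of empty blocks that still have block replicas."""
    return [row[0] for row in conn.execute(ORPHAN_QUERY)]


def build_demo_db():
    """Create a toy database with one healthy and one half-invalidated block."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "insert into t_dps_block values (?, ?)",
        [(1, "/A/B/C#healthy"), (2, "/A/B/C#half-invalidated")],
    )
    # Block 1 still has a file; block 2 has none.
    conn.execute("insert into t_dps_file values (10, 1)")
    # Both blocks still have a block replica at (hypothetical) node 100.
    conn.executemany(
        "insert into t_dps_block_replica values (?, ?)", [(1, 100), (2, 100)]
    )
    return conn


if __name__ == "__main__":
    print(find_inconsistent_blocks(build_demo_db()))
```

Only the block that lost its files but kept a replica is reported; a block that is merely empty, with no replicas left, would not be.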

One example is in this Savannah ticket:

https://savannah.cern.ch/support/?func=detailitem&item_id=137053

I think I figured out how this can happen.

1) The operator tries to invalidate a complete block/dataset globally at all nodes, using "FileDeleteTMDB -invalidate" without the "-keepempty" option.
2) The FileDeleteTMDB script deletes from the DB all files in the block, and all corresponding file replicas in t_xfer_replica.
3) Then, FileDeleteTMDB calls deleteEmptyContainers to delete the now empty blocks in t_dps_block, which will also trigger the cascade deletion of the replicas in t_dps_block_replica. However, in the meantime, the BlockMonitor agent can find out that the block replicas in t_dps_block_replica are now empty, and will try to delete them. I think this is the cause of the deadlock...
4) Oracle detects the deadlock and rolls back one of the two statements, leaving the blocks in an inconsistent state.
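To make the suspected mechanism concrete, here is a small sketch of how a deadlock shows up as a cycle in a wait-for graph. The ticket does not say which rows each session actually holds or waits for, so the lock assignment in the demo is purely hypothetical, and `find_deadlock` is a toy model, not how Oracle detects ORA-00060 internally.

```python
def find_deadlock(holds, wants):
    """Return the list of sessions forming a wait-for cycle, or None.

    holds: dict mapping session -> set of resources it has locked
    wants: dict mapping session -> resource it is blocked on (or None)
    """
    # Map every locked resource back to the session that owns it.
    owner = {res: s for s, resources in holds.items() for res in resources}
    for start in holds:
        path, current = [], start
        # Follow "is waiting for the owner of" edges until they end or repeat.
        while current is not None and current not in path:
            path.append(current)
            current = owner.get(wants.get(current))
        if current is not None:  # walked back into the path: a cycle
            return path[path.index(current):]
    return None


if __name__ == "__main__":
    # HYPOTHETICAL lock assignment, for illustration only.
    holds = {
        "FileDeleteTMDB": {"t_dps_block row"},
        "BlockMonitor": {"t_dps_block_replica rows"},
    }
    wants = {
        "FileDeleteTMDB": "t_dps_block_replica rows",  # cascade delete
        "BlockMonitor": "t_dps_block row",
    }
    print(find_deadlock(holds, wants))
```

If either session is not blocked (its entry in `wants` is `None`), there is no cycle and the function returns `None`, which corresponds to the usual case where the two deletions do not overlap in time.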

I'm not sure this needs to be fixed right now, since recovery is easy: if the blocks are left in an inconsistent state, you wanted to invalidate them anyway, so you can simply repeat the invalidation command after a while to clean them up.
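The "repeat after a while" recovery could in principle be automated with a small retry wrapper. This is a sketch only: `DeadlockError` and `retry_on_deadlock` are hypothetical names, not part of PhEDEx or of any Oracle driver, and the caller would have to map the real ORA-00060 error onto the exception.

```python
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for a driver error carrying ORA-00060."""


def retry_on_deadlock(operation, attempts=3, delay_seconds=0.0, sleep=time.sleep):
    """Call operation(); on DeadlockError wait and retry, up to `attempts` calls.

    The last DeadlockError is re-raised if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except DeadlockError:
            if attempt == attempts:
                raise
            sleep(delay_seconds)
```

Retrying is reasonable here only because the invalidation is meant to remove the block anyway, so repeating it cannot undo anything the operator wanted to keep.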

There are also a couple of workarounds for this issue. For example, the operators can first invalidate all files and file replicas without deleting the empty blocks (using the appropriate options in FileDeleteTMDB), then wait for BlockMonitor to delete the empty block replicas, and finally complete the invalidation by deleting the empty blocks with FileDeleteTMDB.
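The ordering of that workaround can be sketched with a toy in-memory model. Everything here is hypothetical (the `ToyTMDB` class, the function names, the block and node names); it stands in for the real tables and agents only to show why the second FileDeleteTMDB pass comes after BlockMonitor has finished.

```python
class ToyTMDB:
    """In-memory stand-in for the tables involved; NOT the real schema."""

    def __init__(self):
        self.blocks = {"block1"}
        self.files = {"block1": {"f1", "f2"}}
        self.block_replicas = {"block1": {"T1_A", "T2_B"}}


def invalidate_files_keep_empty(db, block):
    # Phase 1: remove the files but leave the (now empty) block in place.
    db.files[block].clear()


def block_monitor_cycle(db):
    # Stand-in for BlockMonitor: drop block replicas of blocks with no files.
    for block, files in db.files.items():
        if not files:
            db.block_replicas[block].clear()


def delete_empty_block(db, block):
    # Phase 2: refuse to run while BlockMonitor may still have work to do,
    # so the two deletions never overlap (an assumed guard, for illustration).
    if db.files[block] or db.block_replicas[block]:
        raise RuntimeError("block is not empty yet; wait for BlockMonitor")
    db.blocks.discard(block)
    del db.files[block]
    del db.block_replicas[block]


if __name__ == "__main__":
    db = ToyTMDB()
    invalidate_files_keep_empty(db, "block1")
    block_monitor_cycle(db)           # wait for the agent to clean up
    delete_empty_block(db, "block1")  # now nothing is left to contend over
    print(db.blocks)
```

The point of the split is that each phase touches the block replicas from one side only, so the two deleters are never active on the same block at the same time.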

But this problem should be kept in mind when you implement a central Invalidation agent to act on the new "invalidate" requests.

Cheers,
Nicolo'