dmwm / PHEDEX

CMS data-placement suite

BlockAllocator failed to create block-subscriptions for new blocks in subscribed dataset #877

Open ericvaandering opened 10 years ago

ericvaandering commented 10 years ago

Original Savannah ticket 93051 reported by None on Tue Mar 27 12:38:32 2012.

Hi,

https://savannah.cern.ch/support/?127451

In the ticket above, the BlockAllocator agent never created a block-level subscription for a newly injected block, even though a dataset-level subscription already existed for its dataset.

This is the block, with destination T1_IT_CNAF_MSS:

https://cmsweb.cern.ch/phedex/datasvc/xml/prod/blockreplicas?block=/QCD_Pt-120to170_MuEnrichedPt5_TuneZ2star_8TeV_pythia6/Summer12-START50_V13-v2/GEN-SIM%238301d524-724a-11e1-9bbe-003048f0e7dc

The subscription:

https://cmsweb.cern.ch/phedex/datasvc/perl/prod/subscriptions?collapse=n&block=/QCD_Pt-120to170_MuEnrichedPt5_TuneZ2star_8TeV_pythia6/Summer12-START50_V13-v2/GEN-SIM%23*
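For anyone repeating these queries: the `%23` in the two data service URLs above is the URL-encoded `#` that separates the dataset name from the block name. A minimal Python sketch for building such URLs follows. The helper name `datasvc_url` and its parameters are illustrative assumptions; only the base URL and the query layout are taken from the links above.

```python
from urllib.parse import quote, urlencode

# Base URL as it appears in the links quoted in this ticket.
DATASVC = "https://cmsweb.cern.ch/phedex/datasvc"


def datasvc_url(api, fmt="xml", instance="prod", **params):
    """Build a data service URL (hypothetical helper, not part of PhEDEx).

    '/' and '*' are left unescaped so the result matches the links in
    this ticket; '#' is percent-encoded to '%23'.
    """
    query = urlencode(params, safe="/*", quote_via=quote)
    return f"{DATASVC}/{fmt}/{instance}/{api}?{query}"


DATASET = ("/QCD_Pt-120to170_MuEnrichedPt5_TuneZ2star_8TeV_pythia6"
           "/Summer12-START50_V13-v2/GEN-SIM")
BLOCK = DATASET + "#8301d524-724a-11e1-9bbe-003048f0e7dc"

# Replica query for the single block, and subscription query for all
# blocks of the dataset (wildcard after the '#').
replicas_url = datasvc_url("blockreplicas", block=BLOCK)
subscriptions_url = datasvc_url("subscriptions", fmt="perl",
                                collapse="n", block=DATASET + "#*")
```

Passing an unencoded `#` would make everything after it a URL fragment, so the service would only see the dataset part of the name.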

This is the transfer request. It is not time-based, so all blocks should have been added to the subscription:

https://cmsweb.cern.ch/phedex/prod/Request::View?request=377736

According to the BlockAllocator agent, the block was never considered for subscription.

Possibly the result of a race condition between two simultaneous injections of different blocks into the same dataset, which caused the BlockAllocator to skip one of the blocks when advancing t_dps_subs_dataset.time_fill_after?
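To make the hypothesis concrete, here is a minimal Python sketch of one way a time-based watermark could skip a block. It is not the BlockAllocator code: the classes, the `committed` flag and the exact comparison are assumptions chosen only to illustrate the suspected mechanism, and only the name `time_fill_after` comes from the schema.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    name: str
    time_create: float       # timestamp assigned when the injection starts
    committed: bool = False  # visible to the allocator only once committed


@dataclass
class DatasetSubscription:
    time_fill_after: float = 0.0  # watermark: blocks up to here are handled
    block_subscriptions: list = field(default_factory=list)


def allocate(sub, blocks):
    """One hypothetical allocator cycle.

    Subscribes every visible block newer than the watermark, then moves
    the watermark to the newest time_create it has just seen.
    """
    visible = [b for b in blocks
               if b.committed and b.time_create > sub.time_fill_after]
    for b in visible:
        sub.block_subscriptions.append(b.name)
    if visible:
        sub.time_fill_after = max(b.time_create for b in visible)


def simulate_race():
    # Two injections into the same dataset start almost simultaneously.
    # block_A gets the earlier timestamp but commits later than block_B.
    block_a = Block("block_A", time_create=100.0)
    block_b = Block("block_B", time_create=100.5, committed=True)
    sub = DatasetSubscription()

    allocate(sub, [block_a, block_b])  # sees only block_B, watermark -> 100.5
    block_a.committed = True           # block_A becomes visible afterwards
    allocate(sub, [block_a, block_b])  # block_A is at or below the watermark
    return sub


sub = simulate_race()
print(sub.block_subscriptions)  # block_A never gets a subscription
```

In this model the block is never examined again once the watermark has passed its timestamp, which would match the agent never considering the block for subscription.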

We need to investigate the cause, and identify all other dataset-level subscriptions with missing blocks (especially custodial ones!)

Cheers Nicolo'

ericvaandering commented 10 years ago

Comment by wildish on Wed Mar 28 03:19:42 2012

Hi Nicolo',

I'm raising the priority of this ticket, given the possible impact on custodial data, though of course you know how important it is.

Maybe it would be good to get Rapolas to investigate blocks that may have fallen through this hole while you are debugging it, so he can repair any existing damage immediately?

ericvaandering commented 10 years ago

Comment by magini on Wed Mar 28 05:23:38 2012

Hi Tony,

from a preliminary investigation last night, I think that only two custodial blocks were skipped since the deployment of PHEDEX_4_0_0 one year ago, so it is a very rare condition. The recovery procedure is also simple: just re-subscribe the blocks before someone sends a manual deletion request to the source node (which should never happen for data produced at T0, and is unlikely to happen for MC data because everyone expects automated deletion).

I will look into fixing it, but for now I think that we can survive by regularly checking that all produced blocks have a custodial subscription (Operations already perform these checks). The chance for data loss due to this bug, in my opinion, is much lower than the chance for data loss due to external factors (simply forgetting to request the subscription, for example).

Cheers Nicolo'

ericvaandering commented 10 years ago

Comment by wildish on Wed Mar 28 05:29:58 2012

Fair enough, in that case I'll put the priority back down :-)

ericvaandering commented 10 years ago

Comment by magini on Wed Mar 28 10:42:11 2012

Hi,

I double-checked all dataset-level subscriptions in the Prod instance which have "missing" block-level subscriptions. In nearly all cases, the "missing" blocks actually have a valid reason to be unsubscribed. They were either: a) deleted later with a block-level deletion or b) correctly skipped because the subscription had a time-start.
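The classification used in this check can be sketched as follows. This is a hedged illustration on in-memory data rather than the actual query that was run: the `BlockInfo` fields, the function name and the assumption that a time-start excludes blocks created before it are all introduced here for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlockInfo:
    name: str
    time_create: float
    subscribed: bool                        # has a block-level subscription
    deleted_by_block_request: bool = False  # removed by a block-level deletion


def classify_missing(blocks, time_start: Optional[float]):
    """Sort unsubscribed blocks of one dataset-level subscription by reason.

    Assumption: a subscription with a time-start only covers blocks
    created at or after that time.
    """
    result = {"deleted": [], "before_time_start": [], "unexplained": []}
    for b in blocks:
        if b.subscribed:
            continue
        if b.deleted_by_block_request:
            result["deleted"].append(b.name)
        elif time_start is not None and b.time_create < time_start:
            result["before_time_start"].append(b.name)
        else:
            result["unexplained"].append(b.name)
    return result


# Hypothetical data: only block_3 has no valid reason to be unsubscribed.
example = [
    BlockInfo("block_1", 10.0, subscribed=True),
    BlockInfo("block_2", 20.0, subscribed=False, deleted_by_block_request=True),
    BlockInfo("block_3", 30.0, subscribed=False),
]
report = classify_missing(example, time_start=None)
print(report["unexplained"])
```

Only the blocks that end up in the `unexplained` bucket are candidates for having been skipped by the bug.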

The only blocks which were skipped due to a bug are the following two:

/QCD_Pt-150_bEnriched_TuneZ2star_8TeV-pythia6-evtgen/Summer12-START50_V13-v2/GEN-SIM#b037fac0-7225-11e1-9bbe-003048f0e7dc
/QCD_Pt-120to170_MuEnrichedPt5_TuneZ2star_8TeV_pythia6/Summer12-START50_V13-v2/GEN-SIM#8301d524-724a-11e1-9bbe-003048f0e7dc

Both were injected at T2_FR_CCIN2P3 at the same time (time_create=1332236567.80183):

https://cmsweb.cern.ch/phedex/datasvc/xml/prod/data?block=/QCD_Pt-120to170_MuEnrichedPt5_TuneZ2star_8TeV_pythia6/Summer12-START50_V13-v2/GEN-SIM%238301d524-724a-11e1-9bbe-003048f0e7dc&block=/QCD_Pt-150_bEnriched_TuneZ2star_8TeV-pythia6-evtgen/Summer12-START50_V13-v2/GEN-SIM%23b037fac0-7225-11e1-9bbe-003048f0e7dc

Both blocks should have been subscribed to T1_IT_CNAF_MSS, because a dataset-level subscription without a time-start was already in place.

I'll continue to investigate the reason for the bug.

Cheers Nicolo'