dmwm / PHEDEX

CMS data-placement suite
8 stars 18 forks source link

Some BlockLatency records never marked as completed #899

Closed ericvaandering closed 10 years ago

ericvaandering commented 10 years ago

Original Savannah ticket 95854 reported by magini on Tue Jul 3 11:55:01 2012.

Hi,

currently in Prod there are two entries in the t_dps_block_latency table which are not yet marked as complete, despite the fact that the corresponding block destinations are complete (state=3):

+verbatim+ select * from t_dps_block_latency where last_replica is null and (block,destination) in (select block,destination from t_dps_block_dest where state=3)

TIME_UPDATE DESTINATION BLOCK FILES BYTES PRIORITY I TIME_SUBSCRIPTION BLOCK_CREATE BLOCK_CLOSE LATEST_REPLICA LAST_REPLICA LAST_SUSPEND PARTIAL_SUSPEND_TIME TOTAL_SUSPEND_TIME LATENCY


1341333515 8 3599181 18 6.7643E+10 1 y 1340963471 1340884580 1340944615 0 1341333515 8 3599349 10 3.8407E+10 1 y 1340963471 1340898322 1340971787 0 -verbatim-

Probable reason: 1) BlockAllocator waits until BlockMonitor has created a replica in t_dps_block_replica before creating a latency entry in t_dps_block_latency. In the meantime, transfers can start. 2) If a transfer completes before the block is recorded in t_dps_block_latency, FilePump will not create the corresponding file-level entry in t_xfer_file_latency. If this happens for ALL files in the block, there will be no file-level entries for this block, so the block-level entry will never be marked completed in t_dps_block_latency by BlockAllocator.

N.

ericvaandering commented 10 years ago

Comment by magini on Mon Dec 17 08:48:34 2012

Hi,

fix released in PHEDEX_4_1_1 and deployed to Production. Now the existing entries in t_dps_block_latency with this issue need to be fixed.

N.

ericvaandering commented 10 years ago

Closed by magini on Mon Dec 17 12:08:44 2012

ericvaandering commented 10 years ago

Comment by magini on Mon Dec 17 12:08:44 2012

Hi,

manually fixed the 36 blocks which had this issue in Production, setting t_dps_block_latency.last_replica=t_dps_block_dest.time_complete

+verbatim+ SQL> select bl.block,bl.destination,bl.last_replica,bd.time_complete from t_dps_block_dest bd join t_dps_block_latency bl on bl.block=bd.block and bl.destination=bd.destination and bl.last_replica is null and bd.state=3 2 ;

 BLOCK DESTINATION LAST_REPLICA TIME_COMPLETE

3794168 17 1347237212 3794609 17 1347241938 3793827 17 1347239960 3793655 17 1347223219 3599181 8 1340964715 3599349 8 1340972118 3936808 1262 1352005627 3941498 1262 1352181832 3893199 19 1350642452 3893200 19 1350642452 3657670 24 1342971300 3657691 24 1342971300 3657672 24 1342971300 3657686 24 1342971300 3657692 24 1342971300 3657674 24 1342971300 3657675 24 1342971300 3657677 24 1342971300 3657680 24 1342971300 3657671 24 1342971300 3657543 24 1342971300 3657545 24 1342971300 3657540 24 1342971300 3657541 24 1342971300 3686839 8 1343753570 3686880 8 1343753570 3628709 10 1342006259 3628711 10 1342006259 3628712 10 1342006259 3628744 10 1342006259 3627189 10 1342006259 3743493 2 1345517376 3849320 4 1349354553 3770380 17 1346449499 3764365 2 1346260700 3874498 4 1350085977

36 rows selected.

SQL> update (select bl.block,bl.destination,bl.last_replica,bd.time_complete from t_dps_block_dest bd join t_dps_block_latency bl on bl.block=bd.block and bl.destination=bd.destination and bl.last_replica is null and bd.state=3) v set v.last_replica=v.time_complete 2 ;

36 rows updated.

SQL> select bl.block,bl.destination,bl.last_replica,bd.time_complete from t_dps_block_dest bd join t_dps_block_latency bl on bl.block=bd.block and bl.destination=bd.destination and bl.last_replica is null and bd.state=3 2 ;

no rows selected

SQL> commit;

Commit complete.

-verbatim-

Closing N.

ericvaandering commented 10 years ago

Comment by magini on Mon Dec 17 12:27:15 2012

Hi,

InfoStatesClean started to give alerts, because I forgot to set the LATENCY for these blocks. Now fixed:

+verbatim+ 2012-12-17 18:09:49: InfoStatesClean[15207]: alert: database error: DBD::Oracle::st execute failed: ORA-01400: cannot insert NULL into ("CMS_TRANSFERMGMT"."T_LOG_BLOCK_LATENCY"."LATENCY") -verbatim0-

+verbatim+ SQL> select * from t_dps_block_latency where latency is null and last_replica is not null;

TIME_UPDATE DESTINATION BLOCK FILES BYTES PRIORITY I TIME_SUBSCRIPTION BLOCK_CREATE BLOCK_CLOSE LATEST_REPLICA LAST_REPLICA LAST_SUSPEND PARTIAL_SUSPEND_TIME TOTAL_SUSPEND_TIME LATENCY


1355767102 8 3599181 18 6.7643E+10 1 y 1340963471 1340884580 1340944615 1340964715 0 1355767102 8 3599349 10 3.8407E+10 1 y 1340963471 1340898322 1340971787 1340972118 0 1355767102 10 3627189 3 1791627829 1 y 1342005137 1341837798 1341866667 1342006259 0 1355767102 10 3628711 1 1526870 1 y 1342005137 1341881199 1341909963 1342006259 0 1355767102 10 3628712 8 3519079730 1 y 1342005137 1341881199 1341909963 1342006259 0 1355767102 10 3628744 1 1524509 1 y 1342005137 1341884347 1341913183 1342006259 0 1355767102 10 3628709 1 1535985 1 y 1342005137 1341881052 1341909861 1342006259 0 1355767102 8 3686880 61 1.2202E+11 1 y 1343722719 1343697692 1343753507 1343753570 0 1355767102 24 3657670 1 474331613 2 n 1342969424 1342805267 1342805267 1342971300 0 1355767102 24 3657540 1 43855894 2 n 1342969409 1342802932 1342802932 1342971300 0 1355767102 24 3657541 1 43847702 2 n 1342969409 1342802933 1342802933 1342971300 0 1355767102 24 3657545 1 21099509 2 n 1342969409 1342802939 1342802939 1342971300 0 1355767102 4 3849320 1 2682135492 1 y 1349294350 1349294350 1349354430 1349354553 0 1355767102 24 3657671 1 31616027 2 n 1342969424 1342805269 1342805269 1342971300 0 1355767102 24 3657686 1 54106344 2 n 1342969424 1342805289 1342805289 1342971300 0 1355767102 24 3657691 1 46016001 2 n 1342969424 1342805296 1342805296 1342971300 0 1355767102 8 3686839 11 1.9917E+10 1 y 1343722719 1343696227 1343753402 1343753570 0 1355767102 24 3657692 1 60873724 2 n 1342969424 1342805297 1342805297 1342971300 0 1355767102 24 3657672 1 31435623 2 n 1342969424 1342805270 1342805270 1342971300 0 1355767102 24 3657543 1 43806230 2 n 1342969409 1342802936 1342802936 1342971300 0 1355767102 24 3657674 1 31433867 2 n 1342969424 1342805273 1342805273 1342971300 0 1355767102 24 3657675 1 31361729 2 n 1342969424 1342805274 1342805274 1342971300 0 1355767102 24 3657677 1 46829543 2 n 1342969424 1342805277 1342805277 1342971300 0 1355767102 24 3657680 1 46569738 2 n 1342969424 1342805281 1342805281 1342971300 0 1355767102 4 3874498 1 174217294 1 y 1350019896 1350019896 1350086042 1350085977 0 1355767102 2 3743493 1 4950350 0 y 1345515624 1345515624 1345516989 1345517376 0 1355767102 1262 3936808 1 556395659 1 y 1351919168 1351919168 1352005667 1352005627 0 1355767102 2 3764365 1 7836549 0 y 1346258794 1346258794 1346260597 1346260700 0 1355767102 17 3770380 1 2287328206 1 y 1346399083 1346399083 1346449258 1346449499 0 1355767102 17 3793655 2 4665897656 1 y 1347201264 1347173410 1347223116 1347223219 0 1355767102 17 3794168 10 3.3390E+10 1 y 1347201258 1347186312 1347236834 1347237212 0 1355767102 17 3793827 5 1.5897E+10 1 y 1347201278 1347179758 1347239756 1347239960 0 1355767102 17 3794609 1 599354169 1 y 1347201272 1347191207 1347241496 1347241938 0 1355767102 1262 3941498 2 1269575284 1 y 1352095276 1352095276 1352181592 1352181832 0 1355767102 19 3893199 238 8.2214E+11 1 y 1350641252 1350563777 1350630286 1350642452 0 1355767102 19 3893200 40 230753022 1 y 1350641252 1350563777 1350630286 1350642452 0

36 rows selected.

SQL> ed Wrote file afiedt.buf

1 update t_dps_block_latency set 2 PARTIAL_SUSPEND_TIME=NULL, 3 TOTAL_SUSPEND_TIME=nvl(partial_suspend_time,0) + nvl(total_suspend_time,0), 4 latest_replica = NULL, 5 last_suspend = NULL, 6 latency = last_replica - greatest(block_create,time_subscription) - (nvl(total_suspend_time,0)+nvl(partial_suspend_time,0)) 7* where latency is null and last_replica is not null SQL> r 1 update t_dps_block_latency set 2 PARTIAL_SUSPEND_TIME=NULL, 3 TOTAL_SUSPEND_TIME=nvl(partial_suspend_time,0) + nvl(total_suspend_time,0), 4 latest_replica = NULL, 5 last_suspend = NULL, 6 latency = last_replica - greatest(block_create,time_subscription) - (nvl(total_suspend_time,0)+nvl(partial_suspend_time,0)) 7* where latency is null and last_replica is not null

36 rows updated.

SQL> select * from t_dps_block_latency where latency is null and last_replica is not null;

no rows selected

SQL> commit; -verbatim-

ericvaandering commented 10 years ago

Comment by magini on Mon Dec 17 12:28:20 2012

+verbatim+ 2012-12-17 18:09:49: InfoStatesClean[15207]: alert: database error: DBD::Oracle::st execute failed: ORA-01400: cannot insert NULL into ("CMS_TRANSFERMGMT"."T_LOG_BLOCK_LATENCY"."LATENCY") ...

SQL> select * from t_dps_block_latency where latency is null and last_replica is not null;

TIME_UPDATE DESTINATION BLOCK FILES BYTES PRIORITY I TIME_SUBSCRIPTION BLOCK_CREATE BLOCK_CLOSE LATEST_REPLICA LAST_REPLICA LAST_SUSPEND PARTIAL_SUSPEND_TIME TOTAL_SUSPEND_TIME LATENCY




1355767102 8 3599181 18 6.7643E+10 1 y
1340963471 1340884580 1340944615 1340964715
0 1355767102 8 3599349 10 3.8407E+10 1 y
1340963471 1340898322 1340971787 1340972118
0 1355767102 10 3627189 3 1791627829 1 y
1342005137 1341837798 1341866667 1342006259
0 1355767102 10 3628711 1 1526870 1 y
1342005137 1341881199 1341909963 1342006259
0 1355767102 10 3628712 8 3519079730 1 y
1342005137 1341881199 1341909963 1342006259
0 1355767102 10 3628744 1 1524509 1 y
1342005137 1341884347 1341913183 1342006259
0 1355767102 10 3628709 1 1535985 1 y
1342005137 1341881052 1341909861 1342006259
0 1355767102 8 3686880 61 1.2202E+11 1 y
1343722719 1343697692 1343753507 1343753570
0 1355767102 24 3657670 1 474331613 2 n
1342969424 1342805267 1342805267 1342971300
0 1355767102 24 3657540 1 43855894 2 n
1342969409 1342802932 1342802932 1342971300
0 1355767102 24 3657541 1 43847702 2 n
1342969409 1342802933 1342802933 1342971300
0 1355767102 24 3657545 1 21099509 2 n
1342969409 1342802939 1342802939 1342971300
0 1355767102 4 3849320 1 2682135492 1 y
1349294350 1349294350 1349354430 1349354553
0 1355767102 24 3657671 1 31616027 2 n
1342969424 1342805269 1342805269 1342971300
0 1355767102 24 3657686 1 54106344 2 n
1342969424 1342805289 1342805289 1342971300
0 1355767102 24 3657691 1 46016001 2 n
1342969424 1342805296 1342805296 1342971300
0 1355767102 8 3686839 11 1.9917E+10 1 y
1343722719 1343696227 1343753402 1343753570
0 1355767102 24 3657692 1 60873724 2 n
1342969424 1342805297 1342805297 1342971300
0 1355767102 24 3657672 1 31435623 2 n
1342969424 1342805270 1342805270 1342971300
0 1355767102 24 3657543 1 43806230 2 n
1342969409 1342802936 1342802936 1342971300
0 1355767102 24 3657674 1 31433867 2 n
1342969424 1342805273 1342805273 1342971300
0 1355767102 24 3657675 1 31361729 2 n
1342969424 1342805274 1342805274 1342971300
0 1355767102 24 3657677 1 46829543 2 n
1342969424 1342805277 1342805277 1342971300
0 1355767102 24 3657680 1 46569738 2 n
1342969424 1342805281 1342805281 1342971300
0 1355767102 4 3874498 1 174217294 1 y
1350019896 1350019896 1350086042 1350085977
0 1355767102 2 3743493 1 4950350 0 y
1345515624 1345515624 1345516989 1345517376
0 1355767102 1262 3936808 1 556395659 1 y
1351919168 1351919168 1352005667 1352005627
0 1355767102 2 3764365 1 7836549 0 y
1346258794 1346258794 1346260597 1346260700
0 1355767102 17 3770380 1 2287328206 1 y
1346399083 1346399083 1346449258 1346449499
0 1355767102 17 3793655 2 4665897656 1 y
1347201264 1347173410 1347223116 1347223219
0 1355767102 17 3794168 10 3.3390E+10 1 y
1347201258 1347186312 1347236834 1347237212
0 1355767102 17 3793827 5 1.5897E+10 1 y
1347201278 1347179758 1347239756 1347239960
0 1355767102 17 3794609 1 599354169 1 y
1347201272 1347191207 1347241496 1347241938
0 1355767102 1262 3941498 2 1269575284 1 y
1352095276 1352095276 1352181592 1352181832
0 1355767102 19 3893199 238 8.2214E+11 1 y
1350641252 1350563777 1350630286 1350642452
0 1355767102 19 3893200 40 230753022 1 y
1350641252 1350563777 1350630286 1350642452
0

36 rows selected.

SQL> ed Wrote file afiedt.buf

1 update t_dps_block_latency set 2 PARTIAL_SUSPEND_TIME=NULL, 3 TOTAL_SUSPEND_TIME=nvl(partial_suspend_time,0) + nvl(total_suspend_time,0), 4 latest_replica = NULL, 5 last_suspend = NULL, 6 latency = last_replica - greatest(block_create,time_subscription) - (nvl(total_suspend_time,0)+nvl(partial_suspend_time,0)) 7* where latency is null and last_replica is not null SQL> r 1 update t_dps_block_latency set 2 PARTIAL_SUSPEND_TIME=NULL, 3 TOTAL_SUSPEND_TIME=nvl(partial_suspend_time,0) + nvl(total_suspend_time,0), 4 latest_replica = NULL, 5 last_suspend = NULL, 6 latency = last_replica - greatest(block_create,time_subscription) - (nvl(total_suspend_time,0)+nvl(partial_suspend_time,0)) 7* where latency is null and last_replica is not null

36 rows updated.

SQL> select * from t_dps_block_latency where latency is null and last_replica is not null;

no rows selected

SQL> commit; -verbatim-

reposted with proper formatting n