todor-ivanov opened 8 months ago
Ok now, since we have already accumulated some working steps while iterating with @germanfgv on this issue, in order to provide this functionality for the T0 replays, I'd like to record some of that activity here:
* `dbs` servers in K8s do not start with a static lexicon configuration; they rather take it from `services_config`. Here is the startup log for one of them:

```
$ kubectl -n dbs logs dbs2go-global-w-79fd47d745-jdfrm
Defaulted container "dbs2go-global-w" out of: dbs2go-global-w, dbs2go-global-w-filebeat
TNS_ADMIN=/etc/tnsnames.ora
start with /etc/secrets/dbsconfig.json
$ grep lexicon /etc/secrets/dbsconfig.json
"lexicon_file": "/etc/secrets/lexicon.json"
```
* We reconfigured the `dbs` instances in preprod by changing the respective regexp for the `data_tier` fields: https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/246/diffs#8cf13424fa3ae3173e9252fa46377dbc7bd6cd91
* The lexicon distinguishes fields defined at the dataset level, like `data_tier`, `PD`, etc. (as Valentin explains in the WMCore issue), from those which are specific at the file level. When an insertion in bulk happens, this breaks the Oracle database unique constraint and throws the ORA-00001 error. That error is masked on the `dbs` side, which throws a Concurrency error to the clients, and that is exactly what the clients see in the end; but at the bottom still sits a unique constraint violation error from Oracle. The agent side saw:

```
2024-04-04 13:05:58,020:140283221800704:INFO:DBSUploadPoller:About to call insert block for: /L1Accept/Tier0_REPLAY_2024-v3182029/L1SCOUT#3a331615-f7c9-4823-91c9-c7a6bae15aac
2024-04-04 13:05:58,094:140283221800704:ERROR:DBSUploadPoller:Error trying to process block /L1Accept/Tier0_REPLAY_2024-v3182029/L1SCOUT#3a331615-f7c9-4823-91c9-c7a6bae15aac through DBS. Details: DBSError code: 110, message: 6d0f69d1208bcc4d2f65cdc5abafe8c5196e64d1e1d695a5348eb382403935a2 unable to insert files, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error, reason: DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error
```
* The server logs on the `dbs` side for a similar record were showing the ORA-00001 error:

```
[2024-04-03 00:01:17.850781226 +0000 UTC m=+456728.830418024] files.go:288: Insert Files file_id=3084044237 lfn=/store/backfill/1/data/Tier0_REPLAY_2024/MinimumBias/DQMIO/PromptReco-v28184557/000/377/880/00000/477FFB68-727F-45BF-81F2-1D60C8B3CC54.root
[2024-04-03 00:01:17.851609624 +0000 UTC m=+456728.831246425] blocks.go:181: Insert Blocks
INSERT INTO cms_dbs3_k8s_global_owner.BLOCKS
(block_id,block_name,dataset_id,
open_for_writing,origin_site_name,
block_size,file_count,
creation_date,create_by,
last_modification_date,last_modified_by)
VALUES
(:block_id,:block_name,:dataset_id,
:open_for_writing,:origin_site_name,
:block_size,:file_count,
:creation_date,:create_by,
:last_modification_date,:last_modified_by)
&{BLOCK_ID:32917742 BLOCK_NAME:/HLTPhysics/Tier0_REPLAY_2024-PromptReco-v28184557/AOD#aaecec0b-d16d-4c6e-9027-5920685dc53e DATASET_ID:14839867 OPEN_FOR_WRITING:0 ORIGIN_SITE_NAME:T0_CH_CERN_Disk BLOCK_SIZE:2079405 FILE_COUNT:1 CREATION_DATE:1711650349 CREATE_BY:T0Replay LAST_MODIFICATION_DATE:1711650349 LAST_MODIFIED_BY:T0Replay}
[2024-04-03 00:01:17.853299409 +0000 UTC m=+456728.832936205] files.go:312: unable to insert files, error ORA-00001: unique constraint (CMS_DBS3_K8S_GLOBAL_OWNER.TUC_FL_LOGICAL_FILE_NAME) violated
```
* Strangely enough, after 24 hours all dataset records were injected, modulo one, and the only error we were left with was related to one particular block: `/L1Accept/Tier0_REPLAY_2024-v3182029/L1SCOUT#3a331615-f7c9-4823-91c9-c7a6bae15aac`. Log records from the agent side were:
```
2024-04-04 12:30:50,017:140283221800704:INFO:DBSUploadPoller:Executing retrieveBlocks method...
2024-04-04 12:30:50,018:140283221800704:INFO:DBSUploadPoller:About to call insert block for: /Cosmics/Tier0_REPLAY_2024-PromptReco-v3182029/DQMIO#38d97107-0bd6-4a61-881c-9bd08afb101b
2024-04-04 12:30:50,018:140283221800704:INFO:DBSUploadPoller:About to call insert block for: /Cosmics/Tier0_REPLAY_2024-TkAlCosmics0T-PromptReco-v3182029/ALCARECO#1bbe16a4-bcad-4f92-928b-a4fdad226bfa
2024-04-04 12:30:50,020:140283221800704:INFO:DBSUploadPoller:About to call insert block for: /MinimumBias/Tier0_REPLAY_2024-SiStripCalZeroBias-PromptReco-v3182029/ALCARECO#219c8a0f-3208-4314-86b6-092e8289c16e
2024-04-04 12:30:50,179:140283221800704:ERROR:DBSUploadPoller:Error trying to process block /L1Accept/Tier0_REPLAY_2024-v3182029/L1SCOUT#3a331615-f7c9-4823-91c9-c7a6bae15aac through DBS. Details: DBSError code: 110, message: 6d0f69d1208bcc4d2f65cdc5abafe8c5196e64d1e1d695a5348eb382403935a2 unable to insert files, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error, reason: DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error
2024-04-04 12:30:50,180:140283221800704:INFO:DBSUploadPoller:About to call insert block for: /NoBPTX/Tier0_REPLAY_2024-PromptReco-v3182029/NANOAOD#88aaca0d-5912-4a6d-a7bf-174374986b43
2024-04-04 12:30:50,339:140283221800704:INFO:DBSUploadPoller:About to call insert block for: /NoBPTX/Tier0_REPLAY_2024-LogErrorMonitor-PromptReco-v3182029/USER#c2a880cb-5d18-4839-ae85-5c197501bf66
...
2024-04-04 12:30:51,584:140283221800704:INFO:DBSUploadPoller:About to call insert block for: /NoBPTX/Tier0_REPLAY_2024-PromptReco-v3182029/DQMIO#66eb027f-bfe9-4a8e-b37b-cdbb07823043
2024-04-04 12:30:51,658:140283221800704:INFO:DBSUploadPoller:Updating database record with files injected into DBS
2024-04-04 12:30:51,663:140283221800704:INFO:DBSUploadPoller:Updating 40 blocks successfully injected into DBS
```
* Searching for the errors on this particular block, I found that this particular file injection fails from this piece of code:
https://github.com/dmwm/dbs2go/blob/28e02bde209af797af7d59d4f7e3baba25a98605/dbs/files.go#L280
and then later the field value is validated here: https://github.com/dmwm/dbs2go/blob/28e02bde209af797af7d59d4f7e3baba25a98605/dbs/files.go#L324C26-L324C43. If one goes and checks how it is validated, here: https://github.com/dmwm/dbs2go/blob/28e02bde209af797af7d59d4f7e3baba25a98605/dbs/validator.go#L359, it becomes immediately obvious that the field consulted for validating the file-level injections is not taken/constructed from the `data_tier` and other higher-level data field definitions, but is rather a separately defined one called `logical_file_name`, which is taken from here: https://gitlab.cern.ch/cmsweb-k8s/services_config/-/blob/db98113bab95485c6b5f1ec96002e1ec00c7bae3/dbs2go-global-m/lexicon.json#L3-15
* The error I observed on the server side, which led me to that conclusion, was:
```
[2024-04-05 00:03:18.702591251 +0000 UTC m=+629646.970417431] dataset_output_mod_configs.go:68: Insert DatasetOutputModConfigs
INSERT INTO cms_dbs3_k8s_global_owner.DATASET_OUTPUT_MOD_CONFIGS
(ds_output_mod_conf_id,dataset_id,output_mod_config_id)
VALUES
(:ds_output_mod_conf_id,:dataset_id,:output_mod_config_id)
&{DS_OUTPUT_MOD_CONF_ID:38366926 DATASET_ID:14839992 OUTPUT_MOD_CONFIG_ID:25163930}
[2024-04-05 00:03:18.704100146 +0000 UTC m=+629646.971926341] dataset_output_mod_configs.go:72: unable to insert DatasetOutputModConfigs
&{BLOCK_ID:32919159 BLOCK_NAME:/L1Accept/Tier0_REPLAY_2024-v3182029/L1SCOUT#3a331615-f7c9-4823-91c9-c7a6bae15aac DATASET_ID:14839992 OPEN_FOR_WRITING:0 ORIGIN_SITE_NAME:T0_CH_CERN_Disk BLOCK_SIZE:691037 FILE_COUNT:2 CREATION_DATE:1712162019 CREATE_BY:T0Replay LAST_MODIFICATION_DATE:1712162019 LAST_MODIFIED_BY:T0Replay}
[2024-04-05 00:03:18.709809489 +0000 UTC m=+629646.977635686] bulkblocks2.go:995: insertFilesViaChunks processed 1 goroutines with ids [3084164837 3084164877], elapsed time 733.305µs
[2024-04-05 00:03:18.710679899 +0000 UTC m=+629646.978506084] files.go:282: unable to validate record DBSError Code:114 Description:DBS validation error when wrong pattern is provided Function:dbs.files.Validate Message: Error: nested DBSError Code:114 Description:DBS validation error when wrong pattern is provided Function:dbs.CheckPattern Message:invalid pattern for key=logical_file_name Error: invalid parameter(s)
[2024-04-05 00:03:18.711000322 +0000 UTC m=+629646.978826502] bulkblocks2.go:1002: fail to insert files chunks, trec &{IsFileValid:1 DatasetID:14839992 BlockID:32919159 CreationDate:1712275398 CreateBy:/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0 FilesMap:{mu:{state:0 sema:0} read:{v:
goroutine 4083874 [running]:
github.com/dmwm/dbs2go/dbs.Error({0xae4c20?, 0xc0009e2b90?}, 0x6e, {0xc00097c000, 0xe6}, {0xa1ae29, 0x2b})
	/go/src/github.com/vkuznet/dbs2go/dbs/errors.go:185 +0x99
github.com/dmwm/dbs2go/dbs.(*API).InsertBulkBlocksConcurrently(0xc00035e000)
	/go/src/github.com/vkuznet/dbs2go/dbs/bulkblocks2.go:744 +0x16ef
github.com/dmwm/dbs2go/web.DBSPostHandler({0xae81b0, 0xc00060c090}, 0xc0006ee200, {0x9feab6, 0xa})
	/go/src/github.com/vkuznet/dbs2go/web/handlers.go:542 +0x1274
github.com/dmwm/dbs2go/web.BulkBlocksHandler({0xae81b0?, 0xc00060c090?}, 0x454134?)
	/go/src/github.com/vkuznet/dbs2go/web/handlers.go:957 +0x45
net/http.HandlerFunc.ServeHTTP(0xe10000c00043ca01?, {0xae81b0?, 0xc00060c090?}, 0x0?)
	/usr/local/go/src/net/http/server.go:2109 +0x2f
github.com/dmwm/dbs2go/web.limitMiddleware.func1({0xae81b0?, 0xc00060c090?}, 0x0?)
	/go/src/github.com/vkuznet/dbs2go/web/middlewares.go:111 +0x38
net/http.HandlerFunc.ServeHTTP(0x94c100?, {0xae81b0?, 0xc00060c090?}, 0x11?)
	/usr/local/go/src/n
```
* It turns out that the `Concurrency Error` handler created this way can mask not only the ORA-00001 error from the database, but also validation errors coming from the `dbs` server code base.
* I am now about to reconfigure the `logical_file_name` field in the lexicon, and then we will retry the injections.
And the results are more than positive. From @germanfgv's report on the agent side:
```
2024-04-05 18:31:38,997:140175970297600:DEBUG:DBSUploadPoller:Creating dbsAPI with address https://cmsweb-testbed.cern.ch/dbs/int/global/DBSWriter
2024-04-05 18:31:38,997:140175970297600:DEBUG:DBSUploadPoller:Found block /L1Accept/Tier0_REPLAY_2024-v3182029/L1SCOUT#3a331615-f7c9-4823-91c9-c7a6bae15aac in blocks
2024-04-05 18:31:38,999:140175970297600:INFO:DBSUploadPoller:Queueing block for insertion: /L1Accept/Tier0_REPLAY_2024-v3182029/L1SCOUT#3a331615-f7c9-4823-91c9-c7a6bae15aac
2024-04-05 18:31:38,999:140175970297600:DEBUG:DBSUploadPoller:Creating dbsAPI with address https://cmsweb-testbed.cern.ch/dbs/int/global/DBSWriter
2024-04-05 18:31:38,999:140175970297600:INFO:DBSUploadPoller:Executing retrieveBlocks method...
2024-04-05 18:31:39,000:140175970297600:INFO:DBSUploadPoller:About to call insert block for: /L1Accept/Tier0_REPLAY_2024-v3182029/L1SCOUT#3a331615-f7c9-4823-91c9-c7a6bae15aac
2024-04-05 18:31:39,306:140175970297600:DEBUG:DBSUploadPoller:Got a block to close
2024-04-05 18:31:39,306:140175970297600:INFO:DBSUploadPoller:Updating database record with files injected into DBS
2024-04-05 18:31:39,330:140175970297600:INFO:DBSUploadPoller:Updating 1 blocks successfully injected into DBS
```
From the `dbs` writer logs:
```
[2024-04-05 16:31:39.064563417 +0000 UTC m=+61.069350939] dataset_output_mod_configs.go:72: unable to insert DatasetOutputModConfigs <nil>
[2024-04-05 16:31:39.254616254 +0000 UTC m=+61.259403765] blocks.go:181: Insert Blocks
INSERT INTO cms_dbs3_k8s_global_owner.BLOCKS
(block_id,block_name,dataset_id,
open_for_writing,origin_site_name,
block_size,file_count,
creation_date,create_by,
last_modification_date,last_modified_by)
VALUES
(:block_id,:block_name,:dataset_id,
:open_for_writing,:origin_site_name,
:block_size,:file_count,
:creation_date,:create_by,
:last_modification_date,:last_modified_by)
&{BLOCK_ID:32919750 BLOCK_NAME:/L1Accept/Tier0_REPLAY_2024-v3182029/L1SCOUT#3a331615-f7c9-4823-91c9-c7a6bae15aac DATASET_ID:14839992 OPEN_FOR_WRITING:0 ORIGIN_SITE_NAME:T0_CH_CERN_Disk BLOCK_SIZE:691037 FILE_COUNT:2 CREATION_DATE:1712162019 CREATE_BY:T0Replay LAST_MODIFICATION_DATE:1712162019 LAST_MODIFIED_BY:T0Replay}
[2024-04-05 16:31:39.260464202 +0000 UTC m=+61.265251713] bulkblocks2.go:995: insertFilesViaChunks processed 1 goroutines with ids [3084212117 3084212157], elapsed time 828.803µs
[2024-04-05 16:31:39.267165923 +0000 UTC m=+61.271953803] files.go:288: Insert Files file_id=3084212117 lfn=/store/backfill/1/data/Tier0_REPLAY_2024/L1Accept/L1SCOUT/v3182029/000/378/788/00000/7fa128e6-a29f-4c2d-8c25-2ebb81e54774.root
[2024-04-05 16:31:39.277333904 +0000 UTC m=+61.282121413] files.go:288: Insert Files file_id=3084212157 lfn=/store/backfill/1/data/Tier0_REPLAY_2024/L1Accept/L1SCOUT/v3182029/000/378/788/00000/bdebda18-d654-4731-8a9e-18dd0b47403f.root
[2024-04-05 16:31:39.279887456 +0000 UTC m=+61.284674964] bulkblocks2.go:808: 02adabb6b1bebd9e8845ffe10e6cf08ea44515718fd44085a338cdf181bd5ec8 insert FileLumi list sequentially 1 records
[2024-04-05 16:31:39.283216813 +0000 UTC m=+61.288004337] bulkblocks2.go:808: 02adabb6b1bebd9e8845ffe10e6cf08ea44515718fd44085a338cdf181bd5ec8 insert FileLumi list sequentially 1 records
```
* and also DAS now shows all file details
Todor, what worries me is this line:
```
[2024-04-05 16:31:39.279887456 +0000 UTC m=+61.284674964] bulkblocks2.go:808: 02adabb6b1bebd9e8845ffe10e6cf08ea44515718fd44085a338cdf181bd5ec8 insert FileLumi list sequentially 1 records
```
which claims that the FileLumi list was inserted sequentially. We should not degrade the DBS code back to sequential processing, and yet somehow that is the case here. It may be that the FileLumi list is small enough, i.e. it is the last chunk; I would appreciate it if you could check that.
Hi @vkuznet, it was a test replay from T0 and we know for sure it was a short one. The whole block consists of only two fairly short files.
This issue is a followup of https://github.com/dmwm/WMCore/issues/11931 tracking the change on the DBS side.