Closed belforte closed 5 months ago
side task
/afs/cern.ch/user/b/belforte/WORK/DBS/migDbg.py
and put in GH
this happened again https://cms-talk.web.cern.ch/t/crab-job-pubilcation-failed/39892 and again I lost a couple of hours on it.
Better fix the code.
take this change to
dump of most recent error in human format (i.e. pprint)
{'error': {'code': 125,
'function': 'dbs.migrate.SubmitMigration',
'message': 'Migration request '
'/RelValTTbar_14TeV/CMSSW_14_0_0_pre3-PU_140X_mcRun4_realistic_v1_STD_2026D98_PU-v1/GEN-SIM-DIGI-RAW#c627abb1-b2eb-49a5-9c9f-174fbd82ca24, '
'id=0',
'reason': 'migration request '
'/RelValTTbar_14TeV/CMSSW_14_0_0_pre3-PU_140X_mcRun4_realistic_v1_STD_2026D98_PU-v1/GEN-SIM-DIGI-RAW#c627abb1-b2eb-49a5-9c9f-174fbd82ca24 '
'is already exist in DB with id=4435739',
'stacktrace': '\\ngoroutine 1684332 '
'[running]:\\ngithub.com/dmwm/dbs2go/dbs.Error({0xae4bc0?, '
'0xc00063a9c0?}, 0x7d, {0xc00069a140, 0xa0}, '
'{0xa0d471, '
'0x1b})\\n\\t/go/src/github.com/vkuznet/dbs2go/dbs/errors.go:185 '
'+0x99\\ngithub.com/dmwm/dbs2go/dbs.(*API).SubmitMigration(0xc000196000)\\n\\t/go/src/github.com/vkuznet/dbs2go/dbs/migrate.go:679 '
'+0x5e5\\ngithub.com/dmwm/dbs2go/web.DBSPostHandler({0xae81b0, '
'0xc000640108}, 0xc000d4aa00, {0x9fb1c9, '
'0x6})\\n\\t/go/src/github.com/vkuznet/dbs2go/web/handlers.go:561 '
'+0x148a\\ngithub.com/dmwm/dbs2go/web.MigrationSubmitHandler({0xae81b0?, '
'0xc000640108?}, '
'0x454134?)\\n\\t/go/src/github.com/vkuznet/dbs2go/web/handlers.go:968 '
'+0x2f\\nnet/http.HandlerFunc.ServeHTTP(0x890000c000c02c01?, '
'{0xae81b0?, 0xc000640108?}, '
'0x0?)\\n\\t/usr/local/go/src/net/http/server.go:2109 '
'+0x2f\\ngithub.com/dmwm/dbs2go/web.limitMiddleware.func1({0xae81b0?, '
'0xc000640108?}, '
'0x0?)\\n\\t/go/src/github.com/vkuznet/dbs2go/web/middlewares.go:111 '
'+0x38\\nnet/http.HandlerFunc.ServeHTTP(0x94c100?, '
'{0xae81b0?, 0xc000640108?}, '
'0x11?)\\n\\t/usr/local/go/src/net/http/serve'},
'exception': 400,
'http': {'code': 400,
'method': 'POST',
'path': '/dbs/prod/phys03/DBSMigrate/submit',
'remote_addr': '10.100.164.0:47826',
'timestamp': '2024-04-25 08:12:54.165349637 +0000 UTC '
'm=+1294416.993904549',
'user_agent': 'DBSClient/Unknown/',
'x_forwarded_for': '137.138.157.32',
'x_forwarded_host': 'dbs-prod2.cern.ch'},
'message': 'DBSError Code:125 Description:DBS Migration error '
'Function:dbs.migrate.SubmitMigration Message:Migration request '
'/RelValTTbar_14TeV/CMSSW_14_0_0_pre3-PU_140X_mcRun4_realistic_v1_STD_2026D98_PU-v1/GEN-SIM-DIGI-RAW#c627abb1-b2eb-49a5-9c9f-174fbd82ca24, '
'id=0 Error: migration request '
'/RelValTTbar_14TeV/CMSSW_14_0_0_pre3-PU_140X_mcRun4_realistic_v1_STD_2026D98_PU-v1/GEN-SIM-DIGI-RAW#c627abb1-b2eb-49a5-9c9f-174fbd82ca24 '
'is already exist in DB with id=4435739',
'type': 'HTTPError'}
notice that document["error"]["code"] == 125, which matches the list of error codes from DBS [1]. Maybe this issue is a good place to start with https://github.com/dmwm/CRABServer/issues/7469
[1] https://github.com/dmwm/dbs2go/blob/28e02bde209af797af7d59d4f7e3baba25a98605/dbs/errors.go#L64
Thanks Dario,
I am aware. And the current code already uses the error dictionary above. But there is no use for the DBS error code here, since it is the useless "Migration failure" and I have to parse the reason string :-(
As much as I hate to parse messages, I'd rather not ask for a change in semantics where the API returns success and reports the existing migrationId, as the old server was doing. I tried to "protect us against changes" via https://github.com/dmwm/dbs2go/pull/112 . Let's hope the DBS maintainer(s) can at least merge that.
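For illustration only, here is a minimal sketch of what "parse the reason string" could look like for the error dictionary dumped above. It is not the code in the branch: the function name `existing_migration_id`, the argument name `error_doc` and the regular expression are assumptions, and the pattern simply mirrors the current wording of the `reason` field, which may change in a future DBS release.

```python
import re

# Value of error['code'] in the dump above (DBS migration error).
DBS_MIGRATION_ERROR = 125

# Assumption: this wording is copied from the 'reason' string in the dump;
# it is not a documented, stable format.
_EXISTING_ID_RE = re.compile(r"already exist in DB with id=(\d+)")


def existing_migration_id(error_doc):
    """Return the id of the already existing migration request, or None.

    error_doc is the dictionary shown above (hypothetical argument name).
    None is returned when the error is not a migration error or when the
    reason string does not have the expected wording.
    """
    error = error_doc.get("error", {})
    if error.get("code") != DBS_MIGRATION_ERROR:
        return None
    match = _EXISTING_ID_RE.search(error.get("reason", ""))
    return int(match.group(1)) if match else None


# Usage with a shortened copy of the dump above (block name abbreviated).
doc = {
    "error": {
        "code": 125,
        "reason": "migration request /A/B/C#uuid "
                  "is already exist in DB with id=4435739",
    }
}
print(existing_migration_id(doc))
```

Returning None instead of raising leaves the caller free to fall back to the generic failure handling whenever the message wording changes.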
all of this is now in my branch https://github.com/belforte/CRABServer/tree/deal-with-failed-migrations-8244 and at least Publisher starts (on crab-dev-tw01). I will try some publications. As usual it is not easy to test migrations, and in particular failed migrations !
my test task got 50 files correctly published. So at least code is not badly broken by the changes to unify DBS access in the common PublisherDbsUtils.
put my branch on preprod and ran a couple of Jenkins ST tests. Publication for FTS ASO (TaskPublish.py) appears all OK (of course no migrations). But I get unexpected problems with Rucio ASO. That thread is now moved to #8376
In the meantime I am declaring changes to TaskPublish.py (the FTS one) tested enough to make a PR.
I still need to find a way to test migrations. But do not know how to quickly find a dataset which nobody used yet !
Closed via #8378
I could not test failed migrations, but at worst they will be still broken and I will debug when I have an example
see https://github.com/dmwm/CRABServer/issues/7469#issuecomment-1949403081
I did this [1] in a python shell in the Publisher, following the example in /afs/cern.ch/user/b/belforte/WORK/DBS/migDbg.py [2]. NOTE: need to change from cmsweb to cmsweb-prod. The migration server does not run on the "for users" cmsweb cluster.
[1] I replayed the migration request that failed in the TaskPublish script, then deleted the existing migrationId and submitted again. Eventually the new migration failed with status 4, which means "block already at destination", which is just fine. So I manually ran TaskPublish for that task and everything went OK.
[2]
cat migDbg.py