DBSWriter fails with "fail to inject FL chunk"

belforte commented 2 years ago

Hi @vkuznet , sorry to report a new problem. (I think the missing lumis still is more important !).

Can you find why this block write fails and what should be done to fix it ? As per your instructions I list here the example of block insert with curl using a file from my AFS public area, you should be able to copy/paste. For clarity, the file is /afs/cern.ch/user/b/belforte/public/BadBlocks/fail_to_inject_FL_chunk.json

belforte@lxplus8s18/BadBlocks> curl --cert $X509_USER_PROXY --key $X509_USER_PROXY -H "Content-type: application/json" -d@/afs/cern.ch/user/b/belforte/public/BadBlocks/fail_to_inject_FL_chunk.json https://cmsweb.cern.ch/dbs/prod/phys03/DBSWriter/bulkblocks [{"error":{"reason":"DBSError Code:110 Description:DBS DB insert record error Function:dbs.filelumis.insertFLChunk Message:fail to inject FL chunk Error: nil","message":"6bf548cb1ead5b67471c907a06c890cc06684240c1eb1331d2a88bb2b283a24a unable to insert FileLumis records for /store/user/jinfeng/CrossSection/MC2016/Tight/Pythia8_DPStoJJ/MC2016_SKIM_JinfengLiu_Tight/220615_103610/0000/BPHSkim_86.root, fileID 1115340921, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.filelumis.insertFLChunk Message:fail to inject FL chunk Error: nil","function":"dbs.bulkblocks.InsertBulkBlocksConcurrently","code":110},"http":{"method":"POST","code":400,"timestamp":"2022-06-22 14:13:45.676802433 +0000 UTC m=+68891.739753330","path":"/dbs/prod/phys03/DBSWriter/bulkblocks","user_agent":"curl/7.59.0","x_forwarded_host":"cmsweb-k8s-prodsrv.cern.ch","x_forwarded_for":"137.138.154.184","remote_addr":"10.100.196.128:51854"},"exception":400,"type":"HTTPError","message":"DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:6bf548cb1ead5b67471c907a06c890cc06684240c1eb1331d2a88bb2b283a24a unable to insert FileLumis records for /store/user/jinfeng/CrossSection/MC2016/Tight/Pythia8_DPStoJJ/MC2016_SKIM_JinfengLiu_Tight/220615_103610/0000/BPHSkim_86.root, fileID 1115340921, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.filelumis.insertFLChunk Message:fail to inject FL chunk Error: nil Error: nested DBSError Code:110 Description:DBS DB insert record error Function:dbs.filelumis.insertFLChunk Message:fail to inject FL chunk Error: nil"}]belforte@lxplus8s18/BadBlocks>

our reference: https://github.com/dmwm/CRABServer/issues/7286

belforte commented 2 years ago

side note, in same directory I have put also the gzipped version /afs/cern.ch/user/b/belforte/public/BadBlocks/fail_to_inject_FL_chunk.json.gz but when I try the gzip post as per https://github.com/dmwm/dbs2go/blob/master/docs/DBSWriter.md I get an odd message about unable to get gzip reader, did I do something wrong ? Or is this something at your end ?

belforte@lxplus8s18/BadBlocks> curl --cert $X509_USER_PROXY --key $X509_USER_PROXY -H "Content-type: application/json" -H "Content-Encoding: gzip" --data-binary -d@/afs/cern.ch/user/b/belforte/public/BadBlocks/fail_to_inject_FL_chunk.json.gz https://cmsweb.cern.ch/dbs/prod/phys03/DBSWriter/bulkblocks [{"error":{"reason":"gzip: invalid header","message":"unable to get gzip reader","function":"web.DBSPostHandler","code":120},"http":{"method":"POST","code":500,"timestamp":"2022-06-22 14:16:28.908817602 +0000 UTC m=+69036.132424787","path":"/dbs/prod/phys03/DBSWriter/bulkblocks","user_agent":"curl/7.59.0","x_forwarded_host":"cmsweb-k8s-prodsrv.cern.ch","x_forwarded_for":"137.138.154.184","remote_addr":"10.100.196.128:39040"},"exception":500,"type":"HTTPError","message":"DBSError Code:120 Description:DBS reader I/O error, e.g. unable to read HTTP POST payload Function:web.DBSPostHandler Message:unable to get gzip reader Error: gzip: invalid header"}]belforte@lxplus8s18/BadBlocks> belforte@lxplus8s18/BadBlocks>

vkuznet commented 2 years ago

@belforte , you reported several issues and dump them into the same github repo. Let me break down them one by one:

there is an issue with data block injection, and this is this issue
there is an issue with gzip content inject and it is separate issue, but you mentioned here
there is an DBSClient (python based) issue and you still reported it int dbs2go repo (https://github.com/dmwm/dbs2go/issues/54)

So, please open only one issue which will be easy to trace and report back. Please open issues in correct repo, the DBSClient python based issue is not dbs2go and has nothing to do with it.

Finally, please read careful DBS client usage over here which states that for json input you should use:

curl -L -k --key ~/.globus/userkey.pem --cert ~/.globus/usercert.pem \
    -H "Content-type: application/json" -H "Accept: application/json" \
    -d@/tmp/upload.json \
     https://xxx.cern.ch/dbs2go/fileArray

please note usage of -d flag, while for gzip input you should use

curl -L -k --key ~/.globus/userkey.pem --cert ~/.globus/usercert.pem \
    -H "Content-type: application/json" -H "Accept: application/json" \
    -H "Content-Encoding: gzip" --data-binary @$PWD/b.json.gz \
     https://xxx.cern.ch/dbs2go/bulkblocks

there is no -d flag, but you need to use --data-binary one. This should resolve your question/issue with gzip encoded input.

Now, I'll concentrate in here only on block injection. I only need to ask one question. According to https://github.com/dmwm/CRABServer/issues/7286 it is happened 13 days ago, right? And, you repeated the injection again today with curl command, right?

vkuznet commented 2 years ago

I don't know if I'll have time to debug it today, and tomorrow till next Tue Jun 28th I'm taking time off. Therefore, please be patient with my responses as most likely I can look at this issue only later next week.

belforte commented 2 years ago

thanks @vkuznet . Yes, it can wait. Usual "we monitor our logs and found it, but users did not realize".

Sorry for #54, I moved it to correct repository. Sorry for not having payed proper attention to different format for posting via curl of zipped files, indeed now I get same error message as with plain json, as it should be.

Yes, I was testing this today with the block which is now in my AFS public, but that is different from the original one 13 days ago, I was in vacation at the time and could not look at it right away. Anyhow you should be able to reproduce it at any time.

vkuznet commented 2 years ago

Thanks. Meanwhile I looked at the code and the issue is related to this block: https://github.com/dmwm/dbs2go/blob/master/dbs/filelumis.go#L334-340 which should not return error. It seems to me that you block content is such that the last chunk does not have any lumis in it and it hits this block and entire transaction fails. I'll fix it once I'll get time.

vkuznet commented 2 years ago

fix is deployed on testbed, but I can't test it with your data due to missing parent blocks for you JSON in testbed DB. Please see possible solution in https://github.com/dmwm/CRABServer/issues/7286#issuecomment-1163507876

vkuznet commented 2 years ago

@belforte , after many many iterations and debugging I finally managed to fix parentage issue in DBSMigration server. There were few issues:

the DBSWriter server has not had relevant information about block parentage
the order issue with migration request, i.e. parents should be processed before children
and, very tricky, data-type differences between SQLite and ORACLE for fetching table id's

I deployed new version of DBSReader/DBSWriter/DBSMigrate/DBSMigration servers to testbed and tested the migration process using /GenericTTbar/HC-CMSSW_9_2_6_91X_mcRun1_realistic_v2-v2/AODSIM which has GEN-SIM-RAW and GEN-SIM parentage chain. In my local setup I successfully migrated this dataset along with his parents. For that I was used new DBSReader deployed on testbed. I still need to do more tests and validation, but would like to let you know that this issue is moving (although slowly) forward.

belforte commented 2 years ago

WOW. Sounds like moving at breathtaking speed to me ! Thanks Valentin. I remember how much DBS2 and DBS3 developers struggled with migrations, it is not a trivial issue. Let's give it the time it takes.

vkuznet commented 2 years ago

@belforte I just deployed new version of DBSReader/DBSWriter to cmsweb.cern.ch (but not yet to cmsweb-prod.cern.ch). The new version should fix this issue and issue with failed migration reported in https://github.com/dmwm/CRABServer/issues/7315, https://github.com/dmwm/CRABServer/issues/7290

Could you please try to use cmsweb to see if the issue is gone, and please try to migrate any of the AODSIM datasets (which suppose to have GEN-SIM-RAW and GEN-SIM parentage using cmsweb.cern.ch/dbs/prod/global/DBSReader and testbed DBSMigration server (cmsweb-testbed.cern.ch/dbs/int/phys03/DBSMigration).

I also fixed issue with dbsClient reported in https://github.com/dmwm/DBSClient/issues/71 but I need time to create new dbsclient release with that fix.

belforte commented 2 years ago

I am working on it. But have not been able to put much time on it yet.

belforte commented 2 years ago

I tested inserting again a handful of blocks which had failed with this error, using DBSWriter from cmsweb.cern.ch (not cmsweb-prod) and it worked smoothly. I see no point in doing all of them, are large blocks and takes quite some time. Do you think that changes are such to require extensive validation, like switching CRAB Publisher to use cmsweb reader/writer for just one or two iterations ? Or one day ?

I have tested one failed migration on testbed and it worked. But I think a bit more testing is useful here. I have to get back to parasitic replay with proper status checking. Please be patient.

vkuznet commented 2 years ago

@belforte , if you don't mind could you please turn on CRAB Publisher to make double migration as you did before, i.e. migrate dataset/blocks using existing server and using new Go one on testbed. I would like to see results after a day of running. If it successful then we'll schedule migration next week. The reason I ask is that I'll be away during Jul/Aug (Jul 7-8th, Jul 13-19th, Jul 28-Aug 15th) and I'll not be around if issue will arise during my absence and I want to deploy to production k8s new Go migration before I leave on vacation.

belforte commented 2 years ago

I am doing my best. I need to add code to check test migration status so that I get a more useful log. I have some family issues beside being slow and clumsy :-(

belforte commented 2 years ago

@vkuznet in the test migration,I also make sure to use Reader from cmsweb.cern.ch, not cmsweb-prod. Right ?

vkuznet commented 2 years ago

yes, so far you need to use DBSReader from cmsweb and NOT cmsweb-prod.

vkuznet commented 2 years ago

I already deployed new server which will have new code. It means that for newly created requests it will be applied, but I can't apply it for existing ones as it requires to go to DB level and adjust table values. And, migrationId=0 means that request just assigned, but it does not means it is an error. Your code should check for code 3 or 9 to determine the error. The former is migration failure (and you can see retry count) and later when DBSMigration server done all retries it should assign code 9.

belforte commented 2 years ago

enlightening ! thanks

belforte commented 2 years ago

if I try to migrate a block, and I get back migId=0 and status=9, i.e. migration was already requested and terminally failed. How do we proceed to find which migrationId it was so that it can be deleted and the migration tried again ? e.g. DBSReader could have been down for a while, or any other thing which is fixed by retrying. (I am currently ignoring this in the code I am writing, if I find migrationId=0 and status=9 I simply give up)

vkuznet commented 2 years ago

do you already have this use case or is it hypothetical example? When DBSMigrate server accept request and put it in DB it should have non-zero migration id. Moreover, you are the user who creates migration request in a first place, therefore you know your input. I just need to understand the use-case and how it happens to address it properly.

belforte commented 2 years ago

the problem here is your usage of "you". It is not like I request one migration every now and then by CLI and can track them in a notebook. You have seen the amount of requests CRAB makes, every dataset which people read will very likely need to be migrated to phys03. So there are multiple processes running in parallel which may request migration of same block and do not communicate lists of requests. Or one process may submit requests and crash for whatever reason. So another process will come up later and try again to migrate same block. Current code is not keeping its own DB of migration requests to be always uptodate on what it was done and make proper choice. Those events are surely rare, like migration failures are rare. But in view of having something that can run unattended for months, we should think about all possible glitches. Of course one may decide to get rid of migrations, or migrate entire datasets centrally whenever they are created, or many other things. But I'd rather not make big design changes now.

I have no problem in dealing with migrations which eventually succeed, current code simply skips publishing those data and will try again to publish in next iteration (one every 30min) when most likely it will find blocks alrady at destination, skip all migration stuff, and publish. The only problem is how to deal with terminally failed migrations. Which of course are exceptional cases tied to extreme circumstances (DBS reader down ?) so I have no specific example.

belforte commented 2 years ago

But I can bet money, that if we do nothing, at some point we will have some blocks which we can't migrate because and attempt to do that failed sometimes in the past and we do not know when, why, or who requested. Some test code ? Some debugging session gone wild ? Someone else trying to learn ?

vkuznet commented 2 years ago

Stefano, from your description I conclude that this is hypothetical example. As I wrote before, migid by default has value 0. If migration request was picked up and injected to DB then we have persistent history of requests. From that point it will have non zero id as ID is assigned by database.

Said that, I followed Python server logic, i.e. it has two servers, one DBSMigrate to provide migration APIs, and another daemon to process migration requests. There is also a third daemon process to clean-up DB whose logic I also borrowed from Python server.

As such, if implementation is complete we should not see migId=0 with migation_status=9. I suggest only to address this issue if we'll have concrete example which I can debug.

For completeness, the clean-up process is defined here and it uses this SQL template. As you can see from template the clean-up happens in 3 cases:

if migration modification date is more than 3 months (this is default value)
if migration request has status 3 and 3 retries
if migration request has status 9 and 3 retries According to this logic you should have plenty of time (3 months) to investigate failed migration, and even if they are gone they can be submitted again.

Do it answer your concern?

belforte commented 2 years ago

I tried to reply from my phone, but it did not work. Thanks for clarifying. Earlier discussion left me with the belief that submitMigration always returns migId=0 if a migration request for this block exists already. Now you are saying that in such cases it will instead return migId of the ongoing (or failed) migrations. I.e. same behavior of old migration server. That's fine. Well.. I hope I got it correctly now, let me know if not. Sorry for the misunderstanding. I know that writing detailed API specificactions is a pain, yet this exchanges shows how useful they are ! I know that I can be annoyingly pedantic at times, but when it comes to running at scale, devil is in the details, in those things which only happens once in a billion. I prefer to have as many details as possible clarified upfront.

belforte commented 2 years ago

I started testing (slowly, my code may be buggy). Here's a migration which failed and it is being retried. Is this expected in the end, or is it already a bad sign ? No terminally failed (status=9) yet:

HTTP=GET URL=https://cmsweb-testbed.cern.ch:8443/dbs/int/phys03/DBSMigrate method=status params={'migration_rqst_id': 465131} data={} headers={'Content-Type': 'application/json', 'Accept': 'application/json', 'UserID': 'belforte@lxplus8s23.cern.ch', 'User-Agent': 'DBSClient/Unknown/'}
[{'create_by': '/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch', 'creation_date': 1657122737, 'last_modification_date': 1657122737, 'last_modified_by': '/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch', 'migration_input': '/GluGluToHHTobbVV_node_cHHH5_TuneCP5_13TeV-powheg-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#94eee320-716d-42fa-881f-fcce17cd5797', 'migration_request_id': 465131, 'migration_status': 3, 'migration_url': 'https://cmsweb.cern.ch:8443/dbs/prod/phys03/DBSReader', 'retry_count': 1}]
{'create_by': '/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch',
 'creation_date': 1657122737,
 'last_modification_date': 1657122737,
 'last_modified_by': '/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch',
 'migration_input': '/GluGluToHHTobbVV_node_cHHH5_TuneCP5_13TeV-powheg-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#94eee320-716d-42fa-881f-fcce17cd5797',
 'migration_request_id': 465131,
 'migration_status': 3,
 'migration_url': 'https://cmsweb.cern.ch:8443/dbs/prod/phys03/DBSReader',
 'retry_count': 1}

belforte commented 2 years ago

same for migrationId 465138 still at retry number 1

belforte commented 2 years ago

dbs2go-phys03-migration log for testbed reports that parent(s) were not migrated:

[2022-07-06 15:56:03.118440069 +0000 UTC m=+90775.133916907] migrate.go:864: insert bulkblocks for mid 465131 error DBSError Code:101 Description:DBS DB error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:unable to find parent lfn /store/mc/RunIISummer20UL17RECO/GluGluToHHTobbVV_node_cHHH5_TuneCP5_13TeV-powheg-pythia8/AODSIM/106X_mc2017_realistic_v6-v1/2530000/6E1538D3-A147-D843-BFA4-42662958816C.root Error: nested DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set

belforte commented 2 years ago

more migrations failing with unable to find parent lfn ... e.g. [2022-07-06 16:07:12.92894629 +0000 UTC m=+91444.944423124] migrate.go:864: insert bulkblocks for mid 471331 error DBSError Code:101 Description:DBS DB error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:unable to find parent lfn /store/mc/RunIISummer20UL16RECO/TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8/AODSIM/106X_mcRun2_asymptotic_v13-v2/00000/6A53EC95-D3B5-3D4D-9F4E-D433C4016B99.root Error: nested DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set

will this be fixed autoamtically in following retries ?

belforte commented 2 years ago

Anyhow the new code seems not to affect normal operation, so I am not running at "normal speed" with parallel migration attempts to testbed. So far is submitted 560 migration requests, of which 342 are in "failed but will retry" and after 30min still are not moving beyond retry_count=1. 196 migrations have been found successful so far. Those numbers do not need to add up, some migration may be in status 0 or 1.

CRAB Publisher is not designed to follow every migrationId until complete, it gives up after 5min and when it tries again half an hour later it may simply find that needed parents are there and go straight to inserting block.

vkuznet commented 2 years ago

Stefano, I would like to double check that you do the following:

I take your failed request 471331 and check its status:

[
{
"create_by": "/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch",
"creation_date": 1657123389,
"last_modification_date": 1657123389,
"last_modified_by": "/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch",
"migration_input": "/TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8/RunIISummer20UL16MiniAOD-106X_mcRun2_asymptotic_v13-v2/MINIAODSIM#6f974dae-834e-44c0-b1fe-c74aec11ae0f",
"migration_request_id": 471331,
"migration_status": 3,
"migration_url": "https://cmsweb.cern.ch:8443/dbs/prod/phys03/DBSReader",
"retry_count": 1
}
]

It seems you submit migration for MINIAODSIM block from cmsweb.cern.ch/dbs/prod/phys03/DBSReader to cmsweb-testbed.cern.ch/dbs/int/phys03/DBSMigrate

My question is why do you use cmsweb.cern.ch/dbs/prod/phys03/DBSReader as your migration URL and why not cmsweb.cern.ch/dbs/prod/global/DBSReader instance? I don't know the content of /prod/phys03 and as such I don't know if it has all parents. In other words why your migration_url is phys03 instance instead of global one? In my view, we always migrate from global DBS to another instance, e.g. to phys03 or to testbed, no?

vkuznet commented 2 years ago

anyway, in my local setup I can reproduce the problem. I'll have look, it seems like again the order of blocks I need to fix (similar to dataset order).

vkuznet commented 2 years ago

I identified the issue, indeed I miss the order of input block, the commit is here https://github.com/dmwm/dbs2go/commit/07170a3c4db2e50af32c24a93e8840591d344f12 and I'm awaiting the build to put it on testbed.

vkuznet commented 2 years ago

The fix in testbed now. Please do the following:

delete failed request 471331 as following or use your python APIs/tool:


cat > mig_request.json << EOF
{
"create_by":"/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch",
"migration_rqst_id":471331
}
EOF

curl -v -H "Content-type: application/json" \ -d@$PWD/mig_request.json \ https://cmsweb-testbed.cern.ch/dbs/int/phys03/DBSMigrate/remove

- submit it again as following or use your python API/tool:

cat > m.json << EOF { "migration_url": "https://cmsweb.cern.ch/dbs/prod/global/DBSReader", "migration_input": "/TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8/RunIISummer20UL16MiniAOD-106X_mcRun2_asymptotic_v13-v2/MINIAODSIM#6f974dae-834e-44c0-b1fe-c74aec11ae0f" } EOF

curl -H "Content-Type: application/json" -d@$PWD/m.json \ https://cmsweb-testbed.cern.ch/dbs/int/phys03/DBSMigrate/submit

belforte commented 2 years ago

My question is why do you use cmsweb.cern.ch/dbs/prod/phys03/DBSReader as your migration URL and why not cmsweb.cern.ch/dbs/prod/global/DBSReader instance?

It is a very good question. I was asking myself the same as I was re-reading this code, whose design is untouched from DBS2 era (mechanics of interaction with DBS changed with DBS3 and now with dbs2go, but... same strategy). If parents where in phys03 already, there would be no need to migrate.

But I was staying on the side of "it works, don't fix it". Maybe it is time to do rewrite and cleanup ! Thanks for bringing this to the forefront.

belforte commented 2 years ago

The fix in testbed now. Please do the following:

thanks I'll do first thing in the morning. Sorry today was my daughter's birthday. Yeah.. I have scripts at hands to remove and retry migrations, no problem there.

belforte commented 2 years ago

My question is why do you use cmsweb.cern.ch/dbs/prod/phys03/DBSReader as your migration URL and why not cmsweb.cern.ch/dbs/prod/global/DBSReader instance?

Forget what I wrote earlier, I misstyped the URL to use :-( fixed now.

belforte commented 2 years ago

hmm... DBSClient is failing (risiing exception) with this: (I am using debug=True)

HTTP=POST URL=https://cmsweb-testbed.cern.ch:8443/dbs/int/phys03/DBSMigrate method=remove params={} data={"migration_rqst_id": 520003, "create_by": "/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch"} headers={'Content-Type': 'application/json', 'Accept': 'application/json', 'UserID': 'crab3@crab-prod-tw01.cern.ch', 'User-Agent': 'DBSClient/Unknown/'} The server output is not a valid json, most probably you have a typo in the url. https://cmsweb-testbed.cern.ch:8443/dbs/int/phys03/DBSMigrate.

I call it as

apiMig.removeMigration({'migration_rqst_id': migrationId, 'create_by': creator})

Yet it appears to work. Maybe it could return an empty list like other API's do ? When scripting the removal of O(500) migrations, I'd like it more to have it to raise only for real errors.

belforte commented 2 years ago

sorry @vkuznet , new error. This is migrationId 535027, others are failing same way

belforte@vocms0750/dbs-logs> grep ' 535027 ' dbs2go-phys03-migration-*.log-20220706
dbs2go-phys03-migration-8fb59bb86-cqpn6.log-20220706:[2022-07-06 21:29:41.420738621 +0000 UTC m=+11851.204601639] migrate.go:1067: update migration request 535027 to status 1
dbs2go-phys03-migration-8fb59bb86-cqpn6.log-20220706:[2022-07-06 21:29:42.587322929 +0000 UTC m=+11852.371185971] migrate.go:866: insert bulkblocks for mid 535027 error DBSError Code:109 Description:DBS DB ID error for provided entity, e.g. there is no record in DB for provided value Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:01286f19ad4ab216961584093e42ebfac5f676910df15a3780a7dc592494e341 unable to find dataset_id for /HZJ_HToWW_M-125_TuneCP5_13TeV-powheg-jhugen727-pythia8/RunIISummer20UL17RECO-106X_mc2017_realistic_v6-v1/AODSIM, error DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set Error: nested DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set
dbs2go-phys03-migration-8fb59bb86-cqpn6.log-20220706:[2022-07-06 21:29:42.588177434 +0000 UTC m=+11852.372040450] migrate.go:1067: update migration request 535027 to status 3

migration details: HTTP=GET URL=https://cmsweb-testbed.cern.ch:8443/dbs/int/phys03/DBSMigrate method=status params={'migration_rqst_id': 535027} data={} headers={'Content-Type': 'application/json', 'Accept': 'application/json', 'UserID': 'belforte@lxplus8s23.cern.ch', 'User-Agent': 'DBSClient/Unknown/'}

{'create_by': '/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch', 'creation_date': 1657142911, 'last_modification_date': 1657142911, 'last_modified_by': '/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch', 'migration_input': '/HZJ_HToWW_M-125_TuneCP5_13TeV-powheg-jhugen727-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v2/MINIAODSIM#92f25318-1797-43ea-a01e-02fda4b18908', 'migration_request_id': 535027, 'migration_status': 3, 'migration_url': 'https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader', 'retry_count': 1}

belforte commented 2 years ago

anyhow things are mostly succeeding now. I leave it running. I have removed all failed migrations (as of 10min ago) by ignoring the error above, CRAB Publisher will retry them as/if needed. I also noticed that about the earlier half of them were Done by now (status=2). Curious, but... no need to dig I think.

belforte commented 2 years ago

another unable to find parent dbs2go-phys03-migration-8fb59bb86-cqpn6.log-20220706:[2022-07-06 21:54:29.112995919 +0000 UTC m=+13338.896858936] migrate.go:866: insert bulkblocks for mid 545093 error DBSError Code:101 Description:DBS DB error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:unable to find parent lfn /store/data/Run2018C/SingleMuon/RAW/v1/000/319/854/00000/5E477234-438B-E811-A5C1-FA163E2A9C42.root Error: nested DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set

yet that file is both in prod/global and prod/phys03 https://cmsweb.cern.ch/das/request?view=list&limit=50&instance=prod%2Fphys03&input=file+file%3D+%2Fstore%2Fdata%2FRun2018C%2FSingleMuon%2FRAW%2Fv1%2F000%2F319%2F854%2F00000%2F5E477234-438B-E811-A5C1-FA163E2A9C42.root

belforte commented 2 years ago

As CRAB Publisher tries again and again you can find more errors in your log, but it e.g. the latest one is still due to problems with unable to find parent lfn /store/data/Run2018B/SingleMuon/RAW/v1/..... Somehow something is wrong with that dataset. I will add code to skip it for testbed. Let's see if we catch some new fish.

belforte commented 2 years ago

the unable to find parent lfn /store/data/Run2018B/SingleMuon/RAW/v1/.... come from trying to migrate /SingleMuon/Run2018B-15Feb2022_UL2018-v1/AOD to int/phs03. There are 55 blocks in prod/global, but only 54 have migrated. problem block is /SingleMuon/Run2018B-15Feb2022_UL2018-v1/AOD#a1224fa9-5041-4ebe-b1f5-410e8e23da33 https://cmsweb.cern.ch/das/request?view=list&limit=50&instance=prod%2Fglobal&input=block+block%3D%2FSingleMuon%2FRun2018B-15Feb2022_UL2018-v1%2FAOD%23a1224fa9-5041-4ebe-b1f5-410e8e23da33

vkuznet commented 2 years ago

Stefano, thanks for providing examples. I took time off till Mon, therefore my response will be slow, and unfortunately I'm taking another time from Jul 13-19th. Anyway, I had a look at some of your blocks, and found weird things:

for block /HZJ_HToWW_M-125_TuneCP5_13TeV-powheg-jhugen727-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v2/MINIAODSIM#92f25318-1797-43ea-a01e-02fda4b18908 there is no parent blocks, but dataset /HZJ_HToWW_M-125_TuneCP5_13TeV-powheg-jhugen727-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v2/MINIAODSIM has parent dataset /HZJ_HToWW_M-125_TuneCP5_13TeV-powheg-jhugen727-pythia8/RunIISummer20UL17RECO-106X_mc2017_realistic_v6-v1/AODSIM. I don't know that such cases exists in DBS, and in fact I don't understand them. Is it possible that we have no parents for a block but do have parent for a dataset? If we, indeed, have such cases I need to understand why, and then adjust algorithm accordingly.
regarding block /SingleMuon/Run2018B-15Feb2022_UL2018-v1/AOD#a1224fa9-5041-4ebe-b1f5-410e8e23da33 I was able to migrate it in my local setup and testing now the entire dataset which as 114 parent's blocks, it will take a while. So, you may skip it for a while and I'll come back to it after I get my own tests done next week.

belforte commented 2 years ago

thanks Valentin. The "dataset has parents but block has not" smells fishy. I do not know how exactly block and dataset parents are set in DBS, but AFAIK input is always to provide parents by file. Sounds an unexepcted internal inconsistency in data base. Maybe we found a case where internal bookkeeping went wrong. I suggest to contact Yuyi.

For the SingleMuon block... I have no suggestion to offer. Yes, I am trying not to submit more migration for that dataset to testbed.

In general, we seem to have run out of migrations to do at the moment. In any case no new odd case today. Well... we can always get the list of MINIAOD* in prod/phys03 and migrate them to int/phys03 as well, w/o waiting for users to trigger migrations. But that would be an enormous amount of stuff.

Please enjoy your time off, you deserve it.

belforte commented 2 years ago

FWIW, another problem with parents from /cephfs/testbed/dbs-logs/dbs2go-phys03-migration-f4687658c-xhmxg.log-20220708

[2022-07-08 08:13:49.931820559 +0000 UTC m=+43728.557544879] migrate.go:1065: update migration request 900087 to status 1
[2022-07-08 08:13:50.032448378 +0000 UTC m=+43728.658172684] migrate.go:864: insert bulkblocks for mid 900087 error DBSError Code:101 Description:DBS DB error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:unable to find parent lfn /store/mc/RunIISummer20UL17RECO/VBFHToGG_M124_TuneCP5_13TeV-amcatnlo-pythia8/AODSIM/106X_mc2017_realistic_v6-v3/80000/AF92F330-8A40-5E42-B720-3EA35C4B9746.root Error: nested DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set
[2022-07-08 08:13:50.032757484 +0000 UTC m=+43728.658481790] migrate.go:1065: update migration request 900087 to status 3

from HTTP=GET URL=https://cmsweb-testbed.cern.ch:8443/dbs/int/phys03/DBSMigrate method=status params={'migration_rqst_id': 900087} data={} headers={'Content-Type': 'application/json', 'Accept': 'application/json', 'UserID': 'belforte@lxplus8s16.cern.ch', 'User-Agent': 'DBSClient/Unknown/'} [{'create_by': '/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch', 'creation_date': 1657267970, 'last_modification_date': 1657267970, 'last_modified_by': '/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch', 'migration_input': '/VBFHToGG_M124_TuneCP5_13TeV-amcatnlo-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v2/MINIAODSIM#0fcc114a-91d4-4c03-ae85-49c0d90744c4', 'migration_request_id': 900087, 'migration_status': 3, 'migration_url': 'https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader', 'retry_count': 1}] {'create_by': '/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch', 'creation_date': 1657267970, 'last_modification_date': 1657267970, 'last_modified_by': '/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch', 'migration_input': '/VBFHToGG_M124_TuneCP5_13TeV-amcatnlo-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v2/MINIAODSIM#0fcc114a-91d4-4c03-ae85-49c0d90744c4', 'migration_request_id': 900087, 'migration_status': 3, 'migration_url': 'https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader', 'retry_count': 1}

that missing parent LFN is present in global DBS belforte@lxplus8s16/~ dasgoclient --query 'file file=/store/mc/RunIISummer20UL17RECO/VBFHToGG_M124_TuneCP5_13TeV-amcatnlo-pythia8/AODSIM/106X_mc2017_realistic_v6-v3/80000/AF92F330-8A40-5E42-B720-3EA35C4B9746.root' /store/mc/RunIISummer20UL17RECO/VBFHToGG_M124_TuneCP5_13TeV-amcatnlo-pythia8/AODSIM/106X_mc2017_realistic_v6-v3/80000/AF92F330-8A40-5E42-B720-3EA35C4B9746.root belforte@lxplus8s16/~

I am not sure what Message:unable to find parent lfn /store/mc/RunIISummer20UL17RECO/VBFHToGG_M124_TuneCP5_13TeV-amcatnlo-pythia8/AODSIM/106X_mc2017_realistic_v6-v3/80000/AF92F330-8A40-5E42-B720-3EA35C4B9746.root does refer to.

vkuznet commented 2 years ago

Stefano, over weekend I made two changes to new DBS migration service:

I adjusted /remove API to only accept migration request as you reported in #59
I found mis-order of blocks parents and fixed them in new server

I run few tests in my local setup and everything looks fine, except one specific use-case where block does not have parents while its dataset does. Therefore, it would be very helpful for me to know your report about migration requests statuses from CRAB Publisher starting Jul 10th at noon EDT (18:00 CERN time). I would appreciate if you still run double migration in CRAB publisher and let me know if we see migration failure. The weird use-case I suggest separately and I sent corresponding request to Yuyi (you are in CC) to clarify this use-case.

vkuznet commented 2 years ago

Stefano, based on Alan's input about /HZJ_HToWW_M-125_TuneCP5_13TeV-powheg-jhugen727-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v2/MINIAODSIM being in PRODUCTION state. I adjusted DBS Migration code to check dataset status. If it is not VALID then the migration request will be rejected. Here is an example of output you'll get:

[{"error":{
"reason":"dataset /HZJ_HToWW_M-125_TuneCP5_13TeV-powheg-jhugen727-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v2/MINIAODSIM has status PRODUCTION",
"message":"not allowed for migration",
"function":"dbs.migrate.SubmitMigration","code":125},
"http":{"method":"POST","code":400,"timestamp":"2022-07-11 09:14:22.51786 -0400 EDT m=+20.087895772","path":"/dbs2go-migrate/submit","user_agent":"curl/7.84.0",
"x_forwarded_host":"","x_forwarded_for":"",
"remote_addr":"127.0.0.1:63061"},"exception":400,
"type":"HTTPError",
"message":"DBSError Code:125 Description:DBS Migration error Function:dbs.migrate.SubmitMigration Message:not allowed for migration Error: dataset /HZJ_HToWW_M-125_TuneCP5_13TeV-powheg-jhugen727-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v2/MINIAODSIM has status PRODUCTION"}]

I updated dbs migration server to new version which includes this change, therefore we should be covered for this use-case.

belforte commented 2 years ago

thanks !

belforte commented 2 years ago

About https://github.com/dmwm/dbs2go/issues/53#issuecomment-1179758574 I left the code for the additional migration running and there is no new failed migration report. OTOH it only tries migrations which are done in production, which only happens when someone processed a dataset which was never used before. I need to check if there were indeed new migration requests, if not I can look into migrating some datasets myselfs and see how it goes.

We want to have a good validation here. I would not feel confident with only a couple days.

vkuznet commented 2 years ago

that's totally fine by me, since I'll be away from Wed till next week we can use entire week to see how it goes.

vkuznet commented 2 years ago

Stefano, the original issue reported in this ticket is fixed and now in production. Therefore, I'm closing this ticket. We discussed here lots of stuff and I added requested changes to APIs, etc. If you feel otherwise please either open new issue or re-open this one and specify the new issue and adjust title of the ticket accordingly.

dmwm / dbs2go

DBSWriter fails with "fail to inject FL chunk" #53