dmwm / DBS

CMS Dataset Bookkeeping Service
Apache License 2.0

Migration failures #605

Closed belforte closed 4 years ago

belforte commented 5 years ago

Sorry to add more problems. There have been a few persistent failures to migrate datasets from Global to Phys03. While it surely makes sense that migration can fail when Global can't be read, I'd like to see what exactly went wrong, if nothing else to know when to try again.

I looked at the DBSMigration logs via vocms055, but they have no useful information, only a series of time stamps. Even when grepping for a given (failed) migration request ID, I found only one line listing the ID, but no detail.

Is there maybe a verbosity level that could be changed, temporarily?

Examples of datasets which failed to migrate:

/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/AODSIM

/DYJetsToLL_M-50_TuneCUETHS1_13TeV-madgraphMLM-herwigpp/RunIISummer16MiniAODv3-PUMoriond17_94X_mcRun2_asymptotic_v3-v2/MINIAODSIM

belforte commented 5 years ago

this refers to these two user problem reports, there may be more: https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/4856.html https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/4859.html

belforte commented 5 years ago

I was too optimistic. Migrations keep failing even now that DBS is working; how can we debug/solve this? Example (I submitted this half an hour ago or so):

{'migration_status': 9, 'create_by': 'service@vocms0105.cern.ch', 'migration_url': 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader', 'last_modified_by': 'service@vocms0105.cern.ch', 'creation_date': 1557774029, 'retry_count': 3, 'migration_input': '/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/AODSIM#4a4e2ab5-1acc-4fae-9b5a-83c85ae7ffe4', 'migration_request_id': 2738818, 'last_modification_date': 1557775662}
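
For context, a minimal sketch of how such a request is submitted and polled with the dbs3-client (the instance URLs are the ones above; the submitMigration payload and the shape of its return value are assumed from the standard client, not quoted from the Publisher code):

from dbs.apis.dbsClient import DbsApi

GLOBAL_READER = 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader'
PHYS03_MIGRATE = 'https://cmsweb.cern.ch/dbs/prod/phys03/DBSMigrate'

block = ('/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/'
         'RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/'
         'AODSIM#4a4e2ab5-1acc-4fae-9b5a-83c85ae7ffe4')

apiMig = DbsApi(url=PHYS03_MIGRATE)

# submit: the source URL plus the block (or dataset) to copy into phys03
result = apiMig.submitMigration({'migration_url': GLOBAL_READER,
                                 'migration_input': block})
# the request-id field below is assumed; print(result) to see the actual keys
req_id = result['migration_details']['migration_request_id']

# poll: migration_status 9 is the terminal failure shown above
status = apiMig.statusMigration(migration_rqst_id=req_id)[0]
print(req_id, status['migration_status'], status['retry_count'])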

belforte commented 5 years ago

and this is all that the logs which I can find have to say about it:

belforte@vocms055/srv-logs> grep 2738818 */dbsmigration/dbsmigration-20190513.log
vocms0136/dbsmigration/dbsmigration-20190513.log:--------------------getResource--  Mon May 13 19:13:38 2019 Migration request ID: 2738818
vocms0163/dbsmigration/dbsmigration-20190513.log:--------------------getResource--  Mon May 13 19:00:33 2019 Migration request ID: 2738818
vocms0163/dbsmigration/dbsmigration-20190513.log:--------------------getResource--  Mon May 13 19:06:35 2019 Migration request ID: 2738818
vocms0165/dbsmigration/dbsmigration-20190513.log:--------------------getResource--  Mon May 13 19:22:41 2019 Migration request ID: 2738818
yuyiguo commented 5 years ago

Let me check.

yuyiguo commented 5 years ago

It looks to me like it was a timeout. I am trying to re-enable it.

yuyiguo commented 5 years ago

I re-enabled the migration. Let's see if it can be done this time.

belforte commented 5 years ago

Sorry @yuyiguo, I do not know what you mean by re-enable. What I did this morning was to remove the existing failed migration request via the removeMigration API and submit a new one, which eventually failed again. Do you mean that I should submit it again, or did you do this already?
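
For the record, a minimal sketch of that delete + resubmit step; it assumes removeMigration takes the request id in a dict, and the exact payload key is an assumption:

from dbs.apis.dbsClient import DbsApi

apiMig = DbsApi(url='https://cmsweb.cern.ch/dbs/prod/phys03/DBSMigrate')

# drop the terminally failed request (id taken from the example above)
apiMig.removeMigration({'migration_rqst_id': 2738818})

# and submit the same input again
apiMig.submitMigration({
    'migration_url': 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader',
    'migration_input': '/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/'
                       'RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/AODSIM',
})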

belforte commented 5 years ago

Tried another block, failed again:

In [60]: status = apiMig.statusMigration(migration_rqst_id=id)

In [61]: status[0]
Out[61]: {'create_by': 'service@vocms0105.cern.ch', 'creation_date': 1557801649, 'last_modification_date': 1557801952, 'last_modified_by': 'service@vocms0105.cern.ch', 'migration_input': '/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/AODSIM#07f3510e-6c05-4546-9d54-af9f646fc910', 'migration_request_id': 2738841, 'migration_status': 3, 'migration_url': 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader', 'retry_count': 0}

In [62]:

belforte commented 5 years ago

Looking at ASO logs, all migrations are failing today :-( Am checking past days now.

belforte commented 5 years ago

As I wrote to the users, DBS migration failures have been 'unheard of' forever, so we do not have a counter or anything. I have to grep log files like zgrep -i migrat stderr.log-20190506.gz | grep "Migration id" | cut -d: -f 9 | grep terminally. Sorry that I cannot paste here a way for others to reproduce, but here is a count by day of the number of migration requests that ended in a given status (terminal statuses only; I only have logs since Apr 24). A rough counting sketch follows the table.

day       success    fail
Apr 24     73         0
Apr 25    326         0
Apr 26   1280         0
Apr 27     38         0
Apr 28     22         0
Apr 29    139         0
Apr 30    260         1
May  1    191         2
May  2     86         3
May  3    179         4
May  4    201         8
May  5     40         3
May  6    218         3
May  7     94         2
May  8      0        40
May  9      3       392
May 10      0       413
May 11      0       840
May 12      0       326
May 13      0       352
May 14      1       297
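
For reference, a rough Python equivalent of the zgrep pipeline above; the success keyword is a guess, only "terminally" is quoted from the logs in this thread:

import glob
import gzip

for path in sorted(glob.glob('stderr.log-*.gz')):
    ok = failed = 0
    with gzip.open(path, 'rt', errors='replace') as log:
        for line in log:
            if 'Migration id' not in line:
                continue
            if 'terminally' in line.lower():    # terminally failed request
                failed += 1
            elif 'succeed' in line.lower():     # hypothetical success marker
                ok += 1
    print(path, 'success:', ok, 'fail:', failed)
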
belforte commented 5 years ago

I have submitted a new migration request after the rollback, let's see:

In [68]: status = apiMig.statusMigration(migration_rqst_id=id)

In [69]: status[0]
Out[69]: {'create_by': 'service@vocms0105.cern.ch', 'creation_date': 1557837005, 'last_modification_date': 1557837005, 'last_modified_by': 'service@vocms0105.cern.ch', 'migration_input': '/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/AODSIM#e38bd65b-5772-4b06-b0ce-a646d9a93fb5', 'migration_request_id': 2738882, 'migration_status': 0, 'migration_url': 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader', 'retry_count': None}

yuyiguo commented 5 years ago

Sorry Stefano, I did not explain what I meant by "reenabled". I changed the migration status from 9 to 0 in the database, so that the migration requests can be reprocessed by the migration server. Status 9 means permanently failed.
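
As a summary of the status codes mentioned in this thread (0, 3 and 9 are explicit above; the intermediate values follow the usual DBS convention and are assumed here):

MIGRATION_STATUS = {
    0: 'pending, will be picked up by the migration server',
    1: 'in progress',                                    # assumed
    2: 'completed',                                      # assumed
    3: 'failed, will be retried (retry_count up to 3)',
    9: 'terminally failed, no more retries',
}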

belforte commented 5 years ago

OK. Not something I have an API for; anyhow it should be the same as the migration delete plus the new request which I did. Alas, the new request which I posted earlier failed:

In [73]: status[0]
Out[73]: {'create_by': 'service@vocms0105.cern.ch', 'creation_date': 1557837005, 'last_modification_date': 1557839875, 'last_modified_by': 'service@vocms0105.cern.ch', 'migration_input': '/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/AODSIM#e38bd65b-5772-4b06-b0ce-a646d9a93fb5', 'migration_request_id': 2738882, 'migration_status': 3, 'migration_url': 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader', 'retry_count': 1}

We can wait for the 3 retries, but it does not look promising. Back to: are there logs to check anywhere? Maybe I should look at Phys03Writer?

yuyiguo commented 5 years ago

I re-enabled yesterday's migration again. The problem is that the DBS global reader is under high load. When we migrate from global to phys03, the migration server uses the DBS API to read the block from global just like anyone else. If the block is big, it will time out.

yuyiguo commented 5 years ago

It failed again.

belforte commented 5 years ago

I can understand that, but I would like to see confirmation of this timeout in the logs. I can look at the block size, but maybe we need to also include parents. OTOH the load should not be very high now, is it?

yuyiguo commented 5 years ago

The thing we need to figure out is which blockdump call in the dbsGlobalReader logs comes from the phys03 migration server. Let me check.

yuyiguo commented 5 years ago

I am working on it. Please don't restart or delete these two migrations.

belforte commented 5 years ago

OK. thanks.

belforte commented 5 years ago

let me know if I can help

yuyiguo commented 5 years ago

INFO:cherrypy.access:[14/May/2019:16:12:06] vocms0163.cern.ch 2001:1458:201:a8::100:1a9 "GET /dbs/prod/global/DBSReader/blockdump?block_name=%2FHIMinimumBias9%2FHIRun2018A-v1%2FRAW%232ad44aae-509b-4554-be3f-6802f1d66056 HTTP/1.1" 500 Internal Server Error [data: - in 135 out 200100579 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=lijimene/CN=817301/CN=Lina Marcela Jimenez Becerra" "" ] [ref: "" "PycURL/7.19.3 libcurl/7.35.0 OpenSSL/1.0.1r zlib/1.2.8 c-ares/1.10.0" ]

Is this in the time frame when you were testing?

belforte commented 5 years ago

No. This must be a test from Lina.

belforte commented 5 years ago

ah sorry.... ignore. I misread the question.

belforte commented 5 years ago

The last example I posted here has 'create_by': 'service@vocms0105.cern.ch', 'creation_date': 1557837005; that date is Tue May 14 14:30:05 CEST 2019.

There must be a place in hell for those who write timestamps in logs without a timezone.
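
For what it's worth, a small sketch to turn that epoch into unambiguous, timezone-aware times (zoneinfo needs Python 3.9+; pytz works similarly):

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

created = datetime.fromtimestamp(1557837005, tz=timezone.utc)
print(created.isoformat())                            # 2019-05-14T12:30:05+00:00
print(created.astimezone(ZoneInfo('Europe/Zurich')))  # 2019-05-14 14:30:05+02:00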

yuyiguo commented 5 years ago

The 500 error may be related to the redeployment script. See here: https://gitlab.cern.ch/cms-http-group/doc/issues/155#note_2584183

belforte commented 5 years ago

I submitted the migration request well after DBS was restarted this morning.

belforte commented 5 years ago

Migration status from the ASO Publisher today: 0 successes, 129 failures. There were 31 failures 2 hours ago.

belforte commented 5 years ago

OK. I understand now your point about https://gitlab.cern.ch/cms-http-group/doc/issues/155#note_2584183 . Let's see how it goes once that is solved. Thanks

yuyiguo commented 5 years ago

Yes, let's wait for Lina to restart DBS with the new configuration. Then I will re-enable one of the migrations and see if we can get it going.

yuyiguo commented 5 years ago

It did not look good, see below. The migration server got a timeout while downloading the block from the source (DBSGlobalReader). This is after the last restart at 15:00 GMT.

Tue May 14 15:47:32 2019HTTP Error 502: Proxy Error
--------------------getResource--  Tue May 14 15:47:37 2019 Migration request ID: 2738882

Tue May 14 15:57:43 2019HTTP Error 502: Proxy Error
--------------------getResource--  Tue May 14 15:57:48 2019 Migration request ID: 2738882

Tue May 14 16:07:56 2019HTTP Error 502: Proxy Error
--------------------getResource--  Tue May 14 16:08:01 2019 Migration request ID: 2738882

Tue May 14 16:18:07 2019HTTP Error 502: Proxy Error
--------------------getResource--  Tue May 14 16:18:12 2019 Migration request ID: 2738882
vkuznet commented 5 years ago

How big is the block? Can you manually query for it from the DBS client and see if you get the info and/or a timeout? Does the timeout happen because the size of the block is too large, or is it part of the problems we are investigating?

yuyiguo commented 5 years ago

Valentin, I manually ran blockdump on it and got a 502 error too. The block size is 515561270, not sure of the unit, I guess it is bytes. I am searching the front-end logs and will report back.

yuyiguo commented 5 years ago

I could not log in to the FEs. I will check it on vocms055 after 30 minutes.

yuyiguo commented 5 years ago
[yuyi@vocms055 srv-logs 1030] grep blockdump  */frontend/access_log_20190514.txt | grep yuyi
vocms0158/frontend/access_log_20190514.txt:[14/May/2019:19:22:08 +0200] cmsweb.cern.ch 2001:1458:201:e4::100:6a1 "GET /dbs/prod/global/DBSReader//blockdump?block_name=%2FNonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8%2FRunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1%2FAODSIM%23e38bd65b-5772-4b06-b0ce-a646d9a93fb5 HTTP/1.1" 502 [data: 7202 in 15984 out 479 body 300020478 us ] [auth: TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=yuyi/CN=639751/CN=Yuyi Guo/CN=proxy" "-" ] [ref: "-" "DBSClient/3.8.1.1/" ]

502 [data: 7202 in 15984 out 479 body 300020478 us ]: what is "479 body"? The database records the block as 515561270, but here only 15984 out; how do we explain that?

belforte commented 5 years ago

It is hard to imagine that all of a sudden all the blocks that users try to migrate have gotten over some size threshold. 50MB for a block does sound "offensive" though. OTOH the DBS Reader servers keep suffering.

vkuznet commented 5 years ago

Yuyi, I have mentioned this already many times. The apache log info comes from [1], and according to the apache log config [2], the value in front of "body" is the size of the response in bytes, excluding HTTP headers.

Best, Valentin.

[1] https://github.com/dmwm/apache-conf/blob/master/mkserver#L262 [2] https://httpd.apache.org/docs/2.4/mod/mod_log_config.html
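
For illustration, a small sketch that pulls those [data: ...] fields out of the access-log line quoted above; note that 300020478 us is essentially the 300 s (5 min) front-end timeout:

import re

line = '... 502 [data: 7202 in 15984 out 479 body 300020478 us ] ...'
m = re.search(r'\[data: (\d+) in (\d+) out (\d+) body (\d+) us \]', line)
bytes_in, bytes_out, body, micros = map(int, m.groups())
print('request bytes in  :', bytes_in)
print('response bytes out:', bytes_out, '(body only:', body, 'bytes)')
print('elapsed           : %.1f s' % (micros / 1e6))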

belforte commented 5 years ago

Something like the connection timed out when only 15KB out of 50MB had been sent to the client?

belforte commented 5 years ago

It should be possible to send 50MB to another service running in the same cluster at CERN within 5 minutes, though.

yuyiguo commented 5 years ago

Thanks Valentin.

It is really odd that only 479 bytes of client data were sent out. Where was the time spent? From Stefano's crab log analysis we know this started right after May 7th, but we already rolled back DBS.

belforte commented 5 years ago

So the blockdump query has 14 KBytes IN??? It should be only the block name!

belforte commented 5 years ago

I looked up that block via DAS https://cmsweb.cern.ch/das/request?view=list&limit=50&instance=prod%2Fglobal&input=block+block%3D%2FNonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8%2FRunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1%2FAODSIM%23e38bd65b-5772-4b06-b0ce-a646d9a93fb5

It has only one file, which has only two parents: https://cmsweb.cern.ch/das/request?instance=prod/global&input=parent+file%3D%2Fstore%2Fmc%2FRunIILowPUAutumn18DR%2FNonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8%2FAODSIM%2F102X_upgrade2018_realistic_v15-v1%2F60000%2F90DA116C-6387-504E-B922-7BA697DF2BC8.root How can it have a 50MB blockdump? I never used the blockdump API, so I am not sure what to expect. Those two files are GEN-SIM so should have no further parentage. The number of lumis in those two files also looks very sane, e.g. https://cmsweb.cern.ch/das/request?instance=prod/global&input=lumi+file%3D%2Fstore%2Fmc%2FRunIILowPUAutumn18GS%2FNonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8%2FGEN-SIM%2F102X_upgrade2018_realistic_v11-v1%2F60000%2F459B92E8-3A71-1349-A5C6-9DB374CD2D64.root
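
As a cross-check, the same sanity checks can be done through the DBS reader API instead of DAS; a sketch below, where apiG is a DbsApi pointing at prod/global (as in the session further down) and the listFileParents/listFileLumis argument names are the usual client ones, assumed here:

from dbs.apis.dbsClient import DbsApi

apiG = DbsApi(url='https://cmsweb.cern.ch/dbs/prod/global/DBSReader')
b = ('/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/'
     'RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/'
     'AODSIM#e38bd65b-5772-4b06-b0ce-a646d9a93fb5')

files = apiG.listFiles(block_name=b)          # expect a single file for this block
for f in files:
    lfn = f['logical_file_name']
    parents = apiG.listFileParents(logical_file_name=lfn)
    lumis = apiG.listFileLumis(logical_file_name=lfn)
    print(lfn, len(parents), 'parent entries,', len(lumis), 'lumi entries')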

belforte commented 5 years ago

Since I couldn't believe my eyes, I decided to try myself, and sure enough:

In [27]: b
Out[27]: '/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/AODSIM#e38bd65b-5772-4b06-b0ce-a646d9a93fb5'

In [28]: dump=apiG.blockDump(block_name=b)

gets stuck until HTTPError: HTTP Error 502: Proxy Error

I am watching Grafana in case I trigger some wider damage...

belforte commented 5 years ago

For fun I tried a completely different dataset, the small relval-like one used for HammerCloud:

In [30]: dsHC='/GenericTTbar/HC-CMSSW_9_2_6_91X_mcRun1_realistic_v2-v2/AODSIM'

In [31]: blHC=[]

In [32]: for dic in apiG.listBlocks(dataset=dsHC):
   ....:     blHC.append(dic['block_name'])
   ....:

In [33]: len(blHC)
Out[33]: 2

In [34]: blHC[0]
Out[34]: '/GenericTTbar/HC-CMSSW_9_2_6_91X_mcRun1_realistic_v2-v2/AODSIM#3517e1b6-76e3-11e7-a0c8-02163e00d7b3'

In [35]: dump=apiG.blockDump(block_name=blHC[0])

same result: HTTPError: HTTP Error 502: Proxy Error

I'd rather not try more blocks! But this does not look good at all. DBS is otherwise performing fine, and other queries like listing blocks, files, etc. work great.

If anything, I can see some blips in the FDs and connections plots more or less at the time of my tests, but it may well be fully unrelated.

Did something change inside this blockDump ?

belforte commented 5 years ago

Is anybody but the migration server using blockDump?

yuyiguo commented 5 years ago

blockDump is only used by migration. This API has not been changed for years.

belforte commented 5 years ago

Yet, it does not work anymore. At this point it is hard to think that every time we try, it times out because something else is keeping the server stuck; it is more likely the other way around. We need to find out what has changed.

Is there confusion in the DB? How did you come up with 50MB for that block with just one file?

yuyiguo commented 5 years ago

You meant "The block size is 515561270 not sure the unit, I guess is byte." ? This is ~500MB. It is from DBS block table.

belforte commented 5 years ago

Oops, I can't count. So we have no reason for this timing out :-(

vkuznet commented 5 years ago

Well, can we identify this block and reproduce the DBS behavior in a VM?

Here are two scenarios:

In either case I suggest taking one example and providing benchmark numbers in order to decide what to do. I doubt that we should increase the 5-minute timeout on the FE since it is already TOO MUCH for any HTTP server. And if such blocks cause a DBS RAM ramp-up, we should revisit the data format and block management to address this issue.
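
A possible way to get such benchmark numbers: time a single blockDump of the problematic block against a test instance (the URL below is a placeholder, not a real endpoint):

import time
from dbs.apis.dbsClient import DbsApi

api = DbsApi(url='https://SOME-TEST-VM.cern.ch/dbs/int/global/DBSReader')  # placeholder
b = ('/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/'
     'RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/'
     'AODSIM#e38bd65b-5772-4b06-b0ce-a646d9a93fb5')

t0 = time.time()
try:
    dump = api.blockDump(block_name=b)
    # 'files' is the expected key in the returned dict, hence the .get()
    print('blockDump took %.1f s, %d files' % (time.time() - t0, len(dump.get('files', []))))
except Exception as exc:
    print('blockDump failed after %.1f s: %s' % (time.time() - t0, exc))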

yuyiguo commented 5 years ago

Stefano, Valentin,

Looking into the fds plot, it was stable after Stefano turned off the migration yesterday around 4pm FNAL time (my estimate). CRAB may be the only one using the migration.