Closed: belforte closed this issue 4 years ago.
Sure bet! Who else would migrate datasets from Global to Phys03? (Hey, it was not my idea to begin with!)
On 15/05/2019 07:58, Yuyi Guo wrote:
CRAB may be the only one using the migration.
I turned it off right before I posted the elog and sent mail, about 6:30 pm FNAL time. Of course there could have been little to no migration attempts in the previous hours. As reported, there were hundreds of failed migration requests, but only a handful of DBS crashes. It is not as direct as 'one call and the server dies'!
On 15/05/2019 07:58, Yuyi Guo wrote:
Stefano turned off the migration yesterday around 4 pm FNAL time.
I could not reproduce the blockDump timeout using my VM test stand. I am going to connect my VM to the DBS prod global DB in order to reproduce the 502 error. It is kind of risky, but there is no good way around it. What do you think? @belforte @amaltaro
Apologies for not reading the whole thread. Do we have that data in testbed? AFAIK the preprod DBS instance (int) data was replaced by a full dump of the production one, and it wasn't too long ago (maybe half a year). Otherwise the safest would be to get a dump of prod and try to fit it into your dev Oracle instance.
Connecting a VM to the production database would be the last resort IMO. You know it better than anyone, but I'll say it anyway :-) If you do so, make sure to disable any possible cronjobs etc. that might write to DBS :-)
int hasn't been updated from prod for two years or so. I asked the DBAs to import from prod two weeks ago, but we put it on hold because the import may take a day or longer and there is no guarantee all the permissions will work the right way.
I would only connect the reader to prod, but I am afraid that the dump could slow down dataset access and cause another storm. Let me see if I can find a big block in int.
Don't be too shy Yuyi, we have been making this blockDump request to the DB many times already! Go for it. OTOH I really do not get why you say it is a big block. If you refer to the one I posted, it is one file, two parent files, less than 100 lumis! AFAIK ALL blocks fail to blockDump; I tried one from the HammerCloud dataset, and that also fails. That the query works on your test VM with your test DB was clearly expected. We need to find what's different in production lately; I would be surprised if your VM works when connected to Oracle. We may need to add diagnostic printout in the DBS server in CMSWEB... but let's hope I am wrong.
@yuyiguo just to check, have you deployed exactly the same version that is now in CMSWEB? Just to make sure you have the same set of dependencies as in CMSWEB.
@belforte @amaltaro My VM has a slightly different version than what is in production. Maybe I should redeploy it with 3.7.8? I used the int DB with a block: /TestRepacker50Streams-A/Online/RAW#1712d5d4-d03f-43d1-a187-2179a25ac75c, size 117175781694389 and file_count 31999. I got a 502 error.
When I blockdumped a small block generated by my unit tests, it was successful. That is why I want to try the exact block that failed during the migration.
OTOH, if you can tell exactly what blockDump is doing, maybe you can figure out the information that it is fetching from Oracle, get it sort of 'one piece at a time' via the current APIs, and rebuild the exact payload, in case the problem is payload size. Sorry that I can't say more; until yesterday I did not even know that blockDump existed.
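To make the 'one piece at a time' idea concrete, here is a minimal sketch assuming the standard DBS3 Python client; the list-API names are my best guess at the relevant calls and may differ slightly from what blockDump actually assembles server-side, and the block name is a placeholder.

```python
# Sketch: fetch the pieces a block dump would contain, one list API at a time,
# instead of the single blockdump call. Block name below is a placeholder.
from dbs.apis.dbsClient import DbsApi

url = 'https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader'
block = '/Primary/Processed/TIER#block-uuid'   # placeholder: put the failing block here

api = DbsApi(url=url)

payload = {
    'block': api.listBlocks(block_name=block, detail=True),      # block metadata
    'files': api.listFiles(block_name=block, detail=True),       # per-file details
    'file_lumis': api.listFileLumis(block_name=block),           # run/lumi lists
    'file_parents': api.listFileParents(block_name=block),       # parentage
}

# Compare the size of each piece to see whether payload size is the issue.
for key, val in payload.items():
    print('%-13s %6d entries' % (key, len(val)))
```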
size 11717578169438 ??? What are those units? Even if it were bits it makes little sense.
That is in the block table. I guess it is bytes. This is the real block size, not the DBS dump size. It was last modified by /DC=ch/DC=cern/OU=computers/CN=tier0/lxgate39.cern.ch in 2009.
Again, here's a block which fails to dump and is small by all the metrics I can think of: https://github.com/dmwm/DBS/issues/605#issuecomment-492425495 Why do you keep saying that there is a 50 MB (500 MB?) monster around? What is blockDump doing? Does it look up the whole dataset? Does it try to transfer the full Oracle table?
This is the real block size, not the DBS dump size.
So you mean the sum of the sizes of all files in the block. That's fine to be in TBs, and it is totally irrelevant: just one number in the DB!
blockdump dumps a block. That includes the complete information needed if I have to reinsert it back into DBS, for example the dataset name and the files in the block. A big block size usually means more files and lumis.
Indeed 30k files may be a lot for a block. But we are not looking for pathological cases here, which may very well result in a timeout. We are looking for something that makes things fail for normal queries. I'd like you to check the block https://github.com/dmwm/DBS/issues/605#issuecomment-492425495 which only has 2 files. Is that possible?
To be clear: I'd love you to connect your VM to the Oracle DBSReader and try that.
Yes. I tried that and got the whole block dumped quickly. I used block_name="/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/AODSIM#e38bd65b-5772-4b06-b0ce-a646d9a93fb5", which was the first block that needed to be migrated in the two examples of failed CRAB migrations.
Now I have switched to the DBS global reader to dump the same block and it is still working on it.
OK, just done: HTTP Error 502: Proxy Error. I am going to wipe my VM and install 3.7.8 as in the production rollback. But my feeling is that it will be the same as what I see now.
Sigh... so we know that it is not as simple as a code bug, which was known in any case since, as you point out, the code did not change in years.
Can you find the query that just failed somewhere in the server logs, somehow, and tell whether it was stuck in there, communicating with the FE, or did not even reach the server? I am not sure 502 is a timeout. It seems the server replied with some error.
did you try cmsweb-testbed ?
I did not try cmsweb-testbed.
I picked the first block of the first MINIAOD dataset in testbed. And... miracle, I got the 502 again. I think this is great: we can log on testbed, hack code in vivo and add diagnostics at leisure!
In [31]: apiT.serverinfo()
Out[31]: {'dbs_instance': 'int/global', 'dbs_version': '3.7.8-comp2'}
In [29]: block
Out[29]: '/JetHT/CMSSW_7_3_0_pre3-GR_R_73_V0A_RelVal_jet2012D-v1/MINIAOD#f8092cc8-7684-11e4-86df-003048c9c3fe'
HTTPError Traceback (most recent call last)
OTOH, if you reproduce this with 3.7.8 on your VM, even better.
@belforte When you tested on cmsweb-testbed, what was your DBS URL? If it was the prod global reader, then we have not reproduced the problem yet.
In [35]: apiT.__dict__
Out[35]:
{'cert': None,
 'http_response': <RestClient.RequestHandling.HTTPResponse.HTTPResponse at 0x7f6ad4416b50>,
 'key': None,
 'proxy': None,
 'rest_api': <RestClient.RestApi.RestApi at 0x7f6ad46b6790>,
 'url': 'https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader',
 'userAgent': ''}
IIUC this is the GlobalReader DBS server in cmsweb-testbed, connected to the int instance of the Oracle DB.
I listed all MINIAODs and picked the first in the list, '/JetHT/CMSSW_7_3_0_pre3-GR_R_73_V0A_RelVal_jet2012D-v1/MINIAOD'; it has two blocks and I picked the first one.
I could not reproduce the blockdump timeout problem.
I redeployed my VM with DBS 3.7.8 as built with HG1902a on Jan 22. I did not rebuild it because it is in the repository already and it is what I used for my validation before the Feb release. This is the closest version to what we used in prod before the May release.
In order to test the same blocks that failed in the migration, I connected my VM's reader to the prod global database. Then I ran the blockdump API. The output is in the attached file. The first line is the time used to run the API. The first block took about 0.09s; this may be because I had run the same block a few times. The second block was the same block that Stefano reported as failing on cmsweb-testbed. It took 1.17s.
However, if I connect to the prod global server (https://cmsweb.cern.ch/dbs/prod/global/DBSReader/) with the same client to blockdump these two blocks, right after doing it successfully with my VM server, I get a 502 error.
So to me it looks like there may be something wrong or misconfigured in cmsweb.
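For reference, this is roughly the kind of timing test described above; a minimal sketch assuming the standard DBS3 Python client (DbsApi), using one of the blocks mentioned in this thread. Point the URL at whichever DBSReader instance is under test (VM, testbed or prod); the 'files' key is assumed to be part of the returned dump.

```python
# Time a single blockDump call against a chosen DBSReader instance.
import time
from dbs.apis.dbsClient import DbsApi

api = DbsApi(url='https://cmsweb.cern.ch/dbs/prod/global/DBSReader')
block = ('/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/'
         'RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/'
         'AODSIM#e38bd65b-5772-4b06-b0ce-a646d9a93fb5')

start = time.time()
dump = api.blockDump(block_name=block)   # raises if the frontend answers 502
elapsed = time.time() - start
print('blockDump took %.2fs, %d files' % (elapsed, len(dump.get('files', []))))
```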
It makes a lot of sense to me that this is not a clean DBS code bug but something in the interaction of BackEnd, FrontEnd and the specifics of the CMSWEB deployment. I'd say we are lucky that the problem is reproducible in cmsweb-testbed; let's tear it apart and find out.
side note: indeed the format of the list of lumis in those blockdumps is horribly verbose, but we can worry about that later.
Yuyi, are you following up in cmsweb testbed, or do you think that we need to put someone else on this now? I found my query in the testbed FE [1] but I do not find the same query in the DBS logs, even if I see the other ones which I made. Maybe we can hack DBS there to add some logging? Or turn off other services there to look for interferences?
[1] [15/May/2019:22:16:26 +0200] cmsweb-testbed.cern.ch 188.184.30.30 "GET /dbs/int/global/DBSReader/blockdump?block_name=%2FJetHT%2FCMSSW_7_3_0_pre3-GR_R_73_V0A_RelVal_jet2012D-v1%2FMINIAOD%23f8092cc8-7684-11e4-86df-003048c9c3fe HTTP/1.1" 502 [data: 6933 in 15988 out 475 body 300099468 us ] [auth: TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 "/DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Stefano Belforte belforte@infn.it/CN=333465570" "-" ] [ref: "-" "DBSClient/Unknown/" ]
Yuyi, there is a big difference in the blockdumps:
-rw-r--r--@ 1 vk staff 47K May 15 17:43 /Users/vk/Downloads/blockdump.txt
-rw-r--r--@ 1 vk staff 125K May 15 17:43 /Users/vk/Downloads/blockdump2.txt
Can you dump it as JSON, i.e. instead of print(doc) do print(json.dumps(doc))?
Then we can try to load it into Python to check the memory allocation.
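A rough sketch of that check, assuming the standard DBS3 Python client; the output file name and the particular size metrics printed are illustrative only.

```python
# Dump the block to JSON instead of the python repr, then load it back and
# look at how big the resulting structure is.
import json
from dbs.apis.dbsClient import DbsApi

api = DbsApi(url='https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader')
doc = api.blockDump(block_name='/JetHT/CMSSW_7_3_0_pre3-GR_R_73_V0A_RelVal_jet2012D-v1'
                               '/MINIAOD#f8092cc8-7684-11e4-86df-003048c9c3fe')

with open('blockdump.json', 'w') as fd:
    json.dump(doc, fd)              # instead of print(doc)

with open('blockdump.json') as fd:
    loaded = json.load(fd)

print('serialized size : %d bytes' % len(json.dumps(loaded)))
print('top-level keys  :', sorted(loaded))
```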
@vkuznet IMHO you are off-track
maybe also cut down testbed to a single FE and BE server, to make it easier to know which log file to look at ?
Anyhow, someone has to dissect cmsweb(-testbed) to find out what's going on. I have not heard of any volunteer. Waiting for Lina will not do. Who is in charge ?
@yuyiguo looking in the testbed logs, I found that at one time blockdump worked for you [1] in the server, yet all the calls logged in the FE have HTTP error 502 [2] and are from several hours earlier. What did you do on May 13 at 20:35 to bypass the FE and get a successful call?
[1] vocms0132/dbs/DBSGlobalReader-20190513.log:INFO:cherrypy.access:[13/May/2019:20:35:26] vocms0132.cern.ch 188.184.94.167 "GET /dbs/int/global/DBSReader/blockdump?block_name=%2Funittest_web_primary_ds_name_63599%2Facq_era_63599-v3605%2FGEN-SIM-RAW%2363599 HTTP/1.1" 200 OK [data: - in 13110 out 20179768500 us ] [auth: OK "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=yuyi/CN=639751/CN=Yuyi Guo" "" ] [ref: "" "DBSClient/3.8.1.1/" ]
[2] belforte@vocms055/srv-logs> grep blockdump vocms0[7,1]3[4,5]/frontend/*|grep unittest
grep: vocms0135/frontend/follow-up-cric: Is a directory
grep: vocms0734/frontend/access_log_201808: Is a directory
vocms0135/frontend/access_log_20190513.txt:[13/May/2019:14:59:06 +0200] cmsweb-testbed.cern.ch 188.184.94.167 "GET /dbs/int/global/DBSReader/blockdump?block_name=%2Funittest_web_primary_ds_name_63599%2Facq_era_63599-v3605%2FGEN-SIM-RAW%2363599 HTTP/1.1" 502 [data: 7144 in 15988 out 475 body 300102039 us ] [auth: TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=yuyi/CN=639751/CN=Yuyi Guo/CN=proxy" "-" ] [ref: "-" "DBSClient/3.8.1.1/" ]
vocms0135/frontend/access_log_20190513.txt:[13/May/2019:16:03:44 +0200] cmsweb-testbed.cern.ch 188.184.94.167 "GET /dbs/int/global/DBSWriter/blockdump?block_name=%2Funittest_web_primary_ds_name_49393%2FAcq_Era_49393-unittest_web_dataset-v558%2FRECO%237c16ee0e-98d4-4f1e-8e49-8dd380431fa4 HTTP/1.1" 502 [data: 376 in 680 out 475 body 300095185 us ] [auth: TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=yuyi/CN=639751/CN=Yuyi Guo/CN=proxy" "-" ] [ref: "-" "DBSClient/3.8.1.1/" ]
vocms0135/frontend/access_log_20190513.txt:[13/May/2019:16:10:51 +0200] cmsweb-testbed.cern.ch 188.184.94.167 "GET /dbs/int/global/DBSWriter/blockdump?block_name=%2Funittest_web_primary_ds_name_11465%2FAcq_Era_11465-unittest_web_dataset-v780%2FSIM%2319ed75f1-44bc-4801-9e83-05084e56d7bc HTTP/1.1" 502 [data: 375 in 680 out 475 body 300101539 us ] [auth: TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=yuyi/CN=639751/CN=Yuyi Guo/CN=proxy" "-" ] [ref: "-" "DBSClient/3.8.1.1/" ]
grep: vocms0734/frontend/error_log_201808: Is a directory
vocms0734/frontend/access_log_20190513.txt:[13/May/2019:15:05:28 +0200] cmsweb-testbed.cern.ch 188.184.94.167 "GET /dbs/int/global/DBSWriter/blockdump?block_name=%2Funittest_web_primary_ds_name_7987%2FAcq_Era_7987-unittest_web_dataset-v400%2FSIM%23180077ac-e958-49f6-b881-ea6f18ad4fb0 HTTP/1.1" 502 [data: 373 in 680 out 475 body 300023995 us ] [auth: TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=yuyi/CN=639751/CN=Yuyi Guo/CN=proxy" "-" ] [ref: "-" "DBSClient/3.8.1.1/" ]
vocms0734/frontend/access_log_20190513.txt:[13/May/2019:15:55:48 +0200] cmsweb-testbed.cern.ch 188.184.94.167 "GET /dbs/int/global/DBSWriter/blockdump?block_name=%2Funittest_web_primary_ds_name_29804%2FAcq_Era_29804-unittest_web_dataset-v848%2FRAW%234171a642-e2b9-4ff5-adf8-85960202c5e1 HTTP/1.1" 502 [data: 375 in 680 out 475 body 300110965 us ] [auth: TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=yuyi/CN=639751/CN=Yuyi Guo/CN=proxy" "-" ] [ref: "-" "DBSClient/3.8.1.1/" ]
belforte@vocms055/srv-logs>
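Side note on those FE lines: the trailing "NNN us" inside the [data: ...] field appears to be the total request time in microseconds, and every 502 line above sits right around 300 s, which smells like a frontend proxy timeout rather than a genuine server reply. A small, hedged helper to pull that field out of an access log; the log format is assumed only from the lines quoted here.

```python
# Print request path, HTTP status and elapsed time for blockdump calls found
# in a frontend access log. Usage: python parse_fe_log.py access_log_20190513.txt
import re
import sys

PATTERN = re.compile(r'"GET (\S+) .*?" (\d{3}) .*?(\d+) us \]')

with open(sys.argv[1]) as fd:
    for line in fd:
        match = PATTERN.search(line)
        if match and 'blockdump' in match.group(1):
            path, status, usecs = match.group(1), match.group(2), int(match.group(3))
            print('%s  status=%s  %.1f s' % (path.split('?')[0], status, usecs / 1e6))
```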
By the way, it is not that that block is magic; I just tried to call blockDump on it and it returns 502 after a long wait, as expected. Also, why do we get 502 and not 504 Gateway Timeout?
I redeployed my VM with DBS 3.7.8 as built with HG1902a on Jan 22.
This is not the same environment. As I said before, using any older tag is not going to bring all the dependencies we have right now in cmsweb production. One of those is a new version of SQLAlchemy, which might be playing a trick here. Please deploy HG1905i from the official comp repository.
@belforte I don't remember what I did on May 13. Sorry. Let's see what lets us reproduce the problem in a controlled env.
@vkuznet Valentin, these are very small blocks. There was no memory problem. Let's first find out why we cannot even dump the small blocks.
@amaltaro @h4d4 Thanks Alan for the info. When did we start using the new version of SQLAlchemy? It could be that I missed the announcement. The versions of SQLAlchemy and CherryPy have had a lot of impact on DBS; we have seen this before.
I want to reproduce what we had in the Feb release because all the problems started after the May release. This is also why I did not understand why we wanted to rebuild everything and use the new deployment script when we rolled back DBS. To me, we should use exactly what we used for the Feb release as a rollback.
Why can't we use cmsweb-testbed for this, since it is the same setup as production and reproduces the problem? Just warn people to stay off, log in, and hack the DBS code in /data/srv/.../lib/... to add diagnostics to the log.
I am sure the "rest of CMS" can live w/o testbed for a day!
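Purely as an illustration of the kind of in-vivo diagnostic being proposed (the function names at the bottom are placeholders, not actual DBS server code), one could wrap the suspect steps of the block-dump handler with a timing logger so the server log shows where the 300 seconds go:

```python
# Generic timing decorator; drop it next to the handler code and wrap the
# steps you suspect (DB query, payload serialization, ...).
import logging
import time
from functools import wraps

logger = logging.getLogger('dbs.blockdump.debug')

def timed(step_name):
    """Log how long the wrapped step takes."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                logger.info('%s took %.2fs', step_name, time.time() - start)
        return wrapper
    return decorator

# Hypothetical usage, wrapping two separate steps:
#   list_block_files  = timed('oracle blockdump query')(list_block_files)
#   serialize_payload = timed('payload serialization')(serialize_payload)
```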
The new SQLAlchemy version was introduced in the May production release :-) This upgrade was communicated through this web interface thread: https://hypernews.cern.ch/HyperNews/CMS/get/webInterfaces/1661.html
Yes, the rollback only rolls back the service version itself, not the whole stack of software.
While I write this comment...
log in, and hack the DBS code in /data/srv/.../lib/... to add diagnostics to the log.
and this is precisely why I don't like the idea of developers having full access to the CMSWEB services, even if it's pre-prod. It's NOT a "free-for-all" environment where you do your stuff and test as you like. It's an environment primarily for Lina to test CMSWEB changes, then for developers to test their services. If you want to change the testbed code, you'd better make a PR and ask Lina to apply it. Or, if you apply it yourself, make sure you know exactly what changes were made and can easily revert them. Otherwise we will start bugging Lina with "redeploy testbed because I messed up DBS" and so on.
Why do I have to be grumpy about it... (!)
@amaltaro I do not know why you have to be grumpy. It is good to have rules and follow them, but extraordinary problems call for extraordinary measures. I am not saying "do this whenever you have an itch". But Lina will do a new testbed deployment in a week or so for the next dev cycle anyhow; CMS can live w/o testbed until then. Maybe you would look at things differently if it were production being affected instead of CRAB publication? Which exact activity will suffer if we take over testbed today and give it back to Lina on Monday for a full reset? Anyhow, I agree that if the problem can be reproduced with all services in a single VM, so that we know the details of the cmsweb config are not relevant, so much the better. But if we are looking at something in the FE/BE interaction, we may need the testbed.
That's a valid point, Yuyi! Honestly speaking, I'm running different types of WMCore tests against testbed on almost a daily basis. If we have an issue with the production system, I need to create a fix and properly test it in the testbed setup, so sometimes I cannot afford more than an afternoon of testbed unavailability.
Having said that, if it's well coordinated and communicated, I think the WMCore team can survive a few days without testbed. But we need to have a well defined time window for such intervention.
And I just remembered that CMS@Home / Opportunistic computing also relies on cmsweb-testbed, that's why we should of course avoid breaking it as much as possible.
@amaltaro @h4d4 @belforte
Alan, I definitely missed the email about SQLAlchemy. When core software needs updating, we need to have everyone on board; email may not be enough. I am not blaming anyone for the missed communication, but it is a lesson learned for the future. Maybe Lina could put an announcement/reminder of this kind of change in the release notice/schedule, as she does for every CMSWEB release.
Alan, Lina: I really disagree that rollback means "the rollback only rolls back the service version itself, not the whole stack of software". To me, rollback means that we bring the entire DBS system back to the previous DBS, the one before the recent release. Only this way could we understand whether the problem was due to the new deployment or to outside changes, such as user access pattern changes. I asked more than once what the difference was between HG1902 and HG1905. I was assured that the only difference was where we got the tnsnames.ora, nothing else. That is why I did not understand why we still saw the exact same problem after the rollback. I was thinking that a change in usage pattern caused the problem. When Stefano stopped the CRAB migration and we found that all blockdumps got the 502 error, my understanding of the problem shifted.
Stefano, I would like to do this on my VM because that way I could understand what was different between the Feb and May releases, so we understand the problem better. I may have to run a server-side debugger in order to trace the code line by line; I could not do this on cmsweb-testbed. Yuyi
OK, going to wipe my VM and redeploy it with HG1905i. Will report back soon.
@yuyiguo @amaltaro @belforte Yuyi,
If you need to test tag HG1902 on testbed, I think you could test it between today and Monday. The reason I have to build a completely new tag is that a tag includes not only the DBS specs but the specs of all cmsweb services, so I have to rebuild it. Please let me know and I can deploy that tag in testbed. Just keep in mind that the other services in testbed will then be at the Feb version. Just let me know.
Sorry to add more problems. There have been a few, persistent, failures to migrate datasets from Global to Phys03. While it surely makes sense that migration can fail when Global can't be read, I'd like to see exactly what went wrong, if nothing else to know when to try again.
I looked at the DBS migration logs via vocms055, but they have no useful information, only a series of timestamps. Even when grepping for a given (failed) migration request id, I found only one line listing the id, but no detail.
Is there maybe a verbosity level that could be changed, temporarily?
examples of Dataset which failed to migrate:
/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/AODSIM
/DYJetsToLL_M-50_TuneCUETHS1_13TeV-madgraphMLM-herwigpp/RunIISummer16MiniAODv3-PUMoriond17_94X_mcRun2_asymptotic_v3-v2/MINIAODSIM