dmwm / dbs2go

DBS server written in Go
MIT License

WMAgent fails to inject large block into DBS #67

Closed (amaltaro closed this issue 2 years ago)

amaltaro commented 2 years ago

While looking into the logs for one of the WMAgents being drained, I noticed that a few blocks are failing to be inserted into DBS. The error message is [1], so apparently it fails while inserting the file/lumi information.

I got a JSON dump for one of those blocks and it's pretty big, with roughly 200k lumis. I am sharing that dump here to make your debugging easier: https://amaltaro.web.cern.ch/amaltaro/forValentin/f0018dfc-3045-4923-b30b-e427f5384066.json

@vkuznet could you please have a look into this? FYI @todor-ivanov

[1]

2022-07-19 12:27:15,469:140298175842048:ERROR:DBSUploadPoller:Error trying to process block /SingleNeutrino/RunIISummer20UL18MiniAODv2-EpsilonPU_106X_upgrade2018_realistic_v16_L1v1-v2/MINIAODSIM#f0018dfc-3045-4923-b30b-e427f5384066 through DBS. Error: [{"error":{"reason":"DBSError Code:110 Description:DBS DB insert record error Function:dbs.filelumis.insertFLChunk Message:fail to inject FL chunk Error: nil","message":"20f4620ad53d7b12eccd682182c44fe6faa1baeba40d3822132595e5a3798a22 unable to insert FileLumis records for /store/mc/RunIISummer20UL18MiniAODv2/SingleNeutrino/MINIAODSIM/EpsilonPU_106X_upgrade2018_realistic_v16_L1v1-v2/70000/3B9B1088-97D6-D048-9CF7-91D5EE5624B8.root, fileID 1603358597, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.filelumis.insertFLChunk Message:fail to inject FL chunk Error: nil","function":"dbs.bulkblocks.InsertBulkBlocksConcurrently","code":110},"http":{"method":"POST","code":400,"timestamp":"2022-07-19 17:27:14.997530162 +0000 UTC m=+1824778.380206942","path":"/dbs/prod/global/DBSWriter/bulkblocks","user_agent":"DBSClient/Unknown/","x_forwarded_host":"dbs-prod.cern.ch","x_forwarded_for":"188.184.31.220","remote_addr":"10.100.168.64:60968"},"exception":400,"type":"HTTPError","message":"DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:20f4620ad53d7b12eccd682182c44fe6faa1baeba40d3822132595e5a3798a22 unable to insert FileLumis records for /store/mc/RunIISummer20UL18MiniAODv2/SingleNeutrino/MINIAODSIM/EpsilonPU_106X_upgrade2018_realistic_v16_L1v1-v2/70000/3B9B1088-97D6-D048-9CF7-91D5EE5624B8.root, fileID 1603358597, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.filelumis.insertFLChunk Message:fail to inject FL chunk Error: nil Error: nested DBSError Code:110 Description:DBS DB insert record error Function:dbs.filelumis.insertFLChunk Message:fail to inject FL chunk Error: nil"}]
vkuznet commented 2 years ago

Alan, your input is not sufficient to debug the issue. Please provide the input JSON so that I can try it out (since it is a big one, gzip it and put it somewhere on lxplus where I can access it, i.e. do not post it here). Does the failure happen because the JSON was malformed, because of a timeout, or something else? Once I see your JSON input I can try to inject it into my dev DB and see the actual error.

amaltaro commented 2 years ago

Sorry Valentin, I forgot to post a link to the JSON dump. I have now updated the initial description with it.

UPDATE: it takes around 1 minute to process that request before it fails.

belforte commented 2 years ago

This seems to be the known FL chunk error discussed in https://github.com/dmwm/dbs2go/issues/53. To the best of my knowledge the fix has not been deployed in production yet, but it appears fully solved in the new code.

vkuznet commented 2 years ago

Alan, even though it looks like a duplicate of #53, the actual issue here is an FE timeout. The request to insert 200K lumis takes more than 5 minutes in ORACLE on the testbed cluster (int2r DB). That means the real issue here is the amount of data, and we need to properly adjust the concurrency settings. Currently, I have two parameters which control injection into the filelumis table: file_lumi_max_size and file_lumi_chunk_size.

In other words, when the input is 200K, we will have 20 iterations (of size 10K), and within each iteration the 10K records will be split across 20 concurrent goroutines (20x500=10K). With these settings the time required to insert 200K lumis exceeds 5 minutes and the FE times out. When I switched these parameters to file_lumi_max_size=100K and file_lumi_chunk_size=500, I got the following injection time:

I plan to upgrade the DBS servers to a newer version today, which in fact fixes #53, but in order to fix this issue with a large number of lumis I bet we need more testing to find the optimal set of concurrency parameters for the DBS server. The problem is that even after I changed the settings in my dev server, your JSON failed to be inserted due to missing dataset info. I need to have in my DB all of the pieces sufficient to insert a block (the error I got complained that it is unable to find the dataset_id for /SingleNeutrino/RunIISummer20UL18RECO-EpsilonPU_106X_upgrade2018_realistic_v11_L1v1-v2/AODSIM).
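
For readers less familiar with the server internals, here is a rough sketch (not the actual dbs2go code; names such as insertFileLumis, maxSize and chunkSize are illustrative) of the two-level chunking with concurrent goroutines described above:

```go
package main

import (
	"fmt"
	"sync"
)

// FileLumi is a minimal stand-in for one file/run/lumi record.
type FileLumi struct {
	FileID int64
	Run    int64
	Lumi   int64
}

// insertFLChunk is a placeholder for the bulk DB insert of one chunk of records.
func insertFLChunk(chunk []FileLumi) error {
	// the real server would execute a bulk INSERT into the FILE_LUMIS table here
	return nil
}

// insertFileLumis splits records into iterations of at most maxSize records,
// and within each iteration inserts sub-chunks of chunkSize concurrently.
func insertFileLumis(records []FileLumi, maxSize, chunkSize int) error {
	for start := 0; start < len(records); start += maxSize {
		end := start + maxSize
		if end > len(records) {
			end = len(records)
		}
		iteration := records[start:end]

		nChunks := (len(iteration) + chunkSize - 1) / chunkSize
		errCh := make(chan error, nChunks)
		var wg sync.WaitGroup
		for cs := 0; cs < len(iteration); cs += chunkSize {
			ce := cs + chunkSize
			if ce > len(iteration) {
				ce = len(iteration)
			}
			wg.Add(1)
			go func(chunk []FileLumi) {
				defer wg.Done()
				if err := insertFLChunk(chunk); err != nil {
					errCh <- err
				}
			}(iteration[cs:ce])
		}
		wg.Wait()
		close(errCh)
		// a closed, empty channel yields nil, so this reports the first error if any
		if err := <-errCh; err != nil {
			return fmt.Errorf("fail to inject FL chunk: %w", err)
		}
	}
	return nil
}

func main() {
	// 200K records with maxSize=10000 and chunkSize=500 gives 20 iterations
	// of 20 concurrent goroutines each, matching the description above.
	records := make([]FileLumi, 200000)
	if err := insertFileLumis(records, 10000, 500); err != nil {
		fmt.Println("insert failed:", err)
	}
}
```

With those settings (max_size=10K, chunk_size=500) a 200K-lumi block is processed as 20 sequential iterations of 20 goroutines each, which is the configuration that exceeded the 5-minute FE limit on int2r.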

So, my suggestion is the following:

amaltaro commented 2 years ago

Thanks for this investigation, Valentin.

The agent component logs (fresh logs pasted here [2]) do not suggest any frontend timeout. As you can see from the timestamps, it took less than a minute between starting the inject call and getting the error back from the web service.

Fine tuning these concurrency parameters will be beneficial, even though I think it's unrelated to the issue I am reporting here. Some time ago I shared this extremely large block with Yuyi: https://amaltaro.web.cern.ch/amaltaro/forYuyi/bad_block.json

You might want to use it for the tuning investigation.

[2]

2022-07-20 10:11:29,899:140103223129856:INFO:DBSUploadPoller:Queueing block for insertion: /SingleNeutrino/RunIISummer20UL18MiniAODv2-EpsilonPU_106X_upgrade2018_realistic_v16_L1v1-v2/MINIAODSIM#f0018dfc-3045-4923-b30b-e427f5384066
2022-07-20 10:11:30,243:140103223129856:INFO:DBSUploadPoller:About to call insert block for: /SingleNeutrino/RunIISummer20UL18MiniAODv2-EpsilonPU_106X_upgrade2018_realistic_v16_L1v1-v2/MINIAODSIM#f0018dfc-3045-4923-b30b-e427f5384066
2022-07-20 10:12:21,733:140103223129856:ERROR:DBSUploadPoller:Error trying to process block /SingleNeutrino/RunIISummer20UL18MiniAODv2-EpsilonPU_106X_upgrade2018_realistic_v16_L1v1-v2/MINIAODSIM#f0018dfc-3045-4923-b30b-e427f5384066 through DBS. Error: [{"error":{"reason":"DBSError Code:110 Description:DBS DB insert record error Function:dbs.filelumis.insertFLChunk Message:fail to inject FL chunk Error: nil","message":"150d7f7dc5e4f8532e16d6dbd37400f56be24def99fa355a298892ccf9064466 unable to insert FileLumis records for /store/mc/RunIISummer20UL18MiniAODv2/SingleNeutrino/MINIAODSIM/EpsilonPU_106X_upgrade2018_realistic_v16_L1v1-v2/70000/3B9B1088-97D6-D048-9CF7-91D5EE5624B8.root, fileID 1605606237, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.filelumis.insertFLChunk Message:fail to inject FL chunk Error: nil","function":"dbs.bulkblocks.InsertBulkBlocksConcurrently","code":110},"http":{"method":"POST","code":400,"timestamp":"2022-07-20 15:12:21.671975007 +0000 UTC m=+1903071.444823219","path":"/dbs/prod/global/DBSWriter/bulkblocks","user_agent":"DBSClient/Unknown/","x_forwarded_host":"dbs-prod.cern.ch","x_forwarded_for":"188.184.31.220","remote_addr":"137.138.63.204:51800"},"exception":400,"type":"HTTPError","message":"DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:150d7f7dc5e4f8532e16d6dbd37400f56be24def99fa355a298892ccf9064466 unable to insert FileLumis records for /store/mc/RunIISummer20UL18MiniAODv2/SingleNeutrino/MINIAODSIM/EpsilonPU_106X_upgrade2018_realistic_v16_L1v1-v2/70000/3B9B1088-97D6-D048-9CF7-91D5EE5624B8.root, fileID 1605606237, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.filelumis.insertFLChunk Message:fail to inject FL chunk Error: nil Error: nested DBSError Code:110 Description:DBS DB insert record error Function:dbs.filelumis.insertFLChunk Message:fail to inject FL chunk Error: nil"}]
belforte commented 2 years ago

FWIW, CRAB (obviously) hit the too-many-lumis-in-one-block problem as well, but things were working after I put a limit of 1 million lumis per block, making smaller blocks as needed from the default of 100 files. But I still think that this stuff is pathological and we should refuse to deal with it. Back to... why do we need to store that lumi list in DBS for such datasets? Who is ever going to read it back?

I do not believe in "try to accommodate whatever shit people throw at you"; it will only make them send uglier stuff.
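
For illustration only, a minimal sketch of the kind of client-side cap described above; the types, function name and limits are hypothetical, and this is not CRAB's actual implementation (sketched here in Go only to match the rest of this repo):

```go
package main

import "fmt"

// File is a minimal stand-in for a file to be grouped into blocks.
type File struct {
	LFN   string
	Lumis int // number of lumi sections carried by this file
}

// splitIntoBlocks groups files into blocks, closing the current block once
// adding the next file would exceed maxFilesPerBlock or maxLumisPerBlock.
func splitIntoBlocks(files []File, maxFilesPerBlock, maxLumisPerBlock int) [][]File {
	var blocks [][]File
	var current []File
	lumiCount := 0
	for _, f := range files {
		if len(current) > 0 &&
			(len(current)+1 > maxFilesPerBlock || lumiCount+f.Lumis > maxLumisPerBlock) {
			blocks = append(blocks, current)
			current, lumiCount = nil, 0
		}
		current = append(current, f)
		lumiCount += f.Lumis
	}
	if len(current) > 0 {
		blocks = append(blocks, current)
	}
	return blocks
}

func main() {
	files := []File{{"a.root", 400000}, {"b.root", 700000}, {"c.root", 10}}
	// default of 100 files per block, capped at 1 million lumis per block
	for i, b := range splitIntoBlocks(files, 100, 1000000) {
		fmt.Printf("block %d: %d files\n", i, len(b))
	}
}
```

The real client logic carries more state, but the effect is the same: a block is closed early once the lumi budget is hit, so no single injection ever carries an unbounded lumi list.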

vkuznet commented 2 years ago

Alan, thanks for the example and the info about WMA. I know for sure that the production database is much faster than int2r, and it may be the case that I see timeouts on the testbed while things are fine on the production DB. That said, I upgraded the DBS servers (reader and writer) on cmsweb-prod, so we now have a fix for #53. I understand that WMA uses a polling cycle, so after some time we can check for this block on cmsweb-prod. Please check this block after a while and report here.

As Stefano pointed out, I strongly suggest starting to impose limits on the number of lumis per block and incorporating this into the WMCore/WMA/McM/PdmV workflows.

belforte commented 2 years ago

I presume that there would be no issue in storing the lumi list, even if large, as a BLOB. The problem IMHO is that we want to write it into a database table... why? For job splitting, when asked to process one file in multiple jobs, we just use the full list of lumis in each file; we do not ask Oracle to pick. For a year (more?) I have been waiting for the "DBS review" where this could maybe be addressed, but I think it is time to have Kevin and Diego bring this up and just say "if there are more than X lumis in one file, we do not store the list as addressable memory", period.

Just an idea of course... feel free.

belforte commented 2 years ago

I mean... at some point I may well hit the point in CRAB where they force me to make 1-file blocks... before that I will simply stop sending lumis in such cases and wait for anybody to notice :-) I have verified that an empty lumi list is accepted (thanks Valentin!). But you in WMA/P&R land may need to tell your stakeholders in advance; raising this only when everything crashes would not be nice. Yeah... I had better shut up.

vkuznet commented 2 years ago

I created a dedicated WMCore ticket for the discussion about DBS scalability and large numbers of lumis, see https://github.com/dmwm/WMCore/issues/11227

amaltaro commented 2 years ago

Whether we need lumi information in DBS or not probably goes beyond my pay grade. But I can say that lumi information in DBS is also used to keep track of the correct parentage information.

Regarding the block that the agent was failing to inject: MINIAODSIM#f0018dfc-3045-4923-b30b-e427f5384066

I can confirm that it finally went through, here is the component log (FNAL timezone, thus -7 compared to CERN):

2022-07-20 11:56:51,351:140103223129856:INFO:DBSUploadPoller:Queueing block for insertion: /SingleNeutrino/RunIISummer20UL18MiniAODv2-EpsilonPU_106X_upgrade2018_realistic_v16_L1v1-v2/MINIAODSIM#f0018dfc-3045-4923-b30b-e427f5384066
2022-07-20 11:56:51,686:140103223129856:INFO:DBSUploadPoller:About to call insert block for: /SingleNeutrino/RunIISummer20UL18MiniAODv2-EpsilonPU_106X_upgrade2018_realistic_v16_L1v1-v2/MINIAODSIM#f0018dfc-3045-4923-b30b-e427f5384066
2022-07-20 11:57:51,655:140103223129856:INFO:BaseWorkerThread:DBSUploadPoller took 77.897 secs to execute

As one can see, it took exactly one minute to insert that information into the database.

Valentin, can you please clarify what change was made to the DBSWriter server? Was it the concurrency values that you tweaked (file_lumi_max_size vs file_lumi_chunk_size)? Thanks

belforte commented 2 years ago

Alan, don't assume that those with a pay grade above yours have a better understanding of these matters than you do.

"lumi information in DBS is also used to keep track of the correct parentage information"

Hmm... that's true, one needs to find which parent file(s) have the correct lumis. Yet... one has to ask if this huge table in Oracle is the best solution. All in all, this is something done in the job preparation step, where we have tons of CPU available to sort and match long lists. Yeah... we need to list all the use cases. I still fear that the problem comes from those crazy MC workflows where a job produces just a handful of events in a file with thousands of fake lumis.
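
As a rough illustration of the matching described here (a sketch only, not WMCore's or DBS's actual parentage algorithm; types and function names are made up), parentage can be resolved at job-preparation time by intersecting a child file's lumi list with each candidate parent file's lumi list:

```go
package main

import "fmt"

// RunLumi identifies one lumi section within a run.
type RunLumi struct {
	Run  int
	Lumi int
}

// findParentFiles returns the parent LFNs whose lumi lists overlap with the
// lumis of a child file. parentLumis maps parent LFN -> its lumi sections.
func findParentFiles(childLumis []RunLumi, parentLumis map[string][]RunLumi) []string {
	childSet := make(map[RunLumi]bool, len(childLumis))
	for _, rl := range childLumis {
		childSet[rl] = true
	}
	var parents []string
	for lfn, lumis := range parentLumis {
		for _, rl := range lumis {
			if childSet[rl] {
				parents = append(parents, lfn)
				break // one shared lumi is enough to establish parentage
			}
		}
	}
	return parents
}

func main() {
	child := []RunLumi{{1, 10}, {1, 11}}
	parents := map[string][]RunLumi{
		"/store/parent/A.root": {{1, 10}, {1, 12}},
		"/store/parent/B.root": {{2, 5}},
	}
	fmt.Println(findParentFiles(child, parents)) // [/store/parent/A.root]
}
```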

vkuznet commented 2 years ago

Alan, the changes to the DBS servers were not related to concurrency tweaking. I fixed a couple of issues, e.g. #53, and also added new changes to the DBS migration codebase. Therefore, this issue is a duplicate of #53, where injection fails when an empty FL chunk is passed to a goroutine. Since your block was successfully injected with the new version of the DBS server, I'm closing this ticket.
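
For reference, a minimal editorial sketch of the guard implied by the #53 fix, i.e. never handing an empty FileLumis chunk to an insertion goroutine (illustrative code, not the actual dbs2go change):

```go
package main

import "fmt"

// chunkIDs splits ids into chunks of at most size elements and skips empty
// chunks, so that no empty FileLumis chunk is ever handed to an insertion
// goroutine (the failure mode described in #53).
func chunkIDs(ids []int64, size int) [][]int64 {
	var chunks [][]int64
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		if chunk := ids[start:end]; len(chunk) > 0 {
			chunks = append(chunks, chunk)
		}
	}
	return chunks
}

func main() {
	ids := make([]int64, 1000)
	fmt.Println(len(chunkIDs(ids, 500))) // 2 non-empty chunks, no empty trailing chunk
}
```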