Open ggonzr opened 1 year ago
@ggonzr Hi Geovanny, apologies for the delay on getting back to this.
From what I can see, the worst offender in these job configuration files (PSet) is, by far, the list of input files. To the best of my knowledge, there are basically 2 potential lists of input files:
a) primary input files (which can be classified as empty, primary EDM data, or lhe files)
b) secondary input files (apparently classified as premix or classic)
Apart from lhe files (which use LHESource), all input files are overwritten when a job is bootstrapped on the worker node.
However, to be on the safe side, I would suggest that we stop providing the list of secondary files in the final PSet that gets uploaded to central CouchDB / ReqMgr2, while keeping the primary files untouched, at least for the moment.
As for the list of secondary files, I can see that it can be provided through the following attributes in the PSet:
- process.mix.input.fileNames: for classical pileup
- process.mixData.input.fileNames: for premix pileup
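As a rough illustration of dropping only these secondary (pileup) lists, the sketch below empties the two attributes in a PSet that has been serialized to a nested dictionary. The nested-dictionary layout and the helper name `clear_secondary_files` are assumptions made for this example; they are not part of McM or WMCore.

```python
# Dotted attribute paths taken from the discussion above.
SECONDARY_PATHS = (
    "process.mix.input.fileNames",      # classical pileup
    "process.mixData.input.fileNames",  # premix pileup
)


def clear_secondary_files(pset, paths=SECONDARY_PATHS):
    """Empty the listed attributes in place and return the paths that were cleared.

    Assumes `pset` is a nested dict keyed by attribute name (hypothetical layout).
    Paths that are missing, or that do not point to a list, are skipped.
    """
    cleared = []
    for path in paths:
        *parents, leaf = path.split(".")
        node = pset
        for key in parents:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict) and isinstance(node.get(leaf), list):
            node[leaf] = []
            cleared.append(path)
    return cleared
```

The primary list under `process.source.fileNames` is deliberately not in `SECONDARY_PATHS`, so it is left untouched.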
I am unsure, though, whether there is any other module that could be used for that. We would have to cross-check this with framework experts or someone in PdmV. Just in case, I also provide configuration examples:
Hi Alan (@amaltaro),
Thanks for the feedback. Checking with PdmV conveners, all the files listed in that field are primary input files so, based on the approach shared, we cannot discard them. Also, there are no secondary files listed for this McM request (the attribute available in the JSON file at the same level, secondaryFileNames, is empty).
Is there any other approach we can follow from the PdmV side to upload this information in the ReqMgr2 config cache or would it be possible to increase the maximum allowed size to accept this information as it is?
Best regards, Geovanny
@ggonzr Hi Geovanny, I see you have the following data structure in your zipped file (json of the PSet):
{"docs": [
{"pset_tweak_details":
{"process":
{"options":
{"source": {"parameters_": ["fileNames", "secondaryFileNames", "inputCommands", "dropDescendantsOfDroppedBranches"]
"fileNames": [HUGE list of LFNs]
Both fileNames and secondaryFileNames are NOT used as provided in the original PSet configuration; instead, WMAgent updates them during the job runtime. As mentioned above, the only exception is for the LHESource, where a list of non-EDM files is provided (their file extension is .lhe).
Do you think we could identify those specific cases in McM and let them through, while primary and secondary files get removed from the configuration uploaded to CouchDB/ReqMgr2? Those requests should be called wmLHE (or pLHE, I always confuse them!).
Sorry for pushing into the non-easy direction, but this is definitely the most sustainable solution.
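A minimal sketch of the rule proposed above, assuming the source block is available as a dictionary with the `fileNames` and `secondaryFileNames` keys shown in the JSON excerpt. Using the `.lhe` extension as a proxy for LHESource is an assumption of this example, and the helper names are hypothetical.

```python
def is_lhe_source(source):
    """Guess whether a source block feeds LHESource: every primary file ends in .lhe."""
    files = source.get("fileNames") or []
    return bool(files) and all(str(name).endswith(".lhe") for name in files)


def strip_input_files(source):
    """Return a copy of `source` with both file lists emptied, unless it looks like LHE."""
    stripped = dict(source)
    if is_lhe_source(source):
        # LHE inputs are the exception: keep the lists as provided.
        return stripped
    for key in ("fileNames", "secondaryFileNames"):
        if key in stripped:
            stripped[key] = []
    return stripped
```

The function returns a shallow copy, so the original source block stays intact in case the upload has to be retried with the full lists.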
@ggonzr hi Geovanny, I just wanted to follow up on this issue and hear whether you have made any modifications on your side and/or if you need further information from the WM side?
Hi Alan (@amaltaro), I performed some tests to retrieve the source type from the cmssw embedded code, to check whether the files can be discarded or not following the advice given. Unfortunately, we paused this development because we had (and still have) other tasks with higher priority to solve. This issue is mainly related to a test request we want to process, so there is no hurry to finish it. No changes related to this have been deployed in our production environments. I will let you know if I require any assistance from your side or if there is any update from the PdmV side.
Thanks, Best regards, Geovanny
Impact of the bug
System affected:
Describe the bug
When McM needs to upload the configuration to ReqMgr2 Config Cache for a request with a large number of files listed in the 'PSet' attribute, the request body becomes large: the HTTP request size is more than 8 MB. The upload then fails because the maximum HTTP request size allowed by this DB is 8 MB [1] (as confirmed via email). As a result, this operation returns an HTTP 400 response with the message:
{"error": "document_too_large", "reason": ""}
How to reproduce it
Send an HTTP POST request to the endpoint https://cmsweb.cern.ch/couchdb/reqmgr_config_cache/_bulk_docs, including in the body the JSON content available in the file BPH-GenericGSmearS-00001.json (attached to this issue as an example, in a zip file: BPH-GenericGSmearS-00001.zip). This request has around 74K file names registered under the list: docs → (first element) → "pset_tweak_details" → "process" → "source" → "fileNames"
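Before sending the request, the body size can be checked locally. The sketch below is only an illustration: it assumes the 8 MB figure means 8 * 10**6 bytes and that the body is the JSON document encoded as UTF-8. The helper names are hypothetical, and the exact size depends on how the client serializes the JSON.

```python
import json

# Assumed limit: the 8 MB figure quoted in this issue, taken as 8 * 10**6 bytes.
ASSUMED_LIMIT_BYTES = 8_000_000


def body_size_bytes(payload):
    """Size of the UTF-8 encoded JSON body that would be sent."""
    return len(json.dumps(payload).encode("utf-8"))


def source_file_count(payload):
    """Number of file names under docs[0].pset_tweak_details.process.source.fileNames."""
    source = payload["docs"][0]["pset_tweak_details"]["process"]["source"]
    return len(source["fileNames"])


def exceeds_limit(payload, limit=ASSUMED_LIMIT_BYTES):
    """True when the serialized body is larger than the assumed limit."""
    return body_size_bytes(payload) > limit
```

For the attached example, `payload` would be the result of `json.load` on BPH-GenericGSmearS-00001.json.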
Expected behavior
The HTTP request to upload the configuration to ReqMgr2 Config Cache should be accepted and finished successfully.
Additional context and error message
With the feedback received by email, I think we can have the following solutions available:
- Increase the maximum allowed size for an HTTP request by updating the configuration attribute max_document_size [1].
- Reduce the HTTP request size to be lower than the limit. As shared in the email discussion, the list of filenames listed in the PSet attribute could be dropped. If so, please describe the conditions that allow us (from PdmV side) to discard this attribute.
Thanks, Best regards, Geovanny
References
[1] CouchDB Configuration – max_document_size – Available at: https://docs.couchdb.org/en/stable/config/couchdb.html#couchdb/max_document_size