DQM test `TestDQMGUIUpload` times out

nothingface0 commented 1 week ago

The recently added TestDQMGUIUpload (#46551) has been failing even after 10 minutes of waiting, in recent PR tests and in an IB.

From the logs of the target DQMGUI, my first impression is that during periods of heavy dev DQMGUI activity (uploads of tier0 replays, PR root files), it can take a significant amount of time for the file uploaded by the test to be properly registered, so the test times out. If this is the only problem with the test, we could increase the maximum waiting time.

Unfortunately, I forgot to add %H to the timestamp that is added to the file, so I don't know exactly how long it takes the DQMGUI to discover each uploaded file: I know when the file arrived and was imported, but not when the test started.
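For illustration, a minimal Python sketch of a timestamp tag that keeps the hour field. The test's real format string is not shown in this thread, so `TAG_FORMAT` and the helper name `make_file_tag` are assumptions; the only point is that `%H` sits between the date and the minutes.

```python
from datetime import datetime, timezone

# Hypothetical format: the format string actually used by the test is not
# shown in this thread. %H is the zero-padded hour (00-23).
TAG_FORMAT = "%Y%m%d%H%M%S"


def make_file_tag(now=None):
    """Return a UTC timestamp tag that includes the hour."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now.strftime(TAG_FORMAT)


# Arbitrary example instant, chosen only to show the resulting layout.
example = datetime(2024, 11, 12, 11, 29, 1, tzinfo=timezone.utc)
print(make_file_tag(example))  # 20241112112901
```

With the hour present, the tag in the file name can be compared directly against the DQMGUI log timestamps, after accounting for the UTC/CET offset.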

I will keep this issue updated as I investigate from the DQM side.

cmsbuild commented 1 week ago

cms-bot internal usage

cmsbuild commented 1 week ago

A new Issue was created by @nothingface0.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 1 week ago

assign dqm

cmsbuild commented 1 week ago

New categories assigned: dqm

@antoniovagnerini,@rseidita you have been requested to review this Pull request/Issue and eventually sign? Thanks

nothingface0 commented 1 week ago

While debugging, we ran into a second issue and had to restart the dev DQMGUI, which in turn caused a further issue. We are investigating.

Cms-talk post here

smuzaffar commented 1 week ago

@nothingface0, any idea why only the unit test fails while the DQM bin-by-bin comparison works [a]? The bin-by-bin comparison also uses visDQMUpload.py to upload many root files to https://cmsweb.cern.ch/dqm/dev, see https://github.com/cms-sw/cmssw/blob/master/DQMServices/FileIO/scripts/compareDQMOutput.py#L73-L99

[a]

Uploading output:Uploading output:

visDQMUpload.py https://cmsweb.cern.ch/dqm/dev /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/cms-bot/dqm-comparison/dqmComparisonOutput/pr/DQM_V0001_R000000001__RelVal_wf10224_0_pr__CMSSW_14_2_X-PRcmssw_46662-65580__DQMIO.root
visDQMUpload.py https://cmsweb.cern.ch/dqm/dev /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/cms-bot/dqm-comparison/dqmComparisonOutput/pr/DQM_V0001_R000000001__RelVal_wf13034_0_pr__CMSSW_14_2_X-PRcmssw_46662-65580__DQMIO.root
Uploading output:
visDQMUpload.py https://cmsweb.cern.ch/dqm/dev /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/cms-bot/dqm-comparison/dqmComparisonOutput/pr/DQM_V0001_R000165121__wf1000_0_pr__CMSSW_14_2_X-PRcmssw_46662-65580__DQMIO.root
Uploading output:Uploading output:

nothingface0 commented 1 week ago

@smuzaffar Regarding the bin-by-bin comparison, from what I understand it's done locally, where the test is running, and then the results are uploaded. In the script you link, there's no validation that the upload itself worked, e.g. by checking the GUI after the upload finished: it's just comparing and uploading.

smuzaffar commented 1 week ago

The unit test is failing at the time of upload [a], in visDQMUpload.py, right? And this upload is working for the DQM bin-by-bin comparison; otherwise we would have seen visDQMUpload.py failing there too, right?

[a]

+ visDQMUpload.py https://cmsweb.cern.ch/dqm/dev DQM_V0001_R000000001__Harvesting__DQMTests202411134029559212184__DQMIO.root
DQM_V0001_R000000001__Harvesting__DQMTests202411134029559212184__DQMIO.root
Using SSL private key /data/cmsbld/jenkins/workspace/ib-run-qa/x509up_u501
Using SSL public key /data/cmsbld/jenkins/workspace/ib-run-qa/x509up_u501
ERROR HTTP Error 500: Internal Server Error
Status code:  None
Message:      None
Detail:       None

nothingface0 commented 1 week ago

Taking this failed test as an example, judging from the logs I found in DQMGUI and the test's logs:

The test file to be uploaded to the DQMGUI was created on 2024-11-12 at XX:29:01 (the hour is missing because I forgot the %H in the date command run by the test).
The file first appears in DQMGUI's logs on 2024-11-12 at 12:29:02 CET (hence I'm assuming XX = 12, meaning that the file was sent within a second of being created).
The file was processed by DQMGUI on 2024-11-12 at 14:43 CET, more than 2 hours later. The processing itself took 6 seconds.
DQMGUI logs around the time of the file's arrival include errors opening root files regarding different PRs (e.g. DQM_V0001_R000000001__RelVal_wf2500_201_base__CMSSW_14_2_X-PRcmssw_46659-65571__DQMIO.root).
There is also a big chunk of time (~2 hours, from 12:45 to 14:41) where DQMGUI was occupied by the processing of a single root file: DQM_V0001_R000380306__wf2024_202001_base__CMSSW_14_2_X-PRcmssw_46666-65573__DQMIO.root. The file itself is small (27K) so I'm thinking something must have broken there.

Takeaway points:

The file uploaded by the test in question is small, arrives almost instantaneously, and is processed quickly, once processing starts.
We definitely can't increase the timeout of the test to 2 hours to include all cases.
We'll need to do some more investigation to find the reasons for the delay in root file processing in the DQMGUI.

nothingface0 commented 1 week ago

Another instance of the failure here.

The file is generated at 2024/11/14 14:30:16 CET.
It's uploaded and acknowledged by DQMGUI at 14:30:19 CET (within 3 seconds).
It's received by the GUI's background workers at 15:31:50 CET (more than 1 hour later).
Processing starts at 15:32:27 and finishes by 15:32:33 (6 seconds).

On the other hand, for this successful test:

The file is generated at 2024/11/13 18:03:37 CET (17:03:37 UTC, which is what the date command used to name the test file returns).
It's uploaded and acknowledged by DQMGUI at 18:03:37 CET (within a second).
It's received by the GUI's background workers at 18:03:42 CET (5 seconds later).
Processing starts at 18:03:50 and finishes by 18:03:56 (6 seconds).
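The stage-by-stage delays in the two timelines above can be recomputed with a short Python sketch. The timestamps are the ones quoted in this comment (all taken as CET); the helper name `stage_delays` and the stage labels are only illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def stage_delays(stamps):
    """Return seconds elapsed between consecutive (label, timestamp) stages."""
    parsed = [(label, datetime.strptime(ts, FMT)) for label, ts in stamps]
    return {
        f"{a} -> {b}": int((tb - ta).total_seconds())
        for (a, ta), (b, tb) in zip(parsed, parsed[1:])
    }


# Timestamps as quoted above, all in CET.
failed = [
    ("generated", "2024-11-14 14:30:16"),
    ("acknowledged", "2024-11-14 14:30:19"),
    ("received", "2024-11-14 15:31:50"),
    ("started", "2024-11-14 15:32:27"),
    ("finished", "2024-11-14 15:32:33"),
]
successful = [
    ("generated", "2024-11-13 18:03:37"),
    ("acknowledged", "2024-11-13 18:03:37"),
    ("received", "2024-11-13 18:03:42"),
    ("started", "2024-11-13 18:03:50"),
    ("finished", "2024-11-13 18:03:56"),
]

for name, run in (("failed", failed), ("successful", successful)):
    print(name, stage_delays(run))
```

In both runs the upload and the processing take seconds; the difference is entirely in the "acknowledged -> received" step (3691 seconds in the failed run versus 5 seconds in the successful one).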

smuzaffar commented 1 week ago

How about we disable this test for PRs/IBs? We could run it as a special test for each IB (just like we run tests for crab and hlt), and there we can increase the wait time to a few hours (we can run it on lxplus so it will not waste our build resources). If the file is not processed after, let's say, 6 hours, then we can mark the test as failed.
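A rough Python sketch of such a long wait, assuming the check itself is supplied by the caller. The helper `wait_until` is hypothetical and does not reflect how the existing test is written; how to ask the DQMGUI whether the file has been registered is left out on purpose.

```python
import time


def wait_until(check, timeout_s, interval_s,
               clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or timeout_s has elapsed.

    Returns True if check() succeeded, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Stub standing in for "is the uploaded file visible in the GUI yet?".
attempts = {"n": 0}


def file_is_registered():
    attempts["n"] += 1
    return attempts["n"] >= 3  # pretend the file shows up on the third poll


# A real run would use something like timeout_s=6 * 3600, interval_s=300.
print(wait_until(file_is_registered, timeout_s=5, interval_s=0.01))
```

Passing `clock` and `sleep` as parameters lets the timeout logic be exercised with a fake clock instead of waiting for hours.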

nothingface0 commented 1 week ago

> How about we disable this test for PRs/IBs? We could run it as a special test for each IB (just like we run tests for crab and hlt), and there we can increase the wait time to a few hours (we can run it on lxplus so it will not waste our build resources). If the file is not processed after, let's say, 6 hours, then we can mark the test as failed.

I didn't know there was such an option, sounds good to me! Let me know if any modifications are required for the test.

cms-sw / cmssw

DQM test `TestDQMGUIUpload` times out #46682