Update workflows to monitor in PR profiling

jfernan2 commented 4 months ago

Changed to monitor wf 29834.21 (D110 upgrade) and 12634.21 (Run3 2023)

cmsbuild commented 4 months ago

A new Pull Request was created by @jfernan2 for branch master.

@aandvalenzuela, @cmsbuild, @iarspider, @smuzaffar can you please review it and eventually sign? Thanks. @antoniovilela, @rappoccio, @sextonkennedy you are the release managers for this. cms-bot commands are listed here

srimanob commented 4 months ago

Hi @jfernan2, should you use 29834.21 instead of 29634.21? It is a PU workflow.

cmsbuild commented 4 months ago

Pull request #2282 was updated.

jfernan2 commented 4 months ago

Correct, @srimanob, I have fixed it. Thanks!

smuzaffar commented 4 months ago

enable profiling

smuzaffar commented 4 months ago

please test

cmsbuild commented 4 months ago

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7a973e/40235/summary.html
COMMIT: 743195fd4c976150152f9bf126a2e5fd6cad6bdc
CMSSW: CMSSW_14_1_X_2024-07-04-1100/el8_amd64_gcc12
Additional Tests: PROFILING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cms-bot/2282/40235/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

You potentially removed 2 lines from the logs
Reco comparison results: 4 differences found in the comparisons
DQMHistoTests: Total files compared: 48
DQMHistoTests: Total histograms compared: 3345088
DQMHistoTests: Total failures: 3
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3345065
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
Checked 202 log files, 165 edm output root files, 48 DQM output files
TriggerResults: no differences found

smuzaffar commented 4 months ago

@jfernan2, it looks like 12634.21 is not a valid workflow (at least not in 14.1.X). Or do we need to pass an extra option to runTheMatrix.py to run it?

srimanob commented 4 months ago

Hi @smuzaffar, it will need relvals_opt = --what upgrade, as the workflow has not yet been defined in https://github.com/cms-sw/cmssw/blob/master/Configuration/PyReleaseValidation/python/relval_2017.py
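As a minimal sketch of what this means in practice, the helper below assembles a runTheMatrix.py command line that loads the upgrade matrix before selecting the workflows. The function name is hypothetical; -w is the short form of --what used later in this thread, and the -l option for the comma-separated workflow list is an assumption about runTheMatrix.py usage.

```python
import shlex


def run_the_matrix_cmd(workflows, what=("upgrade",)):
    """Build a runTheMatrix.py command for the given workflow numbers.

    Assumptions: -w (short for --what) selects which matrix definitions are
    loaded, and -l takes a comma-separated list of workflow numbers.
    """
    return ["runTheMatrix.py", "-w", ",".join(what), "-l", ",".join(workflows)]


cmd = run_the_matrix_cmd(["12634.21", "29834.21"])
# Shell-quoted form, e.g. for a log line or a copy-paste reproduction.
print(shlex.join(cmd))
```

Returning an argument list rather than a single string keeps it usable directly with subprocess.run, without shell quoting issues.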

srimanob commented 4 months ago

I created the following PR, since the workflow should be defined in relval_2017: https://github.com/cms-sw/cmssw/pull/45381

smuzaffar commented 4 months ago

So we need to provide a way to tell the bot how to run workflows that are not active by default in runTheMatrix. We could add a variable, e.g. PROFILING_OPTS="-w upgrade,standard", to https://github.com/cms-sw/cms-bot/blob/master/cmssw-pr-test-config so that the bot can use it when running runTheMatrix.
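A minimal sketch of how the bot could pick up such a variable, assuming cmssw-pr-test-config uses shell-style KEY="value" lines. The helper names and the sample config text are hypothetical; only the PROFILING_OPTS name and its value come from the proposal above.

```python
import shlex

# Hypothetical excerpt of cmssw-pr-test-config (assumed shell-style format).
CONFIG_TEXT = '''
# extra runTheMatrix options used for profiling
PROFILING_OPTS="-w upgrade,standard"
'''


def read_config(text):
    """Parse KEY="value" lines into a dict, skipping blanks and comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # shlex.split removes the surrounding shell quotes from the value.
        config[key.strip()] = " ".join(shlex.split(value))
    return config


def profiling_cmd(config, workflow):
    """Insert the optional PROFILING_OPTS before the workflow selection."""
    extra = shlex.split(config.get("PROFILING_OPTS", ""))
    return ["runTheMatrix.py", *extra, "-l", workflow]
```

If PROFILING_OPTS is absent, the command falls back to a plain workflow selection, so workflows already in the default matrix keep working unchanged.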

srimanob commented 4 months ago

That would be useful, so that we can handle upgrade workflows. Thanks @smuzaffar. Since the workflows we need should be in relval_2017 anyway, maybe we can first merge https://github.com/cms-sw/cmssw/pull/45381 (after PR tests), followed by this PR.

jfernan2 commented 3 months ago

Dear @srimanob and @smuzaffar, now that https://github.com/cms-sw/cmssw/pull/45381 has been merged, could we revive this PR? Thanks

smuzaffar commented 3 months ago

please test

smuzaffar commented 3 months ago

+externals looks good

cmsbuild commented 3 months ago

This pull request is fully signed and it will be integrated in one of the next master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @rappoccio, @antoniovilela, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2)

cmsbuild commented 3 months ago

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7a973e/40434/summary.html
COMMIT: 743195fd4c976150152f9bf126a2e5fd6cad6bdc
CMSSW: CMSSW_14_1_X_2024-07-16-1100/el8_amd64_gcc12
Additional Tests: PROFILING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cms-bot/2282/40434/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

You potentially added 3 lines to the logs
Reco comparison results: 2 differences found in the comparisons
DQMHistoTests: Total files compared: 48
DQMHistoTests: Total histograms compared: 3345094
DQMHistoTests: Total failures: 3
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3345071
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
Checked 202 log files, 165 edm output root files, 48 DQM output files
TriggerResults: no differences found

jfernan2 commented 3 months ago

@smuzaffar, after the inclusion of this PR I thought we would get profiling comparisons in PR tests, but it looks like something else is missing. See for example the last trial, https://github.com/cms-sw/cmssw/pull/45333
Profiling results for 12634.21 and 29834.21 are there, but the comparison is empty: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d1a85e/40514/summary.html
I fail to see the reason, since the logs apparently show everything OK: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d1a85e/40514/testsResults/profiling.txt
Do you have any hint? Thanks

smuzaffar commented 3 months ago

@jfernan2, although PR and baseline jobs were run for these workflows, I think we need DQM*.root files for the comparison, and workflows 12634.21 and 29834.21 do not generate such output. That is why the comparison was not run.
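To illustrate the condition described above, here is a hypothetical pre-check that splits workflows into those with DQM*.root output (which can go through the DQM comparison) and those without. It works on plain lists of file names; the helper names and the sample listings are invented for the example, not taken from the actual job outputs.

```python
from fnmatch import fnmatchcase


def dqm_files(filenames):
    """Return the sorted subset of file names matching DQM*.root."""
    return sorted(f for f in filenames if fnmatchcase(f, "DQM*.root"))


def split_for_dqm_comparison(outputs):
    """Split {workflow: [output file names]} into (compare, skip) lists."""
    compare, skip = [], []
    for workflow, files in sorted(outputs.items()):
        (compare if dqm_files(files) else skip).append(workflow)
    return compare, skip


# Hypothetical listings for illustration only.
outputs = {
    "12634.21": ["step2.root", "step3.root"],
    "29834.21": ["step2.root", "step3.root"],
    "example_wf": ["step3.root", "DQM_example.root"],
}
print(split_for_dqm_comparison(outputs))
```

fnmatchcase is used instead of fnmatch so that the match is case-sensitive on every platform.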

jfernan2 commented 3 months ago

No, I was referring to the comparison of the igprof text results. These comparison files are empty:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d1a85e/40514/profiling/12634.21/RES_CPU_compare_12634.21.txt
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d1a85e/40514/profiling/29834.21/RES_CPU_compare_29834.21.txt
Thanks
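A quick way to flag this situation automatically could look like the sketch below. It treats a RES_CPU_compare_*.txt report as empty when it contains no non-blank lines. The function names are hypothetical and not part of cms-bot, and the directory scan assumes the profiling/<workflow>/RES_CPU_compare_<workflow>.txt layout seen in the URLs above.

```python
from pathlib import Path


def is_effectively_empty(text):
    """Return True if a comparison report contains no non-blank lines."""
    return not any(line.strip() for line in text.splitlines())


def empty_reports(reports):
    """Given {report_name: contents}, return the sorted names of empty reports."""
    return sorted(name for name, text in reports.items() if is_effectively_empty(text))


def scan_profiling_dir(profiling_dir):
    """Check every RES_CPU_compare_*.txt one level below profiling_dir."""
    reports = {
        path.name: path.read_text(errors="replace")
        for path in Path(profiling_dir).glob("*/RES_CPU_compare_*.txt")
    }
    return empty_reports(reports)
```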

smuzaffar commented 3 months ago

Maybe @gartung knows why the profiling comparison is empty.

makortel commented 3 months ago

@gartung is on vacation until August 9.

gartung commented 3 months ago

It looks like a segfault in the reco step for 12634.21 https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d1a85e/40514/profiling/12634.21/step3_igprof_cpu.log

gartung commented 3 months ago

Segfault in the reco step for 29834.21 as well https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d1a85e/40514/profiling/29834.21/step3_igprof_cpu.log
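When a comparison comes out empty, scanning the step logs for crash markers can point at the cause, as it did here. This is only a sketch: the two marker strings (the CMSSW fatal-signal message and a generic "Segmentation fault") are assumptions about what such logs contain, and the function name is hypothetical.

```python
import re

# Assumed crash markers; adjust to what the actual step logs print.
SEGFAULT_PATTERNS = [
    re.compile(r"fatal system signal has occurred: segmentation violation", re.IGNORECASE),
    re.compile(r"segmentation fault", re.IGNORECASE),
]


def find_segfaults(log_text):
    """Return (line_number, line) pairs that look like segfault reports."""
    hits = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SEGFAULT_PATTERNS):
            hits.append((lineno, line.strip()))
    return hits
```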

jfernan2 commented 3 months ago

Thanks @gartung. I am trying one last time, since it worked in https://github.com/cms-sw/cms-bot/issues/1912

jfernan2 commented 3 months ago

@gartung, the same two workflows, 29834.21 and 12634.21, run fine with igprof in the Jenkins profiling job [1].

I believe the difference is in how igprof is called, namely with -j JobReport:

Crashes:
igprof -pp -d -t cmsRun -z -o ./igprofCPU_step3.gz -- cmsRun step3_igprof.py -j step3_igprof_cpu_JobReport.xml >& step3_igprof_cpu.log

Runs fine:
igprof -d -pp -z -o step3_igprofCPU.gz -t cmsRun cmsRun step3_igprof.py
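The difference between the two invocations can be expressed as a small command builder in which the cmsRun job report is optional. This is a hypothetical sketch, not the actual code in profileRunner.py or the profiling repository; the igprof flags are copied verbatim from the working command above, and -j is appended only when a report file name is passed.

```python
def igprof_cpu_cmd(config, out_gz, job_report=None):
    """Build an igprof CPU-profiling command line around cmsRun.

    Flags are copied from the working command quoted above. The cmsRun
    -j <file> job report is added only if job_report is given.
    """
    cmd = ["igprof", "-d", "-pp", "-z", "-o", out_gz, "-t", "cmsRun",
           "cmsRun", config]
    if job_report:
        cmd += ["-j", job_report]
    return cmd


# Without a job report (the variant that runs fine in Jenkins):
print(" ".join(igprof_cpu_cmd("step3_igprof.py", "step3_igprofCPU.gz")))
```

Keeping the job report optional makes it easy to switch the XML output back on if it turns out to be needed somewhere.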

For some reason the JobReport option was removed for igprof here [2], and since those XML files do not seem to be used anywhere, I am proposing the following PR, if you agree:

https://github.com/cms-cmpwg/profiling/pull/8

[1] https://cmssdt.cern.ch/jenkins/job/release-run-reco-profiling/533/console
https://cmssdt.cern.ch/jenkins/job/release-run-reco-profiling/538/console
[2] https://github.com/cms-sw/cms-bot/blob/5bac572863f680e3d69e22005e437665b5df666f/reco_profiling/profileRunner.py#L335

makortel commented 3 months ago

I'd find it very strange if the framework job report would be causing segfaults under IgProf, but who knows.

jfernan2 commented 3 months ago

Me too, @makortel, but I have just repeated the test in Jenkins, successfully for both workflows, using igprof without the JobReport output...

jfernan2 commented 2 months ago

@smuzaffar, igprof is still giving problems. However, the baseline does not seem to run igprof for the two workflows in question (12634.21 and 29834.21), so we are missing the reference anyway. See for example this recent trial: https://cmssdt.cern.ch/SDT/jenkins-artifacts/ib-baseline-tests/CMSSW_14_2_X_2024-09-01-2300/el8_amd64_gcc12/-GenuineIntel/matrix-results/

Any idea? Thanks

cms-sw / cms-bot

Update workflows to monitor in PR profiling #2282