cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.09k stars 4.33k forks source link

Moved Egamma TensorFlow sessions to global cache #46655

Closed valsdav closed 1 day ago

valsdav commented 2 weeks ago

PR description:

Solves #46494. Related to #46040

EgammaDNNHelper was storing TF graphs globally by job, but TF sessions were owned my the GSFElectronProducer and GEDPhotonProducers by stream. This PR moves the TFSessions in the EgammaDNNHelper, making them global by job.

PR validation:

Purely technical PR, no changes expected.

valsdav commented 2 weeks ago

assign ml

cmsbuild commented 2 weeks ago

cms-bot internal usage

cmsbuild commented 2 weeks ago

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-46655/42576

cmsbuild commented 2 weeks ago

New categories assigned: ml

@valsdav,@y19y19 you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 2 weeks ago

A new Pull Request was created by @valsdav for master.

It involves the following packages:

@cmsbuild, @jfernan2, @mandrenguyen, @valsdav, @y19y19 can you please review it and eventually sign? Thanks. @Prasant1993, @Sam-Harper, @a-kapoor, @afiqaize, @jainshilpi, @lgray, @missirol, @ram1123, @rovere, @sameasy, @sobhatta, @varuns23 this is something you requested to watch as well. @antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

valsdav commented 2 weeks ago

please test

cmsbuild commented 2 weeks ago

-1

Failed Tests: UnitTests RelVals RelVals-INPUT AddOn Size: This PR adds an extra 44KB to repository Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0fe9c2/42719/summary.html COMMIT: 90d7ded04bc3a8de4099feaadf777a77b80876ec CMSSW: CMSSW_14_2_X_2024-11-11-1100/el8_amd64_gcc12 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/46655/42719/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 4 errors in the following unit tests:

---> test test_MC_22_crosscheck had ERRORS
---> test test_MC_23_crosscheck had ERRORS
---> test test_MC_23_setup had ERRORS
and more ...

RelVals

RelVals-INPUT

AddOn Tests

valsdav commented 2 weeks ago

please test

valsdav commented 2 weeks ago

please abort test

valsdav commented 2 weeks ago

please test

cmsbuild commented 2 weeks ago

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-46655/42587

cmsbuild commented 2 weeks ago

Pull request #46655 was updated. @jfernan2, @mandrenguyen, @valsdav, @y19y19 can you please check and sign again.

cmsbuild commented 2 weeks ago

-1

Failed Tests: Build ClangBuild Size: This PR adds an extra 20KB to repository Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0fe9c2/42736/summary.html COMMIT: 83d1d980bc4d87c684d97c733c98984907aec08a CMSSW: CMSSW_14_2_X_2024-11-11-1100/el8_amd64_gcc12 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/46655/42736/install.sh to create a dev area with all the needed externals and cmssw changes.

Build

I found compilation warning when building: See details on the summary page.

Clang Build

I found compilation error while trying to compile with clang. Command used:

USER_CUDA_FLAGS='--expt-relaxed-constexpr' USER_CXXFLAGS='-Wno-register -fsyntax-only' scram build -k -j 32 COMPILER='llvm compile'
>> Entering Package RecoEgamma/EgammaPhotonProducers
>> Entering Package RecoEgamma/EgammaTools
>> Entering Package RecoEgamma/ElectronIdentification
>> Entering Package RecoEgamma/PhotonIdentification
>> Compile sequence completed for CMSSW CMSSW_14_2_X_2024-11-11-1100
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 1
+ eval scram build outputlog '&&' '(python3' /data/cmsbld/jenkins/workspace/ib-run-pr-tests/cms-bot/buildLogAnalyzer.py --logDir /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_14_2_X_2024-11-11-1100/tmp/el8_amd64_gcc12/cache/log/src '||' 'true)'
++ scram build outputlog
>> Entering Package RecoEgamma/EgammaElectronProducers
Entering library rule at src/RecoEgamma/EgammaElectronProducers/plugins
>> Compiling edm plugin src/RecoEgamma/EgammaElectronProducers/plugins/ElectronNHitSeedProducer.cc

cmsbuild commented 2 weeks ago

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-46655/42591

cmsbuild commented 2 weeks ago

Pull request #46655 was updated. @cmsbuild, @jfernan2, @mandrenguyen, @valsdav, @y19y19 can you please check and sign again.

valsdav commented 2 weeks ago

please test

cmsbuild commented 1 week ago

+1

Size: This PR adds an extra 20KB to repository Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0fe9c2/42744/summary.html COMMIT: 22246a6567688228476da6fa57a6dd9cb62255f7 CMSSW: CMSSW_14_2_X_2024-11-11-1100/el8_amd64_gcc12 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/46655/42744/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

valsdav commented 1 week ago

Hi @makortel, I would like to reproduce the memory profile of #46040 to compare the memory usage. Would you have any script to share to reproduce the same igProf memory profile that's in the issue? Thanks!

makortel commented 1 week ago

Below is an untested recipe to reproduce what I did for https://github.com/cms-sw/cmssw/issues/46494

# Set up developer area
cmssw-el7
export SCRAM_ARCH=slc7_amd64_gcc12
cmsrel CMSSW_14_0_16
cd CMSSW_14_0_16/src
cmsenv

# Set up the configuration
cp /eos/home-c/cmst0/public/PausedJobs/Run2024G/maxPSS/PromptReco_Run386319_Muon1/job/WMTaskSpace/cmsRun1/PSet.p* .
xrdcp root://eoscms.cern.ch//eos/cms/tier0/store/data/Run2024H/Muon1/RAW/v1/000/386/319/00000/27604955-1005-455b-90eb-8195e2f02cb8.root .
cat >>PSet.py <<EOF
process.source.fileNames = ["file:27604955-1005-455b-90eb-8195e2f02cb8.root"]
process.maxEvents.input = 20
process.options.numberOfThreads = 4
process.options.numberOfStreams = 4
process.IgProfService = cms.Service('IgProfService',
    reportFirstEvent = cms.untracked.int32(0),
    reportEventInterval = cms.untracked.int32(10),
    reportToFileAtPostEvent     = cms.untracked.string('| gzip -c > 4th_4st_07.mem.%I.gz')
)
EOF

# Run IgProf memory profiler
igprof -d -t cmsRunGlibC -z -mp -o test_07.mem.gz cmsRunGlibC PSet.py > out.txt 2>&1
igprof-analyse --sqlite -d -v -g -r MEM_LIVE test_4th_4st_07.mem.20.gz | sqlite3 test_4th_4st_07.20_live.sql3

# Make the memory profile web browseable (needs a web area with cgi-bin access)
cp (which igprof-navigator) <web dir>/cgi-bin
mkdir <web dir>/cgi-bin/data
cp test_4th_4st_07.20_live.sql3 <web dir>/cgi-bin/data

Open web browser to <web dir>/cgi-bin/igprof-navigator/test_4th_4st_07.20_live.

valsdav commented 1 week ago

Thanks a lot! Unfortunately the data file is no more available, can I just use another one?

cmsbuild commented 5 days ago

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-46655/42713

cmsbuild commented 5 days ago

Pull request #46655 was updated. @cmsbuild, @jfernan2, @mandrenguyen, @valsdav, @y19y19 can you please check and sign again.

valsdav commented 5 days ago

Below is an untested recipe to reproduce what I did for #46494

# Set up developer area
cmssw-el7
export SCRAM_ARCH=slc7_amd64_gcc12
cmsrel CMSSW_14_0_16
cd CMSSW_14_0_16/src
cmsenv

# Set up the configuration
cp /eos/home-c/cmst0/public/PausedJobs/Run2024G/maxPSS/PromptReco_Run386319_Muon1/job/WMTaskSpace/cmsRun1/PSet.p* .
xrdcp root://eoscms.cern.ch//eos/cms/tier0/store/data/Run2024H/Muon1/RAW/v1/000/386/319/00000/27604955-1005-455b-90eb-8195e2f02cb8.root .
cat >>PSet.py <<EOF
process.source.fileNames = ["file:27604955-1005-455b-90eb-8195e2f02cb8.root"]
process.maxEvents.input = 20
process.options.numberOfThreads = 4
process.options.numberOfStreams = 4
process.IgProfService = cms.Service('IgProfService',
    reportFirstEvent = cms.untracked.int32(0),
    reportEventInterval = cms.untracked.int32(10),
    reportToFileAtPostEvent     = cms.untracked.string('| gzip -c > 4th_4st_07.mem.%I.gz')
)
EOF

# Run IgProf memory profiler
igprof -d -t cmsRunGlibC -z -mp -o test_07.mem.gz cmsRunGlibC PSet.py > out.txt 2>&1
igprof-analyse --sqlite -d -v -g -r MEM_LIVE test_4th_4st_07.mem.20.gz | sqlite3 test_4th_4st_07.20_live.sql3

# Make the memory profile web browseable (needs a web area with cgi-bin access)
cp (which igprof-navigator) <web dir>/cgi-bin
mkdir <web dir>/cgi-bin/data
cp test_4th_4st_07.20_live.sql3 <web dir>/cgi-bin/data

Open web browser to <web dir>/cgi-bin/igprof-navigator/test_4th_4st_07.20_live.

Hi @makortel, I managed to replicate the recipe. There was a crash at the end of the processing but it seems that the igprof report has been filled. Unfortunately my cgi-bin setup has some problem (it was working in the past). Would you mind copying the file in your folder please? https://dvalsecc.web.cern.ch/cgi-bin/data/test_4th_4st_07.20_live.sql3

valsdav commented 5 days ago

please test

cmsbuild commented 5 days ago

+1

Size: This PR adds an extra 20KB to repository Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0fe9c2/42966/summary.html COMMIT: 2a58ae2f7976db3ca790e20cc381687d99933216 CMSSW: CMSSW_14_2_X_2024-11-19-2300/el8_amd64_gcc12 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/46655/42966/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

cmsbuild commented 3 days ago

Milestone for this pull request has been moved to CMSSW_15_0_X. Please open a backport if it should also go in to CMSSW_14_2_X.

jfernan2 commented 3 days ago

solves https://github.com/cms-sw/cmssw/pull/46494

jfernan2 commented 3 days ago

+1

valsdav commented 3 days ago

+ml

Technical PR, moved TF sessions to global cache

cmsbuild commented 3 days ago

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @antoniovilela, @mandrenguyen, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

valsdav commented 3 days ago

@makortel we probably want to backport this to the release used in T0 processing at the moment, right?

makortel commented 3 days ago

Would you mind copying the file in your folder please? https://dvalsecc.web.cern.ch/cgi-bin/data/test_4th_4st_07.20_live.sql3

I can, but I'd need a path to the file on lxplus that I could access (access through https results in the cgi-bin script being called that fails with internal server error, and I don't have permissions to access the file through EOS)

makortel commented 3 days ago

we probably want to backport this to the release used in T0 processing at the moment, right?

Based on https://github.com/cms-sw/cmssw/issues/46494 the expected memory savings would be about 7 MB for an 8-stream job. It isn't much compared to a full reconstruction job, but it is something nevertheless. My take is that backports would not be essential, but if done, I'd do backport also to 14_0_X to potentially benefit the MC production.

mandrenguyen commented 1 day ago

+1