cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.08k stars 4.31k forks source link

Fix memory leak issues in Calo Layer 1 UCT Infrastructure #46543

Closed aloeliger closed 7 hours ago

aloeliger commented 2 days ago

PR description:

This PR should handle some of the worst offending pointer issues and memory leak problems in https://github.com/cms-sw/cmssw/issues/46526. This swaps a lot of Calo Layer 1 Infrastructure pointers over to std::shared_ptr's to try and avoid memory leak issues. Summary card regions are now stored as the objects themselves, instead of pointers to them (That decision may not be directly necessary? It makes sense to me, and there is a spot where pointers to these things are used internally, and nullptrs are possible but checked against)

This is just a first pass, but a larger fix of all of this infrastructure with smart pointers should be considered.

@mmusich @makortel

PR validation:

Code compiles, and has had formatting applied

To be updated with further tests. PR still in testing.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

This PR is not a backport, but may need backporting to releases responsible for current prompt reco processing for patches.

cmsbuild commented 2 days ago

cms-bot internal usage

cmsbuild commented 2 days ago

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-46543/42416

cmsbuild commented 2 days ago

A new Pull Request was created by @aloeliger for master.

It involves the following packages:

@aloeliger, @cmsbuild, @epalencia can you please review it and eventually sign? Thanks. @Martin-Grunewald, @missirol, @mmusich this is something you requested to watch as well. @antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

mmusich commented 2 days ago

@cmsbuild, please test

cmsbuild commented 2 days ago

-1

Failed Tests: ClangBuild Size: This PR adds an extra 40KB to repository Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-4cff68/42450/summary.html COMMIT: 1310c975b4d5d30c508a5fbcee7d31628848787c CMSSW: CMSSW_14_2_X_2024-10-29-1100/el8_amd64_gcc12 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/46543/42450/install.sh to create a dev area with all the needed externals and cmssw changes.

Clang Build

I found compilation warning while trying to compile with clang. Command used:

USER_CUDA_FLAGS='--expt-relaxed-constexpr' USER_CXXFLAGS='-Wno-register -fsyntax-only' scram build -k -j 32 COMPILER='llvm compile'

See details on the summary page.

aloeliger commented 2 days ago

Weird. Didn't see the set but unused warnings when I compiled it. Let me just double check and I'll pull those out tomorrow morning.

mmusich commented 1 day ago

Didn't see the set but unused warnings when I compiled it.

Normal compilation won't show the problem. In order to reproduce you need to run clang (using the command scram build -k -j 32 COMPILER='llvm compile').

cmsbuild commented 1 day ago

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-46543/42424

cmsbuild commented 1 day ago

Pull request #46543 was updated. @aloeliger, @cmsbuild, @epalencia can you please check and sign again.

aloeliger commented 1 day ago

please test

cmsbuild commented 1 day ago

-1

Failed Tests: ClangBuild Size: This PR adds an extra 44KB to repository Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-4cff68/42468/summary.html COMMIT: 730f8be9128c6fe097b620f920b5d551bbdedffe CMSSW: CMSSW_14_2_X_2024-10-30-1100/el8_amd64_gcc12 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/46543/42468/install.sh to create a dev area with all the needed externals and cmssw changes.

Clang Build

I found compilation warning while trying to compile with clang. Command used:

USER_CUDA_FLAGS='--expt-relaxed-constexpr' USER_CXXFLAGS='-Wno-register -fsyntax-only' scram build -k -j 32 COMPILER='llvm compile'

See details on the summary page.

cmsbuild commented 1 day ago

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-46543/42426

cmsbuild commented 1 day ago

Pull request #46543 was updated. @aloeliger, @cmsbuild, @epalencia can you please check and sign again.

aloeliger commented 1 day ago

I'm doing some reconsidering on all the stuff I'm ripping out, because there's a large amount of code that was dedicated to (ultimately unused) calculations of Calo Layer 1 variables that were leaking memory. I don't know whether these variables may find a use in the future, or at the very least, we dont want to lose any knowledge related to them, so I may add these back, but in a memory safe way.

aloeliger commented 1 day ago

I think this should be a little better.

cmsbuild commented 1 day ago

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-46543/42427

Code check has found code style and quality issues which could be resolved by applying following patch(s)

cmsbuild commented 1 day ago

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-46543/42428

aloeliger commented 1 day ago

please test

cmsbuild commented 1 day ago

Pull request #46543 was updated. @aloeliger, @epalencia can you please check and sign again.

cmsbuild commented 1 day ago

-1

Failed Tests: RelVals-INPUT Size: This PR adds an extra 40KB to repository Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-4cff68/42471/summary.html COMMIT: 131ef65b021a801913a3dff9d4d10f789edce0c4 CMSSW: CMSSW_14_2_X_2024-10-30-1100/el8_amd64_gcc12 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/46543/42471/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-INPUT

Comparison Summary

Summary:

mandrenguyen commented 1 day ago

ignore tests-rejected with ib-failure See https://github.com/cms-sw/cmssw/issues/46555

mmusich commented 19 hours ago

@aloeliger what's the status of this PR?

aloeliger commented 19 hours ago

@mmusich it "should" be fine because nothing really changed all that much, just went from pointers to smart pointers, so I could sign now with relative confidence, but I'd appreciate the chance to test it.

mmusich commented 18 hours ago

urgent

makortel commented 18 hours ago

Is this code run in any of the runTheMatrix workflows?

makortel commented 17 hours ago

In any case comparing the logs of 12834.0 step3 the allocated memory used at the process exit time decreases from 1681833112 bytes to 1681269224, i.e. by 551 kB, which is close to the expected reduction of 497 for 10 events. On the other hand, the relative reduction is 0.03 %, and I don't know if the MaxMemoryPreload report is that precise.

mmusich commented 17 hours ago

FWIW I've run 100k events out of the "tarball" reported from Tier0 (see description) and compared the RSS with / without this PR:

image

things seem to go in the right direction.

Dr15Jones commented 16 hours ago

So I also did a test using a trivial fast workflow (Neutrino beam as start) and used the MaxAllocMonitor before and after using this fix

image
mmusich commented 16 hours ago

@aloeliger, please address the review comments https://github.com/cms-sw/cmssw/pull/46543#discussion_r1824113109 and https://github.com/cms-sw/cmssw/pull/46543#discussion_r1824114005 at earliest convenience.

but I'd appreciate the chance to test it.

can you clarify what's left to test? Also we need urgently a backport to CMSSW_14_1_X to build a patch for the incoming PbPb operations (as early as next Monday).

mandrenguyen commented 15 hours ago

Indeed, I would like to build a new 14_1_X tonight. If we can get the backport of this today, would be great.

aloeliger commented 15 hours ago

My apologies for all the delays, I'm in the middle of moving from CERN back to the US. I wanted to just make sure CICADA emulation was the same before and after this PR but I think given the urgency I can forgo that. I'll make the requested changes when I'm back in front of computer, and make the backport. ~2 hours maybe?

aloeliger commented 14 hours ago

PR will be good to go after retesting after applying the suggestions.

aloeliger commented 14 hours ago

I'll prepare the backport immediately.

aloeliger commented 14 hours ago

@mandrenguyen Do we need backports to 14_0?

cmsbuild commented 14 hours ago

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-46543/42452

Code check has found code style and quality issues which could be resolved by applying following patch(s)

cmsbuild commented 14 hours ago

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-46543/42453

cmsbuild commented 14 hours ago

Pull request #46543 was updated. @aloeliger, @cmsbuild, @epalencia can you please check and sign again.

aloeliger commented 14 hours ago

please test

cmsbuild commented 9 hours ago

-1

Failed Tests: RelVals-INPUT Size: This PR adds an extra 20KB to repository Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-4cff68/42510/summary.html COMMIT: 0007e31239c4ff265cd985cd33e03ef94ce7f429 CMSSW: CMSSW_14_2_X_2024-10-30-2300/el8_amd64_gcc12 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/46543/42510/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-INPUT

Comparison Summary

Summary:

aloeliger commented 9 hours ago

+l1

@mmusich @mandrenguyen FYI

cmsbuild commented 9 hours ago

This pull request is fully signed and it will be integrated in one of the next master IBs (but tests are reportedly failing). This pull request will now be reviewed by the release team before it's merged. @mandrenguyen, @rappoccio, @antoniovilela, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2)

mandrenguyen commented 7 hours ago

ignore tests-rejected with ib-failure

mandrenguyen commented 7 hours ago

+1