cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.09k stars 4.32k forks source link

Parallelized offline vertexing reconstruction #46663

Open cericeci opened 1 week ago

cericeci commented 1 week ago

PR description:

This PR contains a first implementation of the parallelized offline vertexing using Alpaka. We intend this as a first PR to request feedback and comments. It includes contributions from @alexstrel that were squashed in during the porting to 14_2.

The content of the PR was previously discussed on: https://indico.cern.ch/event/1442046/#4-status-of-offline-vertexing

PR validation:

When running the code-checks there are several warnings such as:

Processing tmp/el9_amd64_gcc12/code-checks/RecoVertex/PrimaryVertexProducer_Alpaka/plugins/alpaka/PrimaryVertexProducer_Alpaka.cc.yaml Deleting: No Diagnostics found

which I'm not sure about their origin. Otherwise it properly compiles and (local) validation works. The added code should not touch any other packages.

cmsbuild commented 1 week ago

cms-bot internal usage

cmsbuild commented 1 week ago

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-46663/42589

Code check has found code style and quality issues which could be resolved by applying following patch(s)

cericeci commented 1 week ago

Sorry, seems like when I rebased this to 14_2 there was some mix up on my side with the validation branch. Is it ok to push on this branch or do I close and open a new PR from the proper one?

makortel commented 1 week ago

You can push to the same branch.

cmsbuild commented 1 day ago

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-46663/42721

cmsbuild commented 1 day ago

A new Pull Request was created by @cericeci for master.

It involves the following packages:

The following packages do not have a category, yet:

DataFormats/PortableVertex RecoVertex/PrimaryVertexProducer_Alpaka Please create a PR for https://github.com/cms-sw/cms-bot/blob/master/categories_map.py to assign category

@cmsbuild, @jfernan2, @mandrenguyen can you please review it and eventually sign? Thanks. @GiacomoSguazzoni, @VinInn, @VourMa, @dgulhan, @fabiocos, @martinamalberti, @missirol, @mmusich, @mtosi, @rovere this is something you requested to watch as well. @antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

cericeci commented 1 day ago

Thanks @makortel. As I had to rebase the proper branch to 14_2 anyhow I ended up updating cherrypicking changes by hand and updating this one. It is now the correct code

jfernan2 commented 1 day ago

assign heterogeneous

cmsbuild commented 1 day ago

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

jfernan2 commented 1 day ago

enable gpu

jfernan2 commented 1 day ago

please test

mmusich commented 1 day ago

RecoVertex/PrimaryVertexProducer_Alpaka (****)

do we need a new subsystem? Elsewhere alpaka-specific code was put in a alpaka sub-folder, for example RecoLocalTracker/SiPixelClusterizer/plugins/alpaka or RecoTracker/PixelSeeding/plugins/alpaka

fwyzard commented 1 day ago

This PR contains a first implementation of the parallelized offline vertexing using Alpaka. We intend this as a first PR to request feedback and comments. It includes contributions from @alexstrel that were squashed in during the porting to 14_2.

Thanks @cericeci !

Given that the commit history does not seem very relevant, could you squash all commits into a single one, and add

Co-authored-by: Alexei Strelchenko <astrel@fnal.gov>

at the bottom of the commit message ?

makortel commented 1 day ago

RecoVertex/PrimaryVertexProducer_Alpaka (****)

do we need a new subsystem? Elsewhere alpaka-specific code was put in a alpaka sub-folder, for example RecoLocalTracker/SiPixelClusterizer/plugins/alpaka or RecoTracker/PixelSeeding/plugins/alpaka

In this case the compilation time is very long (IIRC tens of minutes, for reasons not yet fully understood), so a separate package could make sense. The set of dependencies is also somewhat (about 50 %) different.

cmsbuild commented 1 day ago

+1

Size: This PR adds an extra 56KB to repository Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-f5bc58/42979/summary.html COMMIT: 86c21bde2e229d901721cc47dd239b5afe1ce1b3 CMSSW: CMSSW_14_2_X_2024-11-20-1100/el8_amd64_gcc12 Additional Tests: GPU User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/46663/42979/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

GPU Comparison Summary

Summary:

jfernan2 commented 1 day ago

enable profiling

fwyzard commented 1 day ago

do we need a new subsystem? Elsewhere alpaka-specific code was put in a alpaka sub-folder, for example RecoLocalTracker/SiPixelClusterizer/plugins/alpaka or RecoTracker/PixelSeeding/plugins/alpaka

In this case the compilation time is very long (IIRC tens of minutes, for reasons not yet fully understood), so a separate package could make sense. The set of dependencies is also somewhat (about 50 %) different.

Either way the name PrimaryVertexProducer_Alpaka is not good. If this needs to be a separate package, I would suggest PortablePrimaryVertexProducer.

makortel commented 1 day ago

Either way the name PrimaryVertexProducer_Alpaka is not good. If this needs to be a separate package, I would suggest PortablePrimaryVertexProducer.

That I fully agree (and kind of already commented in https://github.com/cms-sw/cmssw/pull/46663#discussion_r1850908433). In case of a separate package, I think the package name should match the (main) producer name. We have some existing use of a Portable postfix for EDModule names, but no existing use of Portable prefix.

makortel commented 1 day ago

@smuzaffar Do we have any monitoring of the compilation time of a PR?

makortel commented 1 day ago

enable profiling

@jfernan2 Just curious, why profiling? As the PR stands presently, the added code is not part of any configuration (but perhaps that should change).

smuzaffar commented 1 day ago

@smuzaffar Do we have any monitoring of the compilation time of a PR?

If PR job is still in jenkins then yes we can find the time it took for compilation otherwise we can only get the total time it took run the PR test job.

By the way, compilation time can vary depending on what other packages were checkout

makortel commented 1 day ago

@smuzaffar Would it be feasible to run the scram b -v -k -j 16 part e.g. through /usr/bin/time to have something in the compilation log? I understand the caveats, I'm mostly thinking of monitoring (not even catching) of "outrageously long" compilation times when those are known beforehand (at least for now).

(by the way, why do I see >> Compiling edm plugin src/RecoVertex/PrimaryVertexProducer/plugins/PrimaryVertexProducer.cc three times in the build log?)

smuzaffar commented 23 hours ago

@makortel , sure I can add /usr/bin/time for scram build. yes multiple >> Compiling edm plugin src/RecoVertex/PrimaryVertexProducer/plugins messages seems incorrect. I am looking in to it

smuzaffar commented 23 hours ago

@makortel , ok I think I know why we have duplicate compilation messages. Bot runs

COMPILATION_CMD="scram b vclean && BUILD_LOG=yes $USER_FLAGS scram b ${BUILD_VERBOSE} -k -j ${NCPU}"
eval $COMPILATION_CMD

so we get two compilation messages. One directly from scram and 2nd from the eval itself. scram actually compiles the sources once but due to the use of eval we see duplicates.

The third message is coming from https://github.com/cms-sw/cms-bot/blob/master/pr_testing/test_multiple_prs.sh#L1149 where we try to get any log messages where were in tmp logs files but were not printed ( in case of build failures).

I will update bot to avoid using eval

makortel commented 23 hours ago

Thanks @smuzaffar!

jfernan2 commented 22 hours ago

enable profiling

@jfernan2 Just curious, why profiling? As the PR stands presently, the added code is not part of any configuration (but perhaps that should change).

You are right @makortel I did not realize there is no configuration added into the PR, I was eager to find a candidate PR to test the new profiling script. I am sorry

jfernan2 commented 22 hours ago

enable none

smuzaffar commented 22 hours ago

so we get two compilation messages. One directly from scram and 2nd from the eval itself. scram actually compiles the sources once but due to the use of eval we see duplicates.

ah its not eval but BUILD_LOG=yes flag which build rule uses to capture all logs in a file first and print once the product compilation is done. There was a bug that with BUILD_LOG=yes these log files were sent to stdout at the end of compilation again.

https://github.com/cms-sw/cmsdist/pull/9526 should fix the duplicate build messages

cmsbuild commented 1 hour ago

Milestone for this pull request has been moved to CMSSW_15_0_X. Please open a backport if it should also go in to CMSSW_14_2_X.