cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/

Optimize new Energy Correlation Functions from fastjet #20558

Closed: perrotta closed this issue 10 months ago

perrotta commented 7 years ago

Integration of PR #19601 led to a sizeable increase in computation time, quantified as follows for the new modules:

TimeReport   0.044377     0.044377     0.044377  nb1AK8PuppiSoftDrop
TimeReport   0.011530     0.011530     0.011530  nb1AK8PuppiSoftDropSubjets
TimeReport   0.043666     0.043666     0.043666  nb2AK8PuppiSoftDrop
TimeReport   0.011512     0.011512     0.011512  nb2AK8PuppiSoftDropSubjets

(See discussions in the thread of that PR).

It is expected that the energy correlator for three particles can be tuned with a tighter pt cut, thereby reducing the largest CPU consumer in the fastjet algorithm (see https://github.com/cms-sw/cmssw/pull/19601#issuecomment-329796215). However, some physics performance study is needed before implementing that cut in the code.
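For concreteness, a minimal sketch of what such a gate could look like (the names below are hypothetical and this is not the actual ECFAdder code; the threshold value is the one tried later in this thread and still needs physics validation):

#include <vector>

// Hypothetical declaration standing in for the expensive fastjet
// EnergyCorrelator call; not a real CMSSW/fastjet function.
double computeN3(const std::vector<double>& constituentPt);

// Sketch only: compute the 3-point correlator only for jets above a
// configurable pt threshold, and return a sentinel value otherwise.
double n3OrDefault(double jetPt,
                   const std::vector<double>& constituentPt,
                   double minPtForN3 = 250.0 /* GeV; e.g. the value tried below */) {
  if (jetPt <= minPtForN3)
    return -1.0;  // sentinel for "not computed"
  return computeN3(constituentPt);  // the O(N^3) part only runs for hard jets
}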

Another possible way of reducing the CPU time could be to run the common parts of the different modules only once: it is not yet clear how, but this possibility should also be explored.

While signing the pull request with the new energy correlation functions for reco, I open this issue to track the efforts aimed at a more appropriate tuning of the algorithms, so that they are not forgotten.

@rappoccio @slava77

perrotta commented 7 years ago

assign reco

cmsbuild commented 7 years ago

A new Issue was created by @perrotta.

@davidlange6, @Dr15Jones, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

Dr15Jones commented 7 years ago

assign reconstruction

cmsbuild commented 7 years ago

New categories assigned: reconstruction

@slava77,@perrotta you have been requested to review this Pull request/Issue and eventually sign? Thanks

rappoccio commented 7 years ago

By only running the N3 variable if the jet pt satisfies pt > 250 GeV, I get these timings:

TimeReport   0.025502     0.025502     0.025502  nb1AK8PuppiSoftDrop
TimeReport   0.011759     0.011759     0.011759  nb1AK8PuppiSoftDropSubjets
TimeReport   0.025428     0.025428     0.025428  nb2AK8PuppiSoftDrop
TimeReport   0.011685     0.011685     0.011685  nb2AK8PuppiSoftDropSubjets

So an additional factor of two improvement. This was tested over 1000 events of workflow 10024.0. These are each about 2x as expensive as a new jet collection:

TimeReport   0.012430     0.012430     0.012430  ak4PFJets
TimeReport   0.012290     0.012290     0.012290  ak4PFJetsCHS
TimeReport   0.011505     0.011505     0.011505  ak4PFJetsCHSNoHF 
TimeReport   0.012218     0.012218     0.012218  ak4PFJetsPuppi
TimeReport   0.015820     0.015820     0.015820  ak8PFJetsCHS
TimeReport   0.015537     0.015537     0.015537  ak8PFJetsPuppi

and around the same as the inclusiveCandidateVertexFinder:

TimeReport   0.019335     0.019335     0.019335  inclusiveCandidateVertexFinder
TimeReport   0.019560     0.019560     0.019560  inclusiveCandidateVertexFinderCvsL

and the HPS tau primary vertex producer:

TimeReport   0.019633     0.019633     0.019633  hpsPFTauPrimaryVertexProducer

perrotta commented 7 years ago

Thank you @rappoccio. I suppose it was the 10224 wf instead (TTbar with PU, the one we used for reference in the PR): could you please confirm? Now one should make sure that the pt = 250 GeV cut is OK for physics; then a PR with it can be submitted.

rappoccio commented 7 years ago

Oh, bother, I have indeed been using 10024, not 10224. That may explain the differences we have been seeing!

The pt > 250 GeV cut is okay for physics because this cut won't be relevant until top quarks start to merge, typically around pt > 400 GeV. This is already (probably too) conservative, to be able to catch the corner cases when the groomed pt is much below the ungroomed threshold of pt > 400 GeV.

rappoccio commented 7 years ago

With 10224 over 10 events, I get these timings:

TimeReport   0.002867     0.002867     0.002867  nb1AK8PuppiSoftDrop
TimeReport   0.008370     0.008370     0.008370  nb1AK8PuppiSoftDropSubjets
TimeReport   0.002594     0.002594     0.002594  nb2AK8PuppiSoftDrop
TimeReport   0.008307     0.008307     0.008307  nb2AK8PuppiSoftDropSubjets

I don't understand why the timing of the two WFs should be much different.

slava77 commented 6 years ago

assign upgrade

In PU200 phase-2 workflows the nb[12] modules take a time similar to the total reco time. In my test of 100 events the average over the 4 modules is 100 s/event excluding one outlier event; including that event, the average goes up to 340 s/event (that single event took 6 hours). Even without the outlier, this is more than the totality of tracking.

"luckily" enough the #19601 didn't make it to 93X.

@rappoccio @kpedro88 you may want to disable this in phase-2 workflows. It's worth checking pp (e.g. JetHT) around PU50 to confirm that we are not paying for these correlation variables as much as tracking.

cmsbuild commented 6 years ago

New categories assigned: upgrade

@kpedro88 you have been requested to review this Pull request/Issue and eventually sign? Thanks

slava77 commented 6 years ago

BTW, I already disabled these in pp_on_XeXe

kpedro88 commented 6 years ago

@slava77 thanks for checking this. This seems to fall into the general category of my complaints that fastjet is too slow. I'll try to find some time to look at the code in case there are any obvious optimizations.

Can you run igprof pp?

rappoccio commented 6 years ago

For now we will have to disable it. I will come up with some better solution ASAP.

slava77 commented 6 years ago

On 10/18/17 7:39 AM, Kevin Pedro wrote:

@slava77 thanks for checking this. This seems to fall into the general category of my complaints that fastjet is too slow. I'll try to find some time to look at the code in case there are any obvious optimizations.

Can you run igprof pp?

Sal ran the profiler and posted results earlier. Perhaps a copy of that could be added here. IIRC, almost all of the cost was in the external function calls. I have not investigated whether this can be reduced by passing data to the external code or making the calls in a slightly different way.


rappoccio commented 6 years ago

I think we are just running into the O(N^3) nature of N3. I have contacted the authors to see if we can perhaps try some other methods (perhaps preclustering with some small R to reduce the number of inputs?)
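For reference (the standard definition, not quoted from the PR), the 3-point correlator sums over all triplets of jet constituents, which is where the cubic cost comes from; the N3 observable computed in these modules is built from ratios of such correlators in their generalized form:

$$\mathrm{ECF}(3,\beta) \;=\; \sum_{i<j<k\,\in\,J} p_{T,i}\, p_{T,j}\, p_{T,k}\, \left(\Delta R_{ij}\, \Delta R_{ik}\, \Delta R_{jk}\right)^{\beta}$$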

kpedro88 commented 6 years ago

@rappoccio I see they're using caching of the input factors. However, it might be possible to do further caching to reduce repeated multiplications. I wrote a quick test program to print out all the computations done in the N3 loop:

ans_ij = energyStore[2] * energyStore[1] * angleStore[2][1];
answer += ans_ij * energyStore[0] * angleStore[2][0] * angleStore[1][0];
ans_ij = energyStore[3] * energyStore[1] * angleStore[3][1];
answer += ans_ij * energyStore[0] * angleStore[3][0] * angleStore[1][0];
...

You can see that the multiplication energyStore[0] * ... * angleStore[1][0] comes up more than once. This corresponds in general to energyStore[k] * ... * angleStore[j][k]. If this were precomputed alongside the angleStore 2D vector, it might save some CPU time. (At a cost of some memory, but I don't think we're memory-limited in this algorithm right now.)
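A minimal sketch of that precomputation, assuming the triple-loop structure implied by the printed lines (this is not the actual fastjet-contrib code; names follow the snippet above):

#include <cstddef>
#include <vector>

// Sketch only: energyAngleProd[j][k] = energyStore[k] * angleStore[j][k] is
// filled once, so each innermost iteration needs one fewer multiplication.
double ecfN3_cached(const std::vector<double>& energyStore,
                    const std::vector<std::vector<double>>& angleStore) {
  const std::size_t n = energyStore.size();
  std::vector<std::vector<double>> energyAngleProd(n, std::vector<double>(n, 0.));
  for (std::size_t j = 1; j < n; ++j)
    for (std::size_t k = 0; k < j; ++k)
      energyAngleProd[j][k] = energyStore[k] * angleStore[j][k];

  double answer = 0.;
  for (std::size_t i = 2; i < n; ++i) {
    for (std::size_t j = 1; j < i; ++j) {
      // same value as energyStore[i] * energyStore[j] * angleStore[i][j]
      const double ans_ij = energyStore[i] * energyAngleProd[i][j];
      for (std::size_t k = 0; k < j; ++k)
        answer += ans_ij * angleStore[i][k] * energyAngleProd[j][k];
    }
  }
  return answer;
}

This trades one multiplication per innermost iteration for O(n^2) extra memory.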

kpedro88 commented 6 years ago

Also, I think std::vector< std::vector<double> > angleStore; is not the best for memory locality. One could try to make it a one-dimensional vector accessed by a function of the two-dimensional indices.
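For illustration (a hedged sketch, not the fastjet-contrib code), a flat lower-triangular layout could look like this:

#include <cstddef>
#include <vector>

// Sketch only: one contiguous buffer for the lower-triangular pair table,
// accessed through a small index helper instead of nested vectors.
struct AngleStoreFlat {
  std::size_t n;
  std::vector<double> data;  // one slot per pair (i, j) with i > j

  explicit AngleStoreFlat(std::size_t n_) : n(n_), data(n_ * (n_ - 1) / 2, 0.) {}

  // row i starts at offset i*(i-1)/2; valid for i > j
  double& operator()(std::size_t i, std::size_t j) { return data[i * (i - 1) / 2 + j]; }
  double operator()(std::size_t i, std::size_t j) const { return data[i * (i - 1) / 2 + j]; }
};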

kpedro88 commented 6 years ago

(For reference, the earlier igprof report is here: https://github.com/cms-sw/cmssw/pull/19601#issuecomment-329796215, and the source code is here: https://fastjet.hepforge.org/trac/browser/contrib/contribs/EnergyCorrelator/trunk/EnergyCorrelator.cc#L106.)

rappoccio commented 6 years ago

Do you think this will improve more than O(1)? It's probably okay, but I think somehow reducing N may be a good check. I will try to either calculate with only the first, say, 20 particles sorted by pt, or else we can precluster with some very small jet radius (although the former is probably better).
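A hedged sketch of the first option (not the actual producer code), truncating to the hardest constituents with plain fastjet calls; preclustering with a small-R kt ClusterSequence would be the alternative mentioned above:

#include <cstddef>
#include <vector>
#include "fastjet/PseudoJet.hh"

// Sketch only: keep the maxN hardest constituents before passing them to the
// EnergyCorrelator, so the cubic cost scales with maxN instead of the full
// constituent multiplicity.
std::vector<fastjet::PseudoJet> leadingConstituents(const std::vector<fastjet::PseudoJet>& constituents,
                                                    std::size_t maxN = 20) {
  std::vector<fastjet::PseudoJet> byPt = fastjet::sorted_by_pt(constituents);
  if (byPt.size() > maxN)
    byPt.resize(maxN);
  return byPt;
}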

kpedro88 commented 6 years ago

In a triple-nested loop, reducing computation and improving memory locality can make a significant difference.

Reducing N will of course help, but I always try to improve code before making physics approximations...

rappoccio commented 6 years ago

Ok will have to look at it ASAP.

kpedro88 commented 6 years ago

A very quick and dirty test of precomputing energyStore[k] * angleStore[j][k] into a 2D vector energyAngleProd shows a 16% reduction in CPU usage with 1000 input particles (just generating random values for the energies and angles).

rappoccio commented 6 years ago

OK, but we need factors of 100 or more.

kpedro88 commented 6 years ago

The memory locality improvement reduces CPU usage by ~20%. Combined, both improvements reduce CPU usage by ~40%.

This is just in the single scenario I'm testing very quickly, so the absolute improvement may change depending on the number of particles. But I think a 40% improvement is nothing to sneer at and would be appreciated by all users of this code, even if it isn't enough to solve the performance problem at 200PU.

slava77 commented 6 years ago

On 10/18/17 9:49 AM, Kevin Pedro wrote:

The memory locality improvement reduces CPU usage by ~20%. Combined, both improvements reduce CPU usage by ~40%.

Try also double -> float, unless the code needs to add/subtract huge numbers with small ones.

This is just in the single scenario I'm testing very quickly, so the absolute improvement may change depending on the number of particles. But I think a 40% improvement is nothing to sneer at and would be appreciated by all users of this code, even if it isn't enough to solve the performance problem at 200PU.


kpedro88 commented 6 years ago

I was just thinking about that. The CPU reduction from double->float in this case only seems to be ~1-2%, though.

slava77 commented 6 years ago

On 10/18/17 10:01 AM, Kevin Pedro wrote:

I was just thinking about that. The CPU reduction from double->float in this case only seems to be ~1-2%, though.

Strange. Did your test change everything to floats? If the rest of the math stays in double, there is an extra cost for the double-float conversions.


rappoccio commented 6 years ago

Oh, not saying we shouldn't do it, just that we need to keep digging ;)

kpedro88 commented 6 years ago

@slava77 I converted all the variables/values in the code from doubles to floats. It just doesn't seem to make much of a difference. Here's a Dropbox folder with my various test files and a script to run them: https://www.dropbox.com/sh/1t8kbflin9hmknl/AADW4HTdbNF8bd2zimCkap7Na?dl=0

rappoccio commented 6 years ago

The authors are also looking into this.

rappoccio commented 6 years ago

Implemented in https://github.com/cms-externals/fastjet-contrib/pull/13. I think this issue can be closed for now.

slava77 commented 6 years ago

Implemented in cms-externals/fastjet-contrib#13. I think this issue can be closed for now.

I'd keep it open until we can actually enable it in the Phase-2 workflows.

slava77 commented 5 years ago

I'd keep it open until we can actually enable it in the Phase-2 workflows.

@rappoccio @kpedro88 where are we with re-enabling ECFs in phase-2?

kpedro88 commented 5 years ago

My suggestions earlier in this thread were implemented by the fastjet-contrib authors. This made the ECF calculation much faster, but I doubt it is fast enough to run on the full collection of jet constituents in 200PU. I do not think @rappoccio's suggestion of reducing the number of constituents has been tested in CMSSW yet.

srimanob commented 3 years ago

Do we need to make progress on this, or are we waiting for something? Testing nb[1,2] on Phase-2 PU200 with 12_1_0_pre3 seems OK: the timing is not significant compared to RECO.

Thanks.

slava77 commented 3 years ago

Are the ECFs still widely used? It seemed like a popular topic back in 2017, but I'm not sure where they stand now. @rappoccio do you know?

kpedro88 commented 3 years ago

ECFs are definitely still used. It would be useful to do a clear benchmark of the speed of this calculation in different releases with 200PU before closing this issue.

srimanob commented 2 years ago

Somehow, this issue was missing on my to-do list.

Testing the PAT step using 100 ttbar events of PU200 from the 12_3_0_pre5 RelVals. In general, the time/event increases by 0.4 s. Note that the test uses a single-core job.

Disable:

    <Metric Name="AvgEventTime" Value="4.34709"/>
    <Metric Name="MaxEventTime" Value="30.3578"/>
    <Metric Name="MinEventTime" Value="2.54537"/>
TimeReport   0.000012     0.000012     0.000012  nb1AK8PuppiSoftDrop
TimeReport   0.000023     0.000023     0.000023  nb1AK8PuppiSoftDropSubjets
TimeReport   0.000007     0.000007     0.000007  nb2AK8PuppiSoftDrop
TimeReport   0.000008     0.000008     0.000008  nb2AK8PuppiSoftDropSubjets

And when I enable:

    <Metric Name="AvgEventTime" Value="4.38679"/>
    <Metric Name="MaxEventTime" Value="29.8561"/>
    <Metric Name="MinEventTime" Value="2.56959"/>
TimeReport   0.020377     0.020377     0.020377  nb1AK8PuppiSoftDrop
TimeReport   0.006059     0.006059     0.006059  nb1AK8PuppiSoftDropSubjets
TimeReport   0.019707     0.019707     0.019707  nb2AK8PuppiSoftDrop
TimeReport   0.005561     0.005561     0.005561  nb2AK8PuppiSoftDropSubjets

Should we do something else (e.g. igprof -pp) to benchmark this? I don't know at which level we will say it is good enough. It seems to be a big improvement compared to the TimeReport in the original report.

kpedro88 commented 2 years ago

@srimanob these algorithms currently take no time because they're effectively disabled for phase 2: https://github.com/cms-sw/cmssw/blob/6d2f66057131baacc2fcbdd203588c41c885b42c/PhysicsTools/PatAlgos/python/slimming/applySubstructure_cff.py#L89-L110

This of course means that their output is not available for phase 2 studies, which we'd ultimately like to change.

srimanob commented 2 years ago

Hi @kpedro88, I enabled it before measuring and compared with the default (disabled) configuration. That is the result under "Enable". I've updated my comment above to make it clear.

kpedro88 commented 2 years ago

Ah, sorry, I missed that.

I guess we probably don't still have 200PU relvals from 9_3_X to make sure it's an apples-to-apples comparison, but the current performance you report looks reasonable.

srimanob commented 2 years ago

PdmV has the policy of keeping the last relvals. We still have it on FNAL disk, as I see from DAS: site dataset=/RelValTTbar_14TeV/CMSSW_9_3_9_patch1-PU25ns_93X_upgrade2023_realistic_v5_2023D17PU200-v1/GEN-SIM-RECO. Checking it.

srimanob commented 2 years ago

It's a bit complicated to run 9_3 RAW on 9_4, so I made 100 events of ttbar D17 PU200 in 9_4 and measured. Here is the result:

Disable (default):

    <Metric Name="AvgEventTime" Value="16.4811"/>
    <Metric Name="MaxEventTime" Value="30.9948"/>
    <Metric Name="MinEventTime" Value="9.65991"/>

Enable:

    <Metric Name="AvgEventTime" Value="40.6082"/>
    <Metric Name="MaxEventTime" Value="2439.59"/>
    <Metric Name="MinEventTime" Value="9.65263"/>
TimeReport   8.360630     8.360630     8.360630  nb1AK8PuppiSoftDrop
TimeReport   3.786138     3.786138     3.786138  nb1AK8PuppiSoftDropSubjets
TimeReport   8.355339     8.355339     8.355339  nb2AK8PuppiSoftDrop
TimeReport   3.784009     3.784009     3.784009  nb2AK8PuppiSoftDropSubjets

It seems like a lot. However, if we look at the top-20 module times per event, we see clearly that there is one very heavy event and only a few other heavy events.

TimeModule> 14 1 nb1AK8PuppiSoftDrop ECFAdder 832.261
TimeModule> 57 1 nb1AK8PuppiSoftDrop ECFAdder 1.58668
TimeModule> 44 1 nb1AK8PuppiSoftDrop ECFAdder 1.10515
TimeModule> 12 1 nb1AK8PuppiSoftDrop ECFAdder 0.695914
TimeModule> 53 1 nb1AK8PuppiSoftDrop ECFAdder 0.243108
TimeModule> 37 1 nb1AK8PuppiSoftDrop ECFAdder 0.070987
TimeModule> 5 1 nb1AK8PuppiSoftDrop ECFAdder 0.068325
TimeModule> 19 1 nb1AK8PuppiSoftDrop ECFAdder 0.0225348
TimeModule> 77 1 nb1AK8PuppiSoftDrop ECFAdder 0.00560379
TimeModule> 93 1 nb1AK8PuppiSoftDrop ECFAdder 0.00252008
TimeModule> 58 1 nb1AK8PuppiSoftDrop ECFAdder 0.000473976
TimeModule> 2 1 nb1AK8PuppiSoftDrop ECFAdder 9.29832e-05
TimeModule> 4 1 nb1AK8PuppiSoftDrop ECFAdder 5.88894e-05
TimeModule> 78 1 nb1AK8PuppiSoftDrop ECFAdder 4.69685e-05
TimeModule> 87 1 nb1AK8PuppiSoftDrop ECFAdder 2.47955e-05
TimeModule> 17 1 nb1AK8PuppiSoftDrop ECFAdder 2.28882e-05
TimeModule> 80 1 nb1AK8PuppiSoftDrop ECFAdder 2.19345e-05
TimeModule> 90 1 nb1AK8PuppiSoftDrop ECFAdder 2.00272e-05
TimeModule> 61 1 nb1AK8PuppiSoftDrop ECFAdder 1.90735e-05
TimeModule> 92 1 nb1AK8PuppiSoftDrop ECFAdder 1.81198e-05

Compared to what I see in 12_3_0_pre5 (in the comment above), it seems the time per event is in general larger than in 9_4.

TimeModule> 832 1 nb1AK8PuppiSoftDrop ECFAdder 0.0858278
TimeModule> 835 1 nb1AK8PuppiSoftDrop ECFAdder 0.0790851
TimeModule> 840 1 nb1AK8PuppiSoftDrop ECFAdder 0.051811
TimeModule> 878 1 nb1AK8PuppiSoftDrop ECFAdder 0.048425
TimeModule> 822 1 nb1AK8PuppiSoftDrop ECFAdder 0.042285
TimeModule> 888 1 nb1AK8PuppiSoftDrop ECFAdder 0.041904
TimeModule> 841 1 nb1AK8PuppiSoftDrop ECFAdder 0.0408938
TimeModule> 802 1 nb1AK8PuppiSoftDrop ECFAdder 0.0406702
TimeModule> 845 1 nb1AK8PuppiSoftDrop ECFAdder 0.040055
TimeModule> 898 1 nb1AK8PuppiSoftDrop ECFAdder 0.0378761
TimeModule> 828 1 nb1AK8PuppiSoftDrop ECFAdder 0.037765
TimeModule> 839 1 nb1AK8PuppiSoftDrop ECFAdder 0.037328
TimeModule> 879 1 nb1AK8PuppiSoftDrop ECFAdder 0.0365169
TimeModule> 874 1 nb1AK8PuppiSoftDrop ECFAdder 0.036432
TimeModule> 852 1 nb1AK8PuppiSoftDrop ECFAdder 0.0336301
TimeModule> 871 1 nb1AK8PuppiSoftDrop ECFAdder 0.0335531
TimeModule> 876 1 nb1AK8PuppiSoftDrop ECFAdder 0.033473
TimeModule> 817 1 nb1AK8PuppiSoftDrop ECFAdder 0.0329709
TimeModule> 803 1 nb1AK8PuppiSoftDrop ECFAdder 0.032969
TimeModule> 881 1 nb1AK8PuppiSoftDrop ECFAdder 0.0327802

It is maybe not an apples-to-apples comparison in this case, considering the different geometry and reco. Maybe what we should do is measure the PAT step on the PU200 RelVals (9k events) and see whether the timing is more stable and can be accepted or not.

srimanob commented 1 year ago

+Upgrade

https://github.com/cms-sw/cmssw/pull/42158 is merged.

smuzaffar commented 10 months ago

@perrotta @cms-sw/reconstruction-l2 , was https://github.com/cms-sw/cmssw/pull/42158 enough to address this issue? Can we close it?

perrotta commented 10 months ago

@perrotta @cms-sw/reconstruction-l2 , was #42158 enough to address this issue? Can we close it?

I would say so, "+1" from me