cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

reduce ML inference time in b-tag related jet taggers #25230

Closed: slava77 closed this issue 3 years ago

slava77 commented 5 years ago

with #24918 included, b-tag-related ML inference is at around 20% of the miniAOD processing time

Here are the averages (excluding peak [initialization?] time) from 1K events in workflow 136.8311 running with this PR on top of 10_4_0_pre1

   0.33 pfDeepCSVJetTagsAK8PFPuppiSoftDropSubjets          DeepFlavourJetTagsProducer
   0.41            pfDeepCSVJetTagsAK8Puppi          DeepFlavourJetTagsProducer 
   0.94               pfDeepCSVJetTagsPuppi          DeepFlavourJetTagsProducer
  16.01 pfDeepDoubleCvLJetTagsSlimmedAK8DeepTags        DeepDoubleXTFJetTagsProducer
  16.09 pfDeepDoubleBvLJetTagsSlimmedAK8DeepTags        DeepDoubleXTFJetTagsProducer
  16.10 pfDeepDoubleCvBJetTagsSlimmedAK8DeepTags        DeepDoubleXTFJetTagsProducer
  16.20 pfMassIndependentDeepDoubleCvLJetTagsSlimmedAK8DeepTags        DeepDoubleXTFJetTagsProducer
  16.58 pfMassIndependentDeepDoubleCvBJetTagsSlimmedAK8DeepTags        DeepDoubleXTFJetTagsProducer
  28.86 pfDeepBoostedJetTagsSlimmedAK8DeepTags          DeepBoostedJetTagsProducer
  29.72 pfMassDecorrelatedDeepBoostedJetTagsSlimmedAK8DeepTags          DeepBoostedJetTagsProducer
 153.85 pfDeepFlavourJetTagsSlimmedDeepFlavour        DeepFlavourTFJetTagsProducer
Total from above: 295.09 ms/ev
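As a quick sanity check, summing the per-module averages from the table reproduces the quoted total (a minimal Python sketch; the numbers are copied verbatim from the table above):

```python
# Per-module average times (ms/ev) copied from the table above.
times = {
    "pfDeepCSVJetTagsAK8PFPuppiSoftDropSubjets": 0.33,
    "pfDeepCSVJetTagsAK8Puppi": 0.41,
    "pfDeepCSVJetTagsPuppi": 0.94,
    "pfDeepDoubleCvLJetTagsSlimmedAK8DeepTags": 16.01,
    "pfDeepDoubleBvLJetTagsSlimmedAK8DeepTags": 16.09,
    "pfDeepDoubleCvBJetTagsSlimmedAK8DeepTags": 16.10,
    "pfMassIndependentDeepDoubleCvLJetTagsSlimmedAK8DeepTags": 16.20,
    "pfMassIndependentDeepDoubleCvBJetTagsSlimmedAK8DeepTags": 16.58,
    "pfDeepBoostedJetTagsSlimmedAK8DeepTags": 28.86,
    "pfMassDecorrelatedDeepBoostedJetTagsSlimmedAK8DeepTags": 29.72,
    "pfDeepFlavourJetTagsSlimmedDeepFlavour": 153.85,
}
total = round(sum(times.values()), 2)
print(total)  # 295.09
```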

This issue is to keep track of the progress, aiming for a reduction. The next BTV-related jet-tag update should not increase the total time.

@ferencek @mverzett @kskovpen @andrzejnovak @jmduarte @hqucms

slava77 commented 5 years ago

assign reconstruction

cmsbuild commented 5 years ago

New categories assigned: reconstruction

@slava77, @perrotta you have been requested to review this Pull request/Issue and eventually sign. Thanks.

cmsbuild commented 5 years ago

A new Issue was created by @slava77 Slava Krutelyov.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.


kpedro88 commented 5 years ago

For miniAOD, one way to reduce the total time used by ML tagging algorithms is to run them only on the jets for which the results will actually be used/stored. This is relevant for AK8 jets, since the low-pt ones are not stored. My understanding is that DeepAK8 has an enforced pT cut in the tagger, while DeepDoubleB does not, but maybe this is out of date.
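The idea of running inference only where the output will be stored can be sketched as a pT preselection in front of the expensive tagger call (a hypothetical illustration; `run_tagger`, the default value, and the 170 GeV threshold are placeholders, not the actual miniAOD configuration):

```python
# Hypothetical sketch: run the (expensive) ML tagger only on jets above
# the storage threshold; jets below it get a default discriminator value.
PT_MIN = 170.0  # placeholder threshold, not the real miniAOD cut

def run_tagger(jet):
    # Stand-in for the real ML inference call.
    return 0.5

def tag_jets(jets, pt_min=PT_MIN):
    """Return per-jet discriminator values, skipping low-pt jets."""
    return [run_tagger(j) if j["pt"] >= pt_min else -1.0 for j in jets]

jets = [{"pt": 450.0}, {"pt": 90.0}, {"pt": 210.0}]
print(tag_jets(jets))  # [0.5, -1.0, 0.5]
```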

Tagging a few more people: @gouskos @rappoccio

rappoccio commented 5 years ago

That's a good point. The same preselection could be used: run this only on the constituents of a preclustered collection with a pt cut on the jet.

kpedro88 commented 5 years ago

Also, @violatingcp found that for CNN inference, TensorFlow v1.10 is 2.4x faster than TensorFlow v1.6 (which we're currently using).

slava77 commented 5 years ago

On 11/15/18 8:23 AM, Kevin Pedro wrote:

Also, @violatingcp found that for CNN inference, TensorFlow v1.10 is 2.4x faster than TensorFlow v1.6 (which we're currently using).

Is this based on similarly built version? Recall that the CMSSW production version is compiled with SSE3 and does not use AVX/AVX512 instructions.
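Whether a given machine advertises the instruction sets in question can be checked against the CPU flags from /proc/cpuinfo (a small sketch; a sample flags string is passed in explicitly so the snippet also runs where that file does not exist):

```python
# Sketch: check whether a cpuinfo 'flags' line advertises a given
# instruction set. On Linux the real line comes from /proc/cpuinfo.

def cpu_has_flag(flag, flags_line):
    """True if `flag` appears as a whole token in a cpuinfo 'flags' line."""
    return flag in flags_line.split()

sample = "flags : fpu sse sse2 sse3 ssse3 sse4_1 sse4_2 avx"
print(cpu_has_flag("sse3", sample))  # True  (the CMSSW build target)
print(cpu_has_flag("avx", sample))   # True
print(cpu_has_flag("avx2", sample))  # False
```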

violatingcp commented 5 years ago

Ah, so the TF v1.6 numbers used the standard CMSSW build. Indeed, it looks like the TF version I am using has AVX enabled, but not AVX2. I can disable it and re-run.

slava77 commented 5 years ago

@ferencek @mverzett @kskovpen @andrzejnovak @jmduarte @hqucms was there any significant progress on this topic of speeding up ML b-tags in the recent months? Please comment. Thank you.

mverzett commented 5 years ago

Not really, we are looking into ways to reduce the input size and eventually the graph size, but we are a long way from a usable solution.

hqucms commented 5 years ago

@slava77 In some recent studies, we found a significant difference (4-5x) in inference speed between standalone MXNet and MXNet in CMSSW. After some investigation, the cause seems to be the linking against the BLAS library: the standalone MXNet is statically linked to OpenBLAS, while the one in CMSSW, despite being dynamically linked to OpenBLAS [0], does not seem to call into OpenBLAS when running inside cmsRun.

A simple test to show this:

[Test setup]
Release: CMSSW_10_6_1 on lxplus7
Python cfg: RecoBTag/MXNet/test/test_deep_boosted_jet_cfg.py
Input file: /store/mc/RunIIFall17MiniAODv2/ZprimeToTT_M1000_W10_TuneCP2_13TeV-madgraphMLM-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14_ext1-v1/90000/0C1D564B-DB28-E911-912C-0CC47A4D76D0.root (copied to local)

> cmsRun test_deep_boosted_jet_cfg.py inputFiles=file:TT-M1000.root maxEvents=1000 >& orig.log &                                                                                                            
TimeReport   0.020148     0.020148     0.020148  pfDeepBoostedJetTags
TimeReport   0.018684     0.018684     0.018684  pfMassDecorrelatedDeepBoostedJetTags

> LD_PRELOAD=/cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/OpenBLAS/0.3.5/lib/libopenblas.so cmsRun test_deep_boosted_jet_cfg.py inputFiles=file:TT-M1000.root maxEvents=1000 >& preload_openblas_cmssw.log &
TimeReport   0.005091     0.005091     0.005091  pfDeepBoostedJetTags
TimeReport   0.004962     0.004962     0.004962  pfMassDecorrelatedDeepBoostedJetTags

When using LD_PRELOAD we saw a speedup of ~4x, roughly consistent with standalone MXNet now.
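The quoted ~4x is consistent with the TimeReport numbers above (a quick arithmetic check; the per-event times in seconds are copied from the logs):

```python
# Per-event times (s) from the TimeReport lines above.
orig = {"pfDeepBoostedJetTags": 0.020148,
        "pfMassDecorrelatedDeepBoostedJetTags": 0.018684}
preload = {"pfDeepBoostedJetTags": 0.005091,
           "pfMassDecorrelatedDeepBoostedJetTags": 0.004962}

for name in orig:
    print(f"{name}: {orig[name] / preload[name]:.2f}x")
# pfDeepBoostedJetTags: 3.96x
# pfMassDecorrelatedDeepBoostedJetTags: 3.77x
```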

(BTW, this still leaves a ~20-30% difference, likely because the CMSSW OpenBLAS is compiled with DYNAMIC_ARCH=0 and therefore cannot exploit newer instructions, e.g. AVX, available on the machine -- maybe we should also consider enabling DYNAMIC_ARCH?)

A search in the external libraries reveals that there are two other libraries that provide the same cblas functions as OpenBLAS:

/cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_1/external/slc7_amd64_gcc700/lib/libcblas.so
0000000000007c44 T cblas_sgemm

/cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_1/external/slc7_amd64_gcc700/lib/libgslcblas.so
0000000000002c10 T cblas_sgemm

/cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_1/external/slc7_amd64_gcc700/lib/libopenblas.so
000000000008e010 T cblas_sgemm

It seems that one of them is loaded before OpenBLAS and is therefore hiding the OpenBLAS symbols. Any suggestion on how to fix this?
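One way to see which shared object currently wins the symbol resolution is to ask the dynamic linker via `dladdr` (a diagnostic sketch using ctypes, Linux only; `printf` is used as a stand-in here because `cblas_sgemm` only resolves inside a process that has a BLAS library loaded):

```python
import ctypes

class Dl_info(ctypes.Structure):
    # Mirrors the Dl_info struct filled in by dladdr().
    _fields_ = [("dli_fname", ctypes.c_char_p),
                ("dli_fbase", ctypes.c_void_p),
                ("dli_sname", ctypes.c_char_p),
                ("dli_saddr", ctypes.c_void_p)]

_proc = ctypes.CDLL(None)  # the whole process image, global symbol scope
_proc.dladdr.argtypes = [ctypes.c_void_p, ctypes.POINTER(Dl_info)]
_proc.dladdr.restype = ctypes.c_int

def providing_library(symbol):
    """Path of the shared object that currently resolves `symbol`
    in this process, or None if the symbol is not loaded."""
    try:
        addr = ctypes.cast(getattr(_proc, symbol), ctypes.c_void_p)
    except AttributeError:
        return None
    info = Dl_info()
    if _proc.dladdr(addr, ctypes.byref(info)) and info.dli_fname:
        return info.dli_fname.decode()
    return None

# In a cmsRun process one would query "cblas_sgemm" instead.
print(providing_library("printf"))  # e.g. /lib64/libc.so.6
```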

Thank you!

[0]

> ldd /cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/mxnet-predict/1.2.1-pafccj2/lib/libmxnetpredict.so 
        linux-vdso.so.1 =>  (0x00007ffd9af8a000)
        libopenblas.so.0 => /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_1/external/slc7_amd64_gcc700/lib/libopenblas.so.0 (0x00007f9a63d0e000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f9a63b06000)
        libstdc++.so.6 => /cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gcc/7.0.0-pafccj/lib64/libstdc++.so.6 (0x00007f9a674df000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f9a63804000)
        libgcc_s.so.1 => /cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gcc/7.0.0-pafccj/lib64/libgcc_s.so.1 (0x00007f9a674c6000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f9a635e8000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f9a6321b000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f9a67471000)
        libgfortran.so.4 => /cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gcc/7.0.0-pafccj/lib64/libgfortran.so.4 (0x00007f9a63048000)
        libquadmath.so.0 => /cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gcc/7.0.0-pafccj/lib64/libquadmath.so.0 (0x00007f9a63008000)
slava77 commented 5 years ago

@mrodozov @smuzaffar @fabiocos please take a look and comment on https://github.com/cms-sw/cmssw/issues/25230#issuecomment-506422754 It looks like we have multiple libraries providing the symbols and perhaps the libopenblas.so should replace the rest

smuzaffar commented 5 years ago

@slava77 , it looks like the ROOT MathMore library (which was loaded first) is linked against gslcblas. We are looking into the ROOT configuration to see if it can use openblas.

mrodozov commented 5 years ago

Input file: /store/mc/RunIIFall17MiniAODv2/ZprimeToTT_M1000_W10_TuneCP2_13TeV-madgraphMLM-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14_ext1-v1/90000/0C1D564B-DB28-E911-912C-0CC47A4D76D0.root

@hqucms The file is not there anymore; could you put it on AFS or any other place we can copy it from, so we can test with CMSSW? Thanks.

hqucms commented 5 years ago

Thank you for looking into this! @mrodozov I copied the file to my EOS space: /eos/cms/store/user/hqu/forTesting/TT-M1000.root

smuzaffar commented 5 years ago

https://github.com/cms-sw/cmsdist/pull/5063 should allow us to link against openblas instead of gslcblas. Once it is integrated in an IB we will inform you so that you can re-run the tests.

smuzaffar commented 5 years ago

@slava77 and @hqucms , running this test with cms-sw/cmsdist#5063 I get these numbers

TimeReport   0.003751     0.003751     0.003751  pfDeepBoostedJetTags
TimeReport   0.003788     0.003788     0.003788  pfMassDecorrelatedDeepBoostedJetTags

@slava77 , if you need to test it then I can include it in GCC 8 IBs.

kpedro88 commented 5 years ago

It would be great to backport this to 10_6_X so UL reco and analysis can profit from the speed improvement.

slava77 commented 5 years ago

tests in cms-sw/cmsdist#5063 showed a small regression in some b-tag discriminants. We need to understand its origin before merging to the master branch or backporting.

hqucms commented 5 years ago

@slava77 and @hqucms , running this test with cms-sw/cmsdist#5063 I get these numbers

TimeReport   0.003751     0.003751     0.003751  pfDeepBoostedJetTags
TimeReport   0.003788     0.003788     0.003788  pfMassDecorrelatedDeepBoostedJetTags

Thank you @smuzaffar ! These numbers are consistent with my tests -- they are better than the ones I quoted in https://github.com/cms-sw/cmssw/issues/25230#issuecomment-506422754, which I think is due to enabling DYNAMIC_ARCH.

fabiocos commented 5 years ago

@kpedro88 @slava77 as the production for 2017 UL is about to start, if this is not merely technical we need to understand to what extent we can afford to backport things

slava77 commented 5 years ago

@kpedro88 @slava77 as the production for 2017 UL is about to start, if this is not merely technical we need to understand to what extent we can afford to backport things

The impact of cms-sw/cmsdist#5063 is small for the UL. The backports may be more important for the releases where these taggers have to run in nano.

hqucms commented 5 years ago

Another thing: I came across a fairly new ML inference engine called ONNX Runtime recently and did some tests with it. It seems that it can bring another ~1.5x speed-up compared to MXNet+OpenBLAS for DeepAK8. More interestingly, it seems to bring ~3-5x speed-up for the AK4 DeepJet model compared to TensorFlow. Might be interesting to get this into CMSSW.
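The kind of backend comparison behind these numbers can be reproduced with a simple wall-clock harness (a generic sketch; a numpy GEMM is used here as a crude stand-in for the BLAS-dominated inference workload, not the actual DeepAK8/DeepJet models):

```python
import time
import numpy as np

def time_call(fn, repeats=20):
    """Median wall-clock time of fn() over several repeats, in seconds."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return sorted(samples)[len(samples) // 2]

# GEMM dominates the cost of these networks, so a single matrix
# multiplication is a reasonable (if crude) proxy workload; to compare
# backends one would time the same call under each build.
a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)
t = time_call(lambda: a @ b)
print(f"median GEMM time: {t * 1e3:.3f} ms")
```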

fabiocos commented 5 years ago

@hqucms I think that we are mixing different things in this thread:

davidlange6 commented 5 years ago

I'll look at adding the ONNX runtime tool to CMSSW - looks interesting. If it would be a way to avoid the unsupported use of the TensorFlow C++ interface, that would be great (and if it's faster, all the better; of course one has to do a comparison with like-built libraries, which can make a big difference).

On Jul 8, 2019, at 7:25 AM, Fabio Cossutti notifications@github.com wrote:

@hqucms I think that we are mixing different things in this thread:

• effect of revised BLAS implementation used by MXNet: which is the outcome? Do we understand the regression? Is a backport safe in this respect?

• possible alternative engines to MXNet: this is a new development, that may be of course tested upon agreement with the RECO group, but I see it as a separate issue


hqucms commented 5 years ago

@davidlange6 I can make a PR to cmsdist since I have already tested it a bit. Also per @fabiocos's request I opened a new issue (https://github.com/cms-sw/cmssw/issues/27458) regarding ONNX runtime to avoid mixing up two fronts.

davidlange6 commented 5 years ago

ah - that would be great. thanks.

On Jul 8, 2019, at 3:16 PM, Huilin Qu notifications@github.com wrote:

@davidlange6 I can make a PR to cmsdist since I have already tested it a bit. Also per @fabiocos's request I opened a new issue (#27458) regarding ONNX runtime to avoid mixing up two fronts.


slava77 commented 5 years ago

@hqucms I think that we are mixing different things in this thread:

@fabiocos this is a generic thread, opened in November last year. Actually, back then there was some promising development unrelated to either the openblas or the ONNX work, which did not materialize into a solution.

fabiocos commented 5 years ago

@slava77 you are correct, but at some point the discussion was about the regression due to cms-sw/cmsdist#5063, and I believe it would be good to keep that issue distinct from whatever other development, that's all

slava77 commented 5 years ago

@slava77 you are correct, but at some point the discussion was about the regression due to cms-sw/cmsdist#5063, and I believe it would be good to keep that issue distinct from whatever other development, that's all

So, it looks like this regression is an indication of random changes in the algorithm, also seen in https://github.com/cms-sw/cmssw/pull/27488#issuecomment-510675758 and https://github.com/cms-sw/cmssw/pull/27428#issuecomment-510640346

I created #27504 to keep track of the problem

slava77 commented 3 years ago

this can be partly closed now that ONNX inference has been implemented ( #28112; via #27458). However, the total cost is still fairly large after other additions.

Since new inference engines have shown up since late 2018, when this issue was opened, I will open a fresh one (#32883) and close this one.