With #24918 included, b-tag-related ML inference is at around 20% of the miniAOD processing time.
Here are the averages (excluding peak [initialization?] time) from 1K events in workflow 136.8311, running with this PR on top of 10_4_0_pre1.
This issue is to keep track of the progress, aiming for a reduction. The next BTV-related jet-tag update should not increase the total time.
@ferencek @mverzett @kskovpen @andrzejnovak @jmduarte @hqucms
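For reference, per-module averages like these come from the framework TimeReport summary; a minimal sketch of enabling it in a cfg (assuming the standard FWCore Timing service, not necessarily the exact options used for these measurements):

import FWCore.ParameterSet.Config as cms
# inside an existing cfg with a process already defined:
process.Timing = cms.Service("Timing",
    summaryOnly = cms.untracked.bool(True),
    useJobReport = cms.untracked.bool(True)
)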
assign reconstruction
New categories assigned: reconstruction
@slava77,@perrotta you have been requested to review this Pull request/Issue and eventually sign? Thanks
A new Issue was created by @slava77 Slava Krutelyov.
@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
For miniAOD, one way to reduce the total time used by ML tagging algorithms is to run them only on the jets for which the results will actually be used/stored. This is relevant for AK8 jets, since the low-pt ones are not stored. My understanding is that DeepAK8 has an enforced pT cut in the tagger, while DeepDoubleB does not, but maybe this is out of date.
Tagging a few more people: @gouskos @rappoccio
That's a good point. The same preselection could be used: run the tagger only on the constituents of a preclustered sample, with a pT cut applied to the jets.
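A minimal sketch of such a preselection in a cfg (the module label and pT threshold here are illustrative, not a tested recipe):

import FWCore.ParameterSet.Config as cms
# keep only high-pT AK8 jets before running the taggers on their constituents
selectedHighPtJetsAK8 = cms.EDFilter("PATJetSelector",
    src = cms.InputTag("slimmedJetsAK8"),
    cut = cms.string("pt > 170")
)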
Also, @violatingcp found that for CNN inference, TensorFlow v1.10 is 2.4x faster than TensorFlow v1.6 (which we're currently using).
On 11/15/18 8:23 AM, Kevin Pedro wrote:
> Also, @violatingcp found that for CNN inference, TensorFlow v1.10 is 2.4x faster than TensorFlow v1.6 (which we're currently using).
Is this based on a similarly built version? Recall that the CMSSW production version is compiled with SSE3 and does not use AVX/AVX512 instructions.
Ah, so TF v1.6 used the standard CMSSW build. Indeed, it looks like the TF version I am using was built with AVX, but not AVX2. I can disable it and try to rerun.
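One crude cross-check (a heuristic sketch; the library name depends on the TF build in use) is to compare the CPU flags with the instructions actually present in the library:

> grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u
> objdump -d libtensorflow_framework.so | grep -c '%ymm'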
@ferencek @mverzett @kskovpen @andrzejnovak @jmduarte @hqucms was there any significant progress on speeding up ML b-tags in recent months? Please comment. Thank you.
Not really. We are looking into ways to reduce the input size and eventually the graph size, but we are a long way from a usable solution.
@slava77 In some recent studies, we found a significant difference (4-5x) in inference speed between standalone MXNet and MXNet in CMSSW. After some investigation, the cause seems to be the linking with the BLAS library. The standalone MXNet is statically linked to OpenBLAS. The one in CMSSW, despite being dynamically linked to OpenBLAS [0], does not seem to call into OpenBLAS when running inside cmsRun.
A simple test to show this:
[Test setup]
Using CMSSW_10_6_1 on lxplus7
Python cfg: RecoBTag/MXNet/test/test_deep_boosted_jet_cfg.py
Input file: /store/mc/RunIIFall17MiniAODv2/ZprimeToTT_M1000_W10_TuneCP2_13TeV-madgraphMLM-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14_ext1-v1/90000/0C1D564B-DB28-E911-912C-0CC47A4D76D0.root (copied locally)
> cmsRun test_deep_boosted_jet_cfg.py inputFiles=file:TT-M1000.root maxEvents=1000 >& orig.log &
TimeReport 0.020148 0.020148 0.020148 pfDeepBoostedJetTags
TimeReport 0.018684 0.018684 0.018684 pfMassDecorrelatedDeepBoostedJetTags
> LD_PRELOAD=/cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/OpenBLAS/0.3.5/lib/libopenblas.so cmsRun test_deep_boosted_jet_cfg.py inputFiles=file:TT-M1000.root maxEvents=1000 >& preload_openblas_cmssw.log &
TimeReport 0.005091 0.005091 0.005091 pfDeepBoostedJetTags
TimeReport 0.004962 0.004962 0.004962 pfMassDecorrelatedDeepBoostedJetTags
When using LD_PRELOAD we saw a speedup of ~4x, roughly consistent with standalone MXNet now.
(BTW, this still leaves a ~20-30% difference, which is likely because the CMSSW OpenBLAS is compiled with DYNAMIC_ARCH=0 and is therefore unable to exploit newer instructions, e.g. AVX, available on the machine -- maybe we should also consider enabling DYNAMIC_ARCH?)
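One way to see which library actually wins the symbol resolution at runtime is the dynamic linker's built-in tracing (a sketch; the output is verbose and goes to stderr):

> LD_DEBUG=bindings cmsRun test_deep_boosted_jet_cfg.py inputFiles=file:TT-M1000.root maxEvents=1 2>&1 | grep cblas_sgemm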
A search through the external libraries reveals two other libraries that provide the same cblas functions as OpenBLAS:
/cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_1/external/slc7_amd64_gcc700/lib/libcblas.so
0000000000007c44 T cblas_sgemm
/cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_1/external/slc7_amd64_gcc700/lib/libgslcblas.so
0000000000002c10 T cblas_sgemm
/cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_1/external/slc7_amd64_gcc700/lib/libopenblas.so
000000000008e010 T cblas_sgemm
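For reference, a listing like the one above can be produced along these lines (a sketch, assuming a CMSSW environment where $CMSSW_RELEASE_BASE and $SCRAM_ARCH are set):

> for lib in libcblas.so libgslcblas.so libopenblas.so; do echo $lib; nm -D --defined-only $CMSSW_RELEASE_BASE/external/$SCRAM_ARCH/lib/$lib | grep ' cblas_sgemm$'; done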
It seems that one of them is loaded before OpenBLAS and is therefore hiding the OpenBLAS symbols. Any suggestion on how to fix this?
Thank you!
[0]
> ldd /cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/mxnet-predict/1.2.1-pafccj2/lib/libmxnetpredict.so
linux-vdso.so.1 => (0x00007ffd9af8a000)
libopenblas.so.0 => /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_1/external/slc7_amd64_gcc700/lib/libopenblas.so.0 (0x00007f9a63d0e000)
librt.so.1 => /lib64/librt.so.1 (0x00007f9a63b06000)
libstdc++.so.6 => /cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gcc/7.0.0-pafccj/lib64/libstdc++.so.6 (0x00007f9a674df000)
libm.so.6 => /lib64/libm.so.6 (0x00007f9a63804000)
libgcc_s.so.1 => /cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gcc/7.0.0-pafccj/lib64/libgcc_s.so.1 (0x00007f9a674c6000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f9a635e8000)
libc.so.6 => /lib64/libc.so.6 (0x00007f9a6321b000)
/lib64/ld-linux-x86-64.so.2 (0x00007f9a67471000)
libgfortran.so.4 => /cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gcc/7.0.0-pafccj/lib64/libgfortran.so.4 (0x00007f9a63048000)
libquadmath.so.0 => /cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gcc/7.0.0-pafccj/lib64/libquadmath.so.0 (0x00007f9a63008000)
@mrodozov @smuzaffar @fabiocos please take a look and comment on https://github.com/cms-sw/cmssw/issues/25230#issuecomment-506422754. It looks like we have multiple libraries providing the same symbols; perhaps libopenblas.so should replace the rest.
@slava77, it looks like the ROOT MathMore library (which was loaded first) is linked against gslcblas. We are looking into the ROOT configuration to see if it can use openblas.
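A quick way to check this (assuming root-config is available in the environment):

> ldd $(root-config --libdir)/libMathMore.so | grep -i blas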
> Input file: /store/mc/RunIIFall17MiniAODv2/ZprimeToTT_M1000_W10_TuneCP2_13TeV-madgraphMLM-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14_ext1-v1/90000/0C1D564B-DB28-E911-912C-0CC47A4D76D0.root
@hqucms The file is not there anymore; would you put it somewhere on AFS, or any other place we can copy it from, so we can test with CMSSW please? Thanks
Thank you for looking into this! @mrodozov I copied the file to my EOS space: /eos/cms/store/user/hqu/forTesting/TT-M1000.root
https://github.com/cms-sw/cmsdist/pull/5063 should allow us to link against openblas instead of gslcblas. Once it is integrated in an IB we will inform you so that you can re-run the tests.
@slava77 and @hqucms, running this test with cms-sw/cmsdist#5063 I get these numbers:
TimeReport 0.003751 0.003751 0.003751 pfDeepBoostedJetTags
TimeReport 0.003788 0.003788 0.003788 pfMassDecorrelatedDeepBoostedJetTags
@slava77, if you need to test it then I can include it in the GCC 8 IBs.
It would be great to backport this to 10_6_X so that UL reco and analysis can profit from the speed improvement.
Tests in cms-sw/cmsdist#5063 showed a small regression in some b-tag discriminants. We need to understand its origin before merging to the master branch or backporting.
> @slava77 and @hqucms, running this test with cms-sw/cmsdist#5063 I get these numbers:
> TimeReport 0.003751 0.003751 0.003751 pfDeepBoostedJetTags
> TimeReport 0.003788 0.003788 0.003788 pfMassDecorrelatedDeepBoostedJetTags
Thank you @smuzaffar ! These numbers are consistent with my tests -- they are better than the ones I quoted in https://github.com/cms-sw/cmssw/issues/25230#issuecomment-506422754, which I think is due to the enabling of DYNAMIC_ARCH.
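For context, DYNAMIC_ARCH is a plain OpenBLAS build-time option, e.g. when building OpenBLAS from source:

> make DYNAMIC_ARCH=1

It builds kernels for multiple CPU types and dispatches among them at runtime, rather than targeting only the build machine's architecture.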
@kpedro88 @slava77 as the production for 2017 UL is about to start, if this is not merely technical we need to understand to what extent we can afford to backport things.
> @kpedro88 @slava77 as the production for 2017 UL is about to start, if this is not merely technical we need to understand to what extent we can afford to backport things.
The impact of cms-sw/cmsdist#5063 is small for the UL. The backports may be more important for the releases where these taggers have to run in nano.
Another thing: I recently came across a fairly new ML inference engine called ONNX Runtime and did some tests with it. It seems it can bring another ~1.5x speed-up compared to MXNet+OpenBLAS for DeepAK8. More interestingly, it seems to bring a ~3-5x speed-up for the AK4 DeepJet model compared to TensorFlow. It might be interesting to get this into CMSSW.
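For illustration, the standalone ONNX Runtime Python API is quite compact (a sketch only, not the CMSSW integration; the model file name is a placeholder):

import numpy as np
import onnxruntime as ort

# load an exported ONNX model and run inference on a dummy input
sess = ort.InferenceSession("deepak8.onnx")
inp = sess.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in inp.shape]  # symbolic dims -> 1
x = np.zeros(shape, dtype=np.float32)
scores = sess.run(None, {inp.name: x})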
@hqucms I think that we are mixing different things in this thread:
- the effect of the revised BLAS implementation used by MXNet: what is the outcome? Do we understand the regression? Is a backport safe in this respect?
- possible alternative engines to MXNet: this is a new development, which may of course be tested upon agreement with the RECO group, but I see it as a separate issue
i'll look at adding the onnx runtime tool to cmssw - looks interesting... if it would be a way to avoid the non-supported use of the tensorflow c++ interface, that would be great (and if it's faster, that's all the better; of course one has to do a comparison with like-built libraries, which can make a big difference)
@davidlange6 I can make a PR to cmsdist since I have already tested it a bit. Also per @fabiocos's request I opened a new issue (https://github.com/cms-sw/cmssw/issues/27458) regarding ONNX runtime to avoid mixing up two fronts.
ah - that would be great. thanks.
> @hqucms I think that we are mixing different things in this thread:
@fabiocos this is a generic thread, opened in November last year. Actually, back then there was some promising development, unrelated to either openblas or ONNX, which did not materialize into a solution.
@slava77 you are correct, but at some point the discussion was about the regression due to cms-sw/cmsdist#5063, and I believe it would be good to keep that issue distinct from whatever other development, that's all.
> @slava77 you are correct, but at some point the discussion was about the regression due to cms-sw/cmsdist#5063, and I believe it would be good to keep that issue distinct from whatever other development, that's all.
So, it looks like this regression is an indication of random changes in the algorithm, also seen in https://github.com/cms-sw/cmssw/pull/27488#issuecomment-510675758 and https://github.com/cms-sw/cmssw/pull/27428#issuecomment-510640346
I created #27504 to keep track of the problem
This could partly be closed now, after ONNX inference was implemented (#28112; via #27458). However, the total cost is still fairly large after other additions.
New inference engines have shown up since late 2018, when this issue was opened, so I will open a fresh one (#32883) and close this one.