cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Dealing with RooFit vectorization target #44308

Open makortel opened 8 months ago

makortel commented 8 months ago

In ROOT master, a new vectorizing CPU evaluation backend was made the default in RooFit (see https://github.com/cms-sw/cmsdist/pull/9034#issuecomment-1976310707). By default, RooFit selects between the generic, SSE4.1, AVX, AVX2, and AVX-512 implementations based on the capabilities of the CPU. We should discuss (at least in a future Core Software meeting) and decide how we want to deal with RooFit's vectorized backends in CMS.
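For concreteness, here is a minimal sketch of how a fit can explicitly request a particular evaluation backend, assuming the RooFit::EvalBackend command argument available in recent ROOT (older releases exposed the same choice through RooFit::BatchMode); the toy model and the function name fit_example are only illustrative:

```cpp
// Sketch only: explicitly choosing the RooFit evaluation backend per fit.
// Assumes the RooFit::EvalBackend command argument of recent ROOT releases.
#include <RooRealVar.h>
#include <RooGaussian.h>
#include <RooDataSet.h>
#include <RooFit.h>
#include <RooFit/EvalBackend.h>

void fit_example() {
   RooRealVar x("x", "x", -10., 10.);
   RooRealVar mean("mean", "mean", 0., -10., 10.);
   RooRealVar sigma("sigma", "sigma", 1., 0.1, 10.);
   RooGaussian gauss("gauss", "gauss", x, mean, sigma);

   // The ownership convention of generate() differs between ROOT versions,
   // so keep the deduced type here.
   auto data = gauss.generate(x, 10000);

   // New vectorizing CPU backend (the default discussed in this issue):
   gauss.fitTo(*data, RooFit::EvalBackend::Cpu(), RooFit::PrintLevel(-1));

   // Old scalar evaluation, e.g. to cross-check reproducibility:
   gauss.fitTo(*data, RooFit::EvalBackend::Legacy(), RooFit::PrintLevel(-1));
}
```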

makortel commented 8 months ago

assign core

makortel commented 8 months ago

type root

cmsbuild commented 8 months ago

New categories assigned: core

@Dr15Jones, @makortel, @smuzaffar you have been requested to review this Pull request/Issue and eventually sign. Thanks

cmsbuild commented 8 months ago

cms-bot internal usage

cmsbuild commented 8 months ago

A new Issue was created by @makortel.

@makortel, @Dr15Jones, @antoniovilela, @smuzaffar, @sextonkennedy, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 8 months ago

Copying over comments from https://github.com/cms-sw/cmsdist/pull/9034#issuecomment-1976310707

By @guitargeek

... making the new vectorizing CPU evaluation backend in RooFit the default.

The latter will have a big impact on users, speeding up RooFit likelihood minimizations by up to a factor of 10. The new evaluation backend was carefully validated over the last few years, and I have fixed all problems I was aware of. ... The reason AVX2 code is executed is that RooFit ships with the evaluation library compiled multiple times for different SIMD instruction sets. At runtime, RooFit then dynamically loads the fastest version of the library that is supported by the CPU: https://github.com/root-project/root/blob/master/roofit/batchcompute/src/Initialisation.cxx#L68

In that logic, AVX is preferred over SSE. Is that a problem for CMSSW?
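Schematically, the runtime selection described in the quote follows the pattern below. This is an illustration only, with hypothetical library names and a plain __builtin_cpu_supports/dlopen probe; the actual implementation is in the Initialisation.cxx linked above.

```cpp
// Illustration of the dispatch pattern described above, not RooFit's actual code.
// The library names are made up.
#include <dlfcn.h>
#include <cstdio>

void* loadFastestComputeLibrary() {
   // Probe CPU features from widest to narrowest and pick the matching build
   // of the compute library; fall back to a generic (scalar) build.
   const char* lib = "libComputeGeneric.so";
   if (__builtin_cpu_supports("avx512f"))
      lib = "libComputeAVX512.so";
   else if (__builtin_cpu_supports("avx2"))
      lib = "libComputeAVX2.so";
   else if (__builtin_cpu_supports("avx"))
      lib = "libComputeAVX.so";
   else if (__builtin_cpu_supports("sse4.1"))
      lib = "libComputeSSE4.so";
   std::printf("loading %s\n", lib);
   return dlopen(lib, RTLD_NOW);
}
```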

Reply by @makortel


We would generally want to be in full control of the vectorization target (or as close to that as we can get). Our baseline is still SSE3, but there is ongoing work towards deploying a "multi-architecture" build of CMSSW (plus some selected externals); more information in cms-sw/cmssw#43652.
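As an illustration of what a "multi-architecture" binary can mean at the code level, compiler function multi-versioning is one generic mechanism; the sketch below only shows the concept and is not taken from cms-sw/cmssw#43652, which may use a different approach (e.g. separate builds per x86-64 micro-architecture level).

```cpp
// Conceptual sketch of carrying several vectorization targets in one binary via
// GCC/Clang function multi-versioning. Not taken from cms-sw/cmssw#43652.
#include <cstddef>

__attribute__((target_clones("default", "avx", "avx2", "avx512f")))
double dot(const double* a, const double* b, std::size_t n) {
   // The compiler emits one clone per listed target plus a resolver that picks
   // the widest variant the CPU supports when the symbol is resolved.
   double sum = 0.0;
   for (std::size_t i = 0; i < n; ++i)
      sum += a[i] * b[i];
   return sum;
}
```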

We have some exceptions to this general approach:

  • TensorFlow (and, I believe, also ONNX) are allowed to use their more dynamic mechanisms for wider-than-SSE3 vectorization targets
  • We don't try to prevent any dynamic behavior of glibc

With TensorFlow we have had quite a lot of trouble, mostly but not only in special cases (some of the story is recorded in cms-sw/cmssw#42444 and other issues linked there). On a somewhat related note, cms-sw/cmssw#44188 shows some of the "fun" we are currently dealing with in Eigen (which I hope is not very relevant for our use of ROOT).

I see there is already a way for a user to select the target binary, so at a minimum we could use that. Do I understand correctly that SSE3 would correspond to generic?

I'm quite sure CMS would, for example, want to skip the original AVX implementation because of the frequency-scaling behavior of CPUs of that era.

Anyway, I think in CMS we need to discuss further how we want to deal with RooFit's default dynamic behavior. What kind of guarantees does RooFit give for the reproducibility of fit results between different vectorization targets?