cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

XGBoost different inference results on AMD64, ARM and PPC #44542

Open smorovic opened 3 months ago

smorovic commented 3 months ago

As discussed in PR https://github.com/cms-sw/cmssw/pull/44473, we noticed a discrepancy in the XGBoost inference results with the new unit test RecoEgamma/PhotonIdentification/test/test_PhotonMvaXgb.cc. The unit test passes on x86_64, but fails in identical fashion and with identical discrepancies on both PPC64 LE and ARM64, in 4 out of 10 tests:

src/RecoEgamma/PhotonIdentification/test/test_PhotonMvaXgb.cc:44: FAILED:
  CHECK_THAT( xgbScore, Catch::Matchers::WithinAbs(mva_score_v1[i], 0.0001) )
with expansion:
  0.91074f is within 0.0001 of 0.9863399863

src/RecoEgamma/PhotonIdentification/test/test_PhotonMvaXgb.cc:44: FAILED:
  CHECK_THAT( xgbScore, Catch::Matchers::WithinAbs(mva_score_v1[i], 0.0001) )
with expansion:
  0.82656f is within 0.0001 of 0.9750099778

src/RecoEgamma/PhotonIdentification/test/test_PhotonMvaXgb.cc:44: FAILED:
  CHECK_THAT( xgbScore, Catch::Matchers::WithinAbs(mva_score_v1[i], 0.0001) )
with expansion:
  0.0f is within 0.0001 of 0.00179

src/RecoEgamma/PhotonIdentification/test/test_PhotonMvaXgb.cc:44: FAILED:
  CHECK_THAT( xgbScore, Catch::Matchers::WithinAbs(mva_score_v1[i], 0.0001) )
with expansion:
  0.93808f is within 0.0001 of 0.9837399721

A PR was submitted to disable the check on non-x86_64 for now: https://github.com/cms-sw/cmssw/pull/44531

cmsbuild commented 3 months ago

cms-bot internal usage

cmsbuild commented 3 months ago

A new Issue was created by @smorovic.

@makortel, @Dr15Jones, @smuzaffar, @antoniovilela, @sextonkennedy, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 3 months ago

assign reconstruction, ml

cmsbuild commented 3 months ago

New categories assigned: reconstruction,ml

@jfernan2,@mandrenguyen,@valsdav,@wpmccormack you have been requested to review this Pull request/Issue and eventually sign? Thanks

smorovic commented 3 months ago

I fetched xgboost v1.7.5 (and, for the dmlc-core subpackage, the master branch) from GitHub and compiled with default options (cmake . ; make). On x86_64 RHEL8 (gcc8) as well as on lxplus9-arm (gcc11) it gives the same result for the test, which is consistent with the first check in the unit test.

#include <stdio.h>
#include <string>
#include <sstream>
#include <iostream>
extern "C" {
#include "xgboost/c_api.h"
}

int main() {
  int best_ntree_limit_ = 158;
  BoosterHandle booster;
  XGBoosterCreate(NULL, 0, &booster);
  XGBoosterLoadModel(booster, "/afs/cern.ch/user/s/smorovic/public/Photon_NTL_158_Endcap_v1.bin");

  // Restrict prediction to the first best_ntree_limit_ trees, as the unit test does.
  std::stringstream config;
  config << "{\"training\": false, \"type\": 0, \"iteration_begin\": 0, \"iteration_end\": " << best_ntree_limit_
         << ", \"strict_shape\": false}";
  std::string config_ = config.str();

  // Input features of the first test photon from the unit test.
  float var[9];
  var[0] = 134.303;
  var[1] = 0.945981;
  var[2] = 0.0264346;
  var[3] = 0.012448;
  var[4] = 0.0208734;
  var[5] = 113.405;
  var[6] = 1.7446;
  var[7] = 0.00437808;
  var[8] = 0.303464;

  // Single-row matrix; -999.9 is the missing-value sentinel.
  DMatrixHandle dmat;
  XGDMatrixCreateFromMat(var, 1, 9, -999.9f, &dmat);
  uint64_t const* out_shape;
  uint64_t out_dim;
  const float* out_result = NULL;
  XGBoosterPredictFromDMatrix(booster, dmat, config_.c_str(), &out_shape, &out_dim, &out_result);
  float ret = out_result[0];
  XGDMatrixFree(dmat);
  XGBoosterFree(booster);

  float exp = 0.98634;
  printf(" ===TEST===     VAL: %f\n", ret);
  printf(" ===EXPECTED=== VAL: %f\n", exp);
  return 0;
}
 ===TEST===     VAL: 0.986344
 ===EXPECTED=== VAL: 0.986340

With this cmsdist patch applied: https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_14_1_X/master/xgboost-arm-and-ppc.patch there is also no difference.

Finally, compiling with gcc12 (14_0_2 cmsenv) also gives the same result.

Making the above into a unit test and running it on ARM actually passes:

Fail    2s ... RecoEgamma/PhotonIdentification/RecoEgammaPhotonIdentificationTest
Pass    1s ... RecoEgamma/PhotonIdentification/RecoEgammaPhotonIdentificationTest2

so I'm starting to suspect that there is something wrong with the unit test itself (regarding portability).

smorovic commented 3 months ago

Found it. Changing if (abs(etaSC) < 1.5) to if (std::abs(etaSC) < 1.5) fixes the unit test on ARM:

Pass    1s ... RecoEgamma/PhotonIdentification/RecoEgammaPhotonIdentificationTest
Pass    1s ... RecoEgamma/PhotonIdentification/RecoEgammaPhotonIdentificationTest2

I noticed earlier that on x86_64 abs is a floating point version, but apparently not on other architectures?

I will push this fix to the unit test (reusing the existing PR and backport).

makortel commented 3 months ago

I noticed earlier that on x86_64 abs is a floating point version, but apparently not on other architectures?

The C abs() function is actually for int, whereas fabs() is for double, i.e. the behavior on ARM was the correct one.

The only (or "easiest") way I could imagine causing the compiler's behavior on x86-64 would be that it somehow picked the C++ std::abs() instead of the C abs(). But even that would sound strange.

VinInn commented 3 months ago

It sounds really strange!

smorovic commented 3 months ago

Maybe some architecture-specific header redefines it, for example #define abs fabs?

VinInn commented 3 months ago

The only way I can make it happen is to have

using namespace std;

or

using std::abs;

somehow conditional on compiling on x86_64

VinInn commented 3 months ago
 grep abs /data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/include/c++/12.3.1/stdlib.h
using std::abs;
using std::labs;

If one wrongly does #include <stdlib.h>, it happens (also on ARM, by the way); with #include <cstdlib> it's OK (abs stays the integer version, std::abs is the templated/overloaded one).

Ditto for #include <math.h> vs #include <cmath>.

Play with https://godbolt.org/z/7fs559aWx

To track it down:
1) find where math.h or stdlib.h gets included
2) see if it is included only on x86_64 (clue: the various intrinsics headers)

grep malloc /data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/lib/gcc/x86_64-redhat-linux-gnu/12.3.1/include/xmmintrin.h
/* Get _mm_malloc () and _mm_free ().  */
#include <mm_malloc.h>

[innocent@gputest-genoa-01 (gpu-c2e35-08-01) CMSSW_14_0_0]$ cat /tmp/test_PhotonMvaXgb.i | less
[innocent@gputest-genoa-01 (gpu-c2e35-08-01) CMSSW_14_0_0]$ grep stdlib /data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/lib/gcc/x86_64-redhat-linux-gnu/12.3.1/include/mm_malloc.h
#include <stdlib.h>

VinInn commented 3 months ago

with an explicit include of xmmintrin.h

https://godbolt.org/z/cxo65rnr9

makortel commented 3 months ago

with an explicit include of xmmintrin.h

Yeah, xmmintrin.h has

#include <mm_malloc.h>

which has

#include <stdlib.h>

(ah, this information was already in https://github.com/cms-sw/cmssw/issues/44542#issuecomment-2020613291)

VinInn commented 3 months ago

let's see what they say https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114484

slava77 commented 3 months ago

I'm a bit lost in this thread discussion. Isn't explicit use of std::abs in our coding rule? Shouldn't the related unit test change from abs to std::abs? Or is there evidence that std::abs can get overwritten in some circumstances by a C abs?

VinInn commented 3 months ago

On Mar 26, 2024, at 4:14 PM, Slava Krutelyov @.***> wrote:

I'm a bit lost in this thread discussion. Isn't explicit use of std::abs in our coding rule?

YES.

Shouldn't the related unit test change from abs to std::abs?

YES.

Or is there evidence that std::abs can get overwritten in some circumstances by a C abs?

NO. The opposite: C abs is overwritten by C++ std::abs in some circumstances.

smorovic commented 3 months ago

I'm a bit lost in this thread discussion. Isn't explicit use of std::abs in our coding rule? Shouldn't the related unit test change from abs to std::abs? Or is there evidence that std::abs can get overwritten in some circumstances by a C abs?

The unit test is going to be fixed to use std::abs; this is already submitted in the two PRs mentioned above. In fact, std::abs was already correctly used in the related non-unit-test code.

mmusich commented 3 months ago

Isn't explicit use of std::abs in our coding rule?

well, I would have also thought so, but there is still discussion: https://github.com/cms-sw/cms-sw.github.io/pull/99#discussion_r1526394978

guitargeek commented 2 months ago

Why didn't you just stick with the very nice GBRForest you have in CMSSW for the MVA? Then you don't need the XGBoost C API dependency, and the GBRForest has even better performance! And it's also more platform independent. Furthermore, these ML tools evolve quickly, so maintaining the dependency can be work.

Translating from XGBoost to GBRForest is easy. I do this in my library, where I renamed the GBRForest from CMS as "FastForest", but the code is almost the same.

Actually I'm about to bring the GBRForest into ROOT itself, rebranded as "RBDT" this time :laughing:

So if at some point CMS uses a newer ROOT version (6.32 I guess) and wants to avoid the XGBoost dependency, it will be very easy. Meaning if the issue is not urgent, it can also be waited out.

smorovic commented 2 months ago

Why didn't you just stick with the very nice GBRForest you have in CMSSW for the MVA? Then you don't need the XGBoost C API dependency, and the GBRForest has even better performance! And it's also more platform independent. Furthermore, these ML tools evolve quickly, so maintaining the dependency can be work.

Translating from XGBoost to GBRForest is easy. I do this in my library, where I renamed the GBRForest from CMS as "FastForest", but the code is almost the same.

Actually I'm about to bring the GBRForest into ROOT itself, rebranded as "RBDT" this time 😆

So if at some point CMS uses a newer ROOT version (6.32 I guess) and wants to avoid the XGBoost dependency, it will be very easy. Meaning if the issue is not urgent, it can also be waited out.

Hi @guitargeek, we looked at it, but we didn't go with it initially due to difficulties in translating models. We tried with XGBoost2TMVA and didn't get the results we expected (it changes the classifier output from the [0,1] range to the [-1,1] range used by TMVA). And then we noticed that XGBoost is integrated as a tool, probably because of Python users...

We are aware that performance should be better with GBRForest. The C API is optimized to run on multiple rows and is relatively heavy on allocations when preparing to run inference. Besides, it is also much slower on the full menu than when running only selected paths (30 times!); I suspect either caching or heap allocations causing that. Still, for now, the impact on HLT menus is low (around 0.3%), which was accepted by TSG.

Therefore, initially we are using the C API directly. I agree that we should try to migrate to GBRForest in subsequent releases.

Thank you for suggesting the tool; I will have a look at it (in a week, when I'm back from vacation). A possible caveat is that the "NTree limit" parameter of the model needs to be passed explicitly to the XGBoost C API (or even the Python API, when loading from the "bin" files that we're using), or inference always returns a different result. Maybe it will be fine with txt files, but we need to try.

guitargeek commented 2 months ago

Thanks a lot for your answer, even from your vacation! Indeed the performance hit is big and you have to do memory allocations. But if you studied the performance impact and it turned out to be minimal, that's good.

I actually forgot about XGBoost2TMVA; indeed that would have been the easiest solution, because the GBRForest can read it directly, if I remember correctly. I very well remember this problem with different outputs! I forgot the details, but applying an (inverse) logistic transformation and/or a simple re-scaling did the trick for me. Just plot the GBRForest and XGBoost outputs against each other in a scatter plot, and the functional relation will become apparent.

jfernan2 commented 2 weeks ago

@smorovic any further progress on this? Or has the issue been resolved with your PR? Thanks

smorovic commented 2 weeks ago

Hello, the problem with excessive creation of OpenMP threads was resolved by a PR, so there is no urgent need to replace XGBoost.

Concerning migration away from the XGBoost library: about two weeks ago I was looking at how to convert the current "bin" files to TMVA models. So far I have failed to get something useful. It doesn't help that I couldn't find the code from my older attempt at this with XGBoost2TMVA. Once I manage it, I will start numerically comparing GBRForest inference to XGBoost and, if needed, see how to limit iterations to get accurate results wrt. XGBoost.