cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

[TF] Eigen unit tests on GPU failed #46333

Open smuzaffar opened 3 hours ago

smuzaffar commented 3 hours ago

Hi,

For the TensorFlow special IBs (TF_X), where we have TF 2.17 (CUDA build enabled) and a new Eigen https://github.com/cms-externals/eigen-git-mirror/tree/cms/master/c1d637433e3b3f9012b226c2c9125c494b470ae6 , a few unit tests that use Eigen are failing [a]. To reproduce this, one can do:

> ssh lxplus-gpu
> cd /tmp/$(whoami)
> cmssw-el8 --nv
> scram p CMSSW_14_2_TF_X_2024-10-08-1100
> cd CMSSW_14_2_TF_X_2024-10-08-1100
> cmsenv
> git cms-addpkg RecoTracker/PixelTrackFitting
> scram b -j 8
> scram b runtests_testEigenGPUNoFit_t

Note that we do apply the https://github.com/cms-externals/eigen-git-mirror/commit/3cbe8e768c9c51af49d533eee3f3e96fd53e13d7 patch on top of Eigen. So maybe we are missing something to patch?

@fwyzard, do you have any idea how to fix this?

[a]

Pass    0s ... RecoTracker/PixelTrackFitting/testFits
Pass    0s ... RecoTracker/PixelTrackFitting/testFitsDump
Pass    0s ... RecoTracker/PixelTrackFitting/testEigenJacobian
Pass    0s ... RecoTracker/PixelTrackFitting/testRecoPixelVertexingPixelTrackFittingRZLine
Fail    3s ... RecoTracker/PixelTrackFitting/testFitsGPU_t
Fail    3s ... RecoTracker/PixelTrackFitting/testBrokenLineFitGPU_t
Fail    3s ... RecoTracker/PixelTrackFitting/testEigenGPUNoFit_t
Pass  158s ... RecoTracker/PixelTrackFitting/PixelTrackFits
Pass  158s ... RecoTracker/PixelTrackFitting/PixelTrackFits_Debug
Pass  158s ... RecoTracker/PixelTrackFitting/PixelTrackBrokenLineFit
> cat unit_tests/testEigenGPUNoFit_t.log
===== Test "testEigenGPUNoFit_t" ====
TEST EIGENVALUES
TEST INVERSE 3x3
TEST INVERSE 4x4
TEST INVERSE 5x5
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02858/el8_amd64_gcc12/external/eigen/c1d637433e3b3f9012b226c2c9125c494b470ae6-42b72b714d1a11d439b86af5ed2418e1/include/eigen3/Eigen/src/Core/PermutationMatrix.h:184: Derived &Eigen::PermutationBase<Derived>::applyTranspositionOnTheRight(long, long) [with Derived = Eigen::PermutationMatrix<5, 5, int>]: block: [0,0,0], thread: [0,0,0] Assertion `i >= 0 && j >= 0 && i < size() && j < size()` failed.
terminate called after throwing an instance of 'std::runtime_error'
  what():  
src/RecoTracker/PixelTrackFitting/test/testEigenGPUNoFit.cu, line 173:
cudaCheck(cudaMemcpy(mCPUret, mGPUret, sizeof(Matrix5d), cudaMemcpyDeviceToHost));
cudaErrorAssert: device-side assert triggered

/bin/sh: line 1: 3864396 Aborted                 (core dumped) sh -c 'testEigenGPUNoFit_t '

---> test testEigenGPUNoFit_t had ERRORS
TestTime:3
^^^^ End Test testEigenGPUNoFit_t ^^^^
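
For context on the assertion in the log above: it fires in `PermutationBase::applyTranspositionOnTheRight` during the 5x5 inverse, which suggests that a row-swap index reaching the permutation on the device is out of range. As far as I understand, Eigen's inverse for sizes above 4x4 goes through a partial-pivoting LU, whose recorded row transpositions are then folded into a permutation. The host-only sketch below is not Eigen's code; `Mat`, `Permutation`, `luPartialPivot` and `toPermutation` are hypothetical names used only to illustrate the precondition that fails (an exception stands in for the device-side assert).

```cpp
#include <array>
#include <cmath>
#include <stdexcept>
#include <utility>

constexpr int N = 5;
using Mat = std::array<std::array<double, N>, N>;

// Hypothetical stand-in for a 5x5 permutation, stored as an index array.
struct Permutation {
  std::array<int, N> indices;

  Permutation() {
    for (int k = 0; k < N; ++k) indices[k] = k;  // start from the identity
  }

  // Same precondition as the assertion in the log:
  // i >= 0 && j >= 0 && i < size() && j < size()
  Permutation& applyTranspositionOnTheRight(long i, long j) {
    if (!(i >= 0 && j >= 0 && i < N && j < N))
      throw std::out_of_range("transposition index out of range");
    std::swap(indices[i], indices[j]);
    return *this;
  }
};

// In-place LU decomposition with partial pivoting.
// transpositions[k] records which row was swapped with row k at step k.
// Returns false if a zero pivot is found (singular matrix).
bool luPartialPivot(Mat& a, std::array<int, N>& transpositions) {
  for (int k = 0; k < N; ++k) {
    // Pick the row with the largest absolute value in column k.
    int pivot = k;
    for (int r = k + 1; r < N; ++r)
      if (std::abs(a[r][k]) > std::abs(a[pivot][k])) pivot = r;
    transpositions[k] = pivot;
    if (a[pivot][k] == 0.0) return false;
    if (pivot != k) std::swap(a[k], a[pivot]);
    // Eliminate column k below the diagonal.
    for (int r = k + 1; r < N; ++r) {
      a[r][k] /= a[k][k];
      for (int c = k + 1; c < N; ++c) a[r][c] -= a[r][k] * a[k][c];
    }
  }
  return true;
}

// Fold the recorded row swaps into a permutation, last swap first.
// Every recorded index must lie in [0, N); a corrupted or uninitialised
// entry trips the range check above.
Permutation toPermutation(const std::array<int, N>& transpositions) {
  Permutation p;
  for (int k = N - 1; k >= 0; --k)
    p.applyTranspositionOnTheRight(k, transpositions[k]);
  return p;
}
```

On the host, every index written by `luPartialPivot` is in range by construction, so `toPermutation` only throws if the transposition array is modified or left uninitialised afterwards. That is one possible reading of the GPU failure, not a confirmed diagnosis.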
smuzaffar commented 3 hours ago

assign RecoTracker/PixelTrackFitting

cmsbuild commented 3 hours ago

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 3 hours ago

cms-bot internal usage

cmsbuild commented 3 hours ago

A new Issue was created by @smuzaffar.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

fwyzard commented 2 hours ago

FYI I will not be able to look into this (or other issues) until the end of November.