acts-project / traccc

Demonstrator tracking chain on accelerators
Mozilla Public License 2.0

CUDA Reconstruction Stuck (2024.05.04.) #569

Open krasznaa opened 5 months ago

krasznaa commented 5 months ago

Processing every ODD ttbar event that I made in #561, I have one that makes the reconstruction run forever. :confused:

[bash][Legolas]:traccc > ./out/build/sycl/bin/traccc_seq_example_cuda --detector-file=geometries/odd/odd-detray_geometry_detray.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --digitization-file=geometries/odd/odd-digi-geometric-config.json --input-directory=odd/geant4_ttbar_mu200/ --input-events=1 --input-skip=2 

Running Full Tracking Chain Using CUDA

>>> Detector Options <<<
  Detector file       : geometries/odd/odd-detray_geometry_detray.json
  Material file       : 
  Surface rid file    : geometries/odd/odd-detray_surface_grids_detray.json
  Use detray::detector: yes
  Digitization file   : geometries/odd/odd-digi-geometric-config.json
>>> Input Data Options <<<
  Input data format             : csv
  Input directory               : odd/geant4_ttbar_mu200/
  Number of input events        : 1
  Number of input events to skip: 2
>>> Clusterization Options <<<
  Target cells per partition: 1024
>>> Track Seeding Options <<<
  None
>>> Track Finding Options <<<
  Track candidates range   : 3:100
  Minimum step length for the next surface: 0.5 [mm] 
  Maximum step counts for the next surface: 100
  Maximum Chi2             : 30
  Maximum branches per step: 4294967295
  Maximum number of skipped steps per candidates: 3
>>> Track Propagation Options <<<
  Constraint step size  : 3.40282e+38 [mm]
  Overstep tolerance    : -100 [um]
  Minimum mask tolerance: 1e-05 [mm]
  Maximum mask tolerance: 1 [mm]
  Search window         : 0 x 0
  Runge-Kutta tolerance : 0.0001
>>> Performance Measurement Options <<<
  Run performance checks: no
>>> Accelerator Options <<<
  Compare with CPU results: no

WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: @traccc::io::csv::read_cells: 17547 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000002-cells.csv

The application is just stuck on that file, with both my CPU and GPU reporting to be busy. :thinking:

I don't see this behaviour on any of the other events that I simulated. So I can imagine two things:

In the end, both of them are the same. :thinking: Since even on "bad events" we can't afford to go into an endless loop with our code.

Note that I didn't find yet which algorithm/kernel is doing it. Unfortunately attaching cuda-gdb to a running process is a lot more difficult than I first thought. :frowning: So I thought I'd open the issue with just this little information for now.

beomki-yeo commented 5 months ago

No idea. Is this the right value to use? Maximum branches per step: 4294967295 (the name should be "maximum branches per seed"). I would set this to 10 and try again.

stephenswat commented 5 months ago

No idea. Is this the right value to use? Maximum branches per step: 4294967295 (the name should be "maximum branches per seed"). I would set this to 10 and try again.

Note that this is $2^{32} - 1$, so most likely an unsigned integer underflow.
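
For reference, a value of 4294967295 can come about in two ways in C++: as the largest value of a 32-bit unsigned integer (for example when it is used as a "no limit" default), or from an unsigned subtraction that wraps around. The sketch below only illustrates these two possibilities. It assumes the option is stored as a 32-bit unsigned integer, which this thread does not show, and `default_max_branches` and `wrapped_difference` are hypothetical names, not traccc functions.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical "no limit" default: the largest 32-bit unsigned value.
constexpr std::uint32_t default_max_branches() {
    return std::numeric_limits<std::uint32_t>::max();
}

// Unsigned arithmetic is performed modulo 2^32, so 0 - 1 wraps around
// to the same value instead of becoming negative.
constexpr std::uint32_t wrapped_difference(std::uint32_t a, std::uint32_t b) {
    return a - b;
}

static_assert(default_max_branches() == 4294967295u,
              "largest 32-bit unsigned value is 2^32 - 1");
static_assert(wrapped_difference(0u, 1u) == 4294967295u,
              "0 - 1 wraps around to 2^32 - 1");
```

Both checks hold at compile time, so the printed value alone cannot tell the two cases apart.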

beomki-yeo commented 5 months ago

Yes, I worry that the CKF keeps counting branches forever when "maximum branches per seed" is set to its largest possible value.

krasznaa commented 5 months ago

No. That's really just the default value in our code.

This was a good suggestion, but apparently that's not where the code gets stuck. :frowning:

[bash][Legolas]:traccc > ./out/build/sycl/bin/traccc_seq_example_cuda --detector-file=geometries/odd/odd-detray_geometry_detray.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --digitization-file=geometries/odd/odd-digi-geometric-config.json --input-directory=odd/geant4_ttbar_mu200/ --input-events=1 --input-skip=2 --nmax-per-seed=10

Running Full Tracking Chain Using CUDA

>>> Detector Options <<<
  Detector file       : geometries/odd/odd-detray_geometry_detray.json
  Material file       : 
  Surface rid file    : geometries/odd/odd-detray_surface_grids_detray.json
  Use detray::detector: yes
  Digitization file   : geometries/odd/odd-digi-geometric-config.json
>>> Input Data Options <<<
  Input data format             : csv
  Input directory               : odd/geant4_ttbar_mu200/
  Number of input events        : 1
  Number of input events to skip: 2
>>> Clusterization Options <<<
  Target cells per partition: 1024
>>> Track Seeding Options <<<
  None
>>> Track Finding Options <<<
  Track candidates range   : 3:100
  Minimum step length for the next surface: 0.5 [mm] 
  Maximum step counts for the next surface: 100
  Maximum Chi2             : 30
  Maximum branches per step: 10
  Maximum number of skipped steps per candidates: 3
>>> Track Propagation Options <<<
  Constraint step size  : 3.40282e+38 [mm]
  Overstep tolerance    : -100 [um]
  Minimum mask tolerance: 1e-05 [mm]
  Maximum mask tolerance: 1 [mm]
  Search window         : 0 x 0
  Runge-Kutta tolerance : 0.0001
>>> Performance Measurement Options <<<
  Run performance checks: no
>>> Accelerator Options <<<
  Compare with CPU results: no

WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: @traccc::io::csv::read_cells: 17547 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000002-cells.csv

beomki-yeo commented 5 months ago

Then it might be the KF where the propagation gets stuck. But you really need to let us know where it gets stuck. That is very easily done by printing "Hello" in front of each algorithm.

krasznaa commented 5 months ago

:stuck_out_tongue: Let's not get quite that basic just yet.

It is indeed the fitting. I can at least attach a normal GDB process, even if cuda-gdb is a bit more tricky.

(gdb) attach 192332
Attaching to process 192332
[New LWP 192333]
[New LWP 192337]
[New LWP 192338]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x000077d8f1e67ba9 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
(gdb) bt
#0  0x000077d8f1e67ba9 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#1  0x000077d8f1d80864 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#2  0x000077d8f1e4927a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#3  0x000077d8f20b80a9 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4  0x000077d8f1eee57d in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#5  0x000077d8f3814105 in ?? () from /home/krasznaa/software/nvidia/cuda-12.4.1/x86_64/lib64/libcudart.so.12
#6  0x000077d8f3875018 in cudaStreamSynchronize () from /home/krasznaa/software/nvidia/cuda-12.4.1/x86_64/lib64/libcudart.so.12
#7  0x000077d8f4819d1d in traccc::cuda::stream::synchronize (this=<optimized out>)
    at /data/ssd-1tb/projects/traccc/traccc/device/cuda/src/utils/stream.cpp:62
#8  0x000077d8f482e5cb in _ZNK6traccc4cuda17fitting_algorithmINS_13kalman_fitterIN6detray10rk_stepperIN6covfie10field_viewINS5_7backend8constantINS5_6vector8vector_dIfLm3EEESB_EEEENS3_5cmathIfEENS3_16constrained_stepINS3_6darrayEEENS3_17stepper_rk_policyENS3_8stepping14void_inspectorESH_EENS3_9navigatorIKNS3_8detectorINS3_16default_metadataENS3_15container_typesIN6vecmem13device_vectorENS3_5tupleESH_NSR_20jagged_device_vectorENS3_4dmapEEEEENS3_10navigation14void_inspectorENS3_14intersection2DINS3_18surface_descriptorINS3_6detail11typed_indexINSP_8mask_idsEjjLj4026531840ELj268435455EEENS14_INSP_12material_idsEjjLj4026531840ELj268435455EEEjtEESF_EEEEEEEclERKNS13_18dmulti_view_helperILb1EJNSR_4data11vector_viewINS3_17volume_descriptorINSP_11geo_objectsENS14_INSP_9accel_idsEjjLj4026531840ELj268435455EEES18_EEEENS1G_INS3_11source_linkIS19_EEEENS1G_IN7algebra5cmath10transform3INS1R_6matrix5actorImNS1Q_6matrix10array_typeENS1V_11matrix_typeEfNS1T_11determinant5actorImS1X_fJNS1Y_17partial_pivot_ludImS1X_fNS1R_14element_getterImS1W_fEEJEEENS1Y_10hard_codedImS1X_fS22_JLm2ELm4EEEEEEENS1T_7inverse5actorImS1X_fJNS27_17partial_pivot_ludImS1X_fS22_JEEENS27_10hard_codedImS1X_fS22_JLm2ELm4EEEEEEES22_NS1R_12block_getterImS1W_fEEEEEEEENS1E_ILb1EJNS1G_INS3_4maskINS3_11rectangle2DEtSF_SH_EEEENS1G_INS2J_INS3_11trapezoid2DEtSF_SH_EEEENS1G_INS2J_INS3_9annulus2DEtSF_SH_EEEENS1G_INS2J_INS3_10cylinder2DEtSF_SH_EEEENS1G_INS2J_INS3_21concentric_cylinder2DEtSF_SH_EEEENS1G_INS2J_INS3_6ring2DEtSF_SH_EEEENS1G_INS2J_INS3_4lineILb0EEEtSF_SH_EEEENS1G_INS2J_INS32_ILb1EEEtSF_SH_EEEEEEENS1E_ILb1EJNS1E_ILb1EJNS1G_IjEENS1G_INS3_4bins6singleINS3_13material_slabIfEEEEEENS1G_ISt5arrayIjLm2EEEENS1G_IfEEEEES3L_S3L_S3L_NS1G_IS3E_EENS1G_INS3_12material_rodIfEEEENS1G_INS3_8materialIfSt5ratioILl1ELl1EEEEEES3L_S3L_EEENS1E_ILb1EJNS1E_ILb1EJS3A_NS1G_IS19_EEEEENS1E_ILb1EJS3A_NS1E_ILb1EJNS1G_INS3B_13dynamic_arrayIS19_E4dataEEES3W_EEES3J_S3K_EEES43_S43_S43_EEENS1E_ILb1EJNS1G_INS3C_IjEEEENS1E_ILb1EJS3J_S3K_EEEEEEEEERKSD_R
KNS1F_18jagged_vector_viewIS1A_EERKNS_14container_viewIKNS3_22bound_track_parametersISF_EEKNS_11measurementEEE (this=0x7fff49da5790, det_view=..., field_view=..., navigation_buffer=..., 
    track_candidates_view=...) at /data/ssd-1tb/projects/traccc/traccc/device/cuda/src/fitting/fitting_algorithm.cu:97
#9  0x0000000000451445 in seq_run (detector_opts=..., input_opts=..., clusterization_opts=..., seeding_opts=..., 
    finding_opts=..., propagation_opts=..., performance_opts=..., accelerator_opts=...)
    at /data/ssd-1tb/projects/traccc/traccc/examples/run/cuda/seq_example_cuda.cpp:382
#10 0x000000000045516e in main (argc=<optimized out>, argv=<optimized out>)
    at /data/ssd-1tb/projects/traccc/traccc/examples/run/cuda/seq_example_cuda.cpp:560
(gdb)

beomki-yeo commented 5 months ago

Does this happen only with CUDA, or also on the CPU?

krasznaa commented 5 months ago

Fair question. But the host version does run through.

[bash][Legolas]:traccc > ./out/build/sycl/bin/traccc_seq_example --detector-file=geometries/odd/odd-detray_geometry_detray.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --digitization-file=geometries/odd/odd-digi-geometric-config.json --input-directory=odd/geant4_ttbar_mu200/ --input-events=1 --input-skip=2 --nmax-per-seed=10

Running Full Tracking Chain on the Host

>>> Detector Options <<<
  Detector file       : geometries/odd/odd-detray_geometry_detray.json
  Material file       : 
  Surface rid file    : geometries/odd/odd-detray_surface_grids_detray.json
  Use detray::detector: yes
  Digitization file   : geometries/odd/odd-digi-geometric-config.json
>>> Input Data Options <<<
  Input data format             : csv
  Input directory               : odd/geant4_ttbar_mu200/
  Number of input events        : 1
  Number of input events to skip: 2
>>> Clusterization Options <<<
  Target cells per partition: 1024
>>> Track Seeding Options <<<
  None
>>> Track Finding Options <<<
  Track candidates range   : 3:100
  Minimum step length for the next surface: 0.5 [mm] 
  Maximum step counts for the next surface: 100
  Maximum Chi2             : 30
  Maximum branches per step: 10
  Maximum number of skipped steps per candidates: 3
>>> Track Propagation Options <<<
  Constraint step size  : 3.40282e+38 [mm]
  Overstep tolerance    : -100 [um]
  Minimum mask tolerance: 1e-05 [mm]
  Maximum mask tolerance: 1 [mm]
  Search window         : 0 x 0
  Runge-Kutta tolerance : 0.0001
>>> Track Ambiguity Resolution Options <<<
  Run ambiguity resolution : yes
>>> Performance Measurement Options <<<
  Run performance checks: no

WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: @traccc::io::csv::read_cells: 17547 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000002-cells.csv
==> Statistics ... 
- read     334026 cells from 16709 modules
- created  92925 measurements. 
- created  92925 space points. 
- created  15892 seeds
- found    21332 tracks
- fitted   21332 tracks
- resolved 4089 tracks
==> Elapsed times...
                    Read cells  560 ms
                Clusterization  14 ms
          Spacepoint formation  4 ms
                       Seeding  283 ms
       Track params estimation  2 ms
                 Track finding  2144 ms
                 Track fitting  499 ms
    Track ambiguity resolution  5065 ms
                     Wall time  8575 ms
[bash][Legolas]:traccc >

beomki-yeo commented 5 months ago

The KF is supposed to reproduce tracks that are mathematically identical to the ones found by the CKF; at least that is what I intended. Now I fear that this could be a bug in the CKF that builds patterns wrongly, and that the KF suffers from those wrongly built patterns.

beomki-yeo commented 5 months ago

Oh, maybe the KF is buggy now, since it has not been updated to follow the changes from #556 and #529.

beomki-yeo commented 5 months ago

Also, please make this specific data available somewhere so I can test and debug it myself.

krasznaa commented 5 months ago

Let's start with the important part first. I've put the ODD ttbar files here yesterday: https://cernbox.cern.ch/s/aLswvi2pNcBX9wr Just make sure that you have ~100 GB free space if you want to download it. :frowning: (~24 GB for the TGZ, and ~65 GB for the uncompressed files.) I could upload just the one problematic event as well if you'd like. :thinking:

At the same time: The plot thickens. After building the code with #568 locally fixed, in Debug mode, the bloody thing runs through. :open_mouth:

[bash][Legolas]:traccc > ./out/build/legolas/bin/traccc_seq_example_cuda --detector-file=geometries/odd/odd-detray_geometry_detray.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --digitization-file=geometries/odd/odd-digi-geometric-config.json --input-directory=odd/geant4_ttbar_mu200/ --input-events=1 --input-skip=2 --nmax-per-seed=10

Running Full Tracking Chain Using CUDA

>>> Detector Options <<<
  Detector file       : geometries/odd/odd-detray_geometry_detray.json
  Material file       : 
  Surface rid file    : geometries/odd/odd-detray_surface_grids_detray.json
  Use detray::detector: yes
  Digitization file   : geometries/odd/odd-digi-geometric-config.json
>>> Input Data Options <<<
  Input data format             : csv
  Input directory               : odd/geant4_ttbar_mu200/
  Number of input events        : 1
  Number of input events to skip: 2
>>> Clusterization Options <<<
  Target cells per partition: 1024
>>> Track Seeding Options <<<
  None
>>> Track Finding Options <<<
  Track candidates range   : 3:100
  Minimum step length for the next surface: 0.5 [mm] 
  Maximum step counts for the next surface: 100
  Maximum Chi2             : 30
  Maximum branches per step: 10
  Maximum number of skipped steps per candidates: 3
>>> Track Propagation Options <<<
  Constraint step size  : 3.40282e+38 [mm]
  Overstep tolerance    : -100 [um]
  Minimum mask tolerance: 1e-05 [mm]
  Maximum mask tolerance: 1 [mm]
  Search window         : 0 x 0
  Runge-Kutta tolerance : 0.0001
>>> Performance Measurement Options <<<
  Run performance checks: no
>>> Accelerator Options <<<
  Compare with CPU results: no

WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: @traccc::io::csv::read_cells: 17547 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000002-cells.csv
==> Statistics ... 
- read    334026 cells from 16709 modules
- created (cpu)  0 measurements     
- created (cuda)  92925 measurements     
- created (cpu)  0 spacepoints     
- created (cuda) 92925 spacepoints     
- created  (cpu) 0 seeds
- created (cuda) 15892 seeds
- found (cpu)    0 tracks
- found (cuda)   21309 tracks
- fitted (cpu)   0 tracks
- fitted (cuda)  21309 tracks
==>Elapsed times...
           File reading  (cpu)  1413 ms
         Clusterization (cuda)  94 ms
   Spacepoint formation (cuda)  2 ms
                Seeding (cuda)  146 ms
           Track params (cuda)  5 ms
          Track finding (cuda)  32652 ms
          Track fitting (cuda)  11253 ms
                     Wall time  45587 ms
[bash][Legolas]:traccc >

I'll do some further testing next, but it very much seems to be some sort of race condition, which the slower-running debug binary manages to avoid. :thinking:

krasznaa commented 5 months ago

Unfortunately I'm none the wiser, now that I looked at the code for a bit.

I mean, the propagation code of course can get into an endless loop relatively easily.

https://github.com/acts-project/detray/blob/main/core/include/detray/propagator/propagator.hpp#L138-L180

So I even tried to "activate" the aborter that we have in traccc's fitting code.

https://github.com/acts-project/traccc/blob/main/core/include/traccc/fitting/kalman_filter/kalman_fitter.hpp#L109

Plus added one assertion in the only place where a memory error could relatively easily happen.

diff --git a/core/include/traccc/fitting/kalman_filter/kalman_fitter.hpp b/core/include/traccc/fitting/kalman_filter/kalman_fitter.hpp
index 898e6ef8..565a27ce 100644
--- a/core/include/traccc/fitting/kalman_filter/kalman_fitter.hpp
+++ b/core/include/traccc/fitting/kalman_filter/kalman_fitter.hpp
@@ -106,7 +106,7 @@ class kalman_fitter {
         }

         /// Individual actor states
-        typename aborter::state m_aborter_state{};
+        typename aborter::state m_aborter_state{detray::unit<scalar_type>::m};
         typename transporter::state m_transporter_state{};
         typename interactor::state m_interactor_state{};
         typename fit_actor::state m_fit_actor_state;
@@ -229,6 +229,7 @@ class kalman_fitter {
         auto& track_states = fitter_state.m_fit_actor_state.m_track_states;

         // Fit parameter = smoothed track parameter at the first surface
+        assert(!track_states.empty());
         fit_res.fit_params = track_states[0].smoothed();

         for (const auto& trk_state : track_states) {

But they made no difference for the code. It still goes into an endless loop in optimized mode, and finishes in debug mode. :confused:

Though at least I learned about nvtop along the way. :stuck_out_tongue:

stephenswat commented 5 months ago

https://github.com/acts-project/detray/blob/main/core/include/detray/propagator/propagator.hpp#L138-L180

Have you tried slapping a print statement on this loop to see if it loops forever? Of course if it really is a race condition the additional timing might avoid it, but could be worth a try.

beomki-yeo commented 5 months ago

Aborting on a limited path length won't help when the track is oscillating around the same surface. (I tried the same thing in the CKF and it did not work.) I suggest killing any track that takes more than a certain number of steps: https://github.com/acts-project/traccc/blob/main/core/include/traccc/finding/ckf_aborter.hpp. You can put this condition into kalman_actor, or write a dedicated aborter yourself (e.g. kf_aborter).

But as I said, the infinite loop should not happen at all at least for traccc KF after CKF
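
The step-count idea can be sketched in isolation. The code below is not the detray actor interface and not the contents of ckf_aborter.hpp; `step_count_aborter`, `propagate` and `take_step` are hypothetical names, and the loop is only a stand-in for a heartbeat-style propagation loop. It shows how a fixed step budget terminates a track that would otherwise never stop.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical aborter: allows at most `max_steps` steps for one track.
struct step_count_aborter {
    explicit step_count_aborter(std::size_t max) : max_steps(max) {}

    std::size_t max_steps;
    std::size_t n_steps = 0;
    bool aborted = false;

    // Called before every step. Returns false (and records the abort)
    // once the budget is used up, otherwise counts the step.
    bool operator()() {
        if (n_steps >= max_steps) {
            aborted = true;
            return false;
        }
        ++n_steps;
        return true;
    }
};

// Stand-in for a propagation loop. `take_step` is any callable that
// returns true while the track should keep going.
template <typename step_t>
std::size_t propagate(step_t&& take_step, step_count_aborter& aborter) {
    bool heartbeat = true;
    while (heartbeat) {
        if (!aborter()) {
            break;  // Step budget exhausted: give up on this track.
        }
        heartbeat = take_step();
    }
    return aborter.n_steps;
}
```

A track whose `take_step` always returns true stops after exactly `max_steps` steps with `aborted` set, while a track that finishes on its own leaves `aborted` as false, so the caller can tell the two outcomes apart.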

niermann999 commented 5 months ago

In case it's oscillating around a surface, you could also try to either reduce the overstepping tolerance or increase the minimum step size (is that still checked correctly?). If the stepper cannot increase the step size enough after stepping onto a surface with a mini-step, we might indeed end up oscillating around that surface...

krasznaa commented 5 months ago

It should not happen, but pretty clearly it does. :thinking:

Adding

diff --git a/core/include/detray/propagator/propagator.hpp b/core/include/detray/propagator/propagator.hpp
index e361f4e6..3b47ec4c 100644
--- a/core/include/detray/propagator/propagator.hpp
+++ b/core/include/detray/propagator/propagator.hpp
@@ -151,8 +151,11 @@ struct propagator {
             m_navigator.update(propagation, m_cfg.navigation);

         // Run while there is a heartbeat
+        int i = 0;
         while (propagation._heartbeat) {

+            printf("starting iteration %i\n", i++);
+
             // Take the step
             propagation._heartbeat &=
                 m_stepper.step(propagation, m_cfg.stepping);

to the code, I ended up killing my test job at:

...
starting iteration 671441
starting iteration 671442
starting iteration 671443
starting iteration 671444
starting iteration 671445
starting iteration 671446
^C
[bash][Legolas]:traccc >

It's not actually a "race condition" that I suspect here. :thinking: Since this code is not doing any cooperation between threads that I could see. What I suspect is a floating-point issue. That the CKF for whatever reason propagates one particle just differently enough using "fast math" that the KF doesn't reproduce it.
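
As a generic illustration of why reordered floating-point arithmetic can change a result (this is not a reproduction of the problem in this issue), the sketch below sums the same three single-precision numbers in two different orders. The function names are made up for the example. Optimisation modes of the "fast math" kind are allowed to make such reorderings, so the example should be built without them to see the effect reliably.

```cpp
#include <cassert>

// Sums three floats left to right: (a + b) + c.
float sum_left(float a, float b, float c) {
    return (a + b) + c;
}

// Sums the same three floats right to left: a + (b + c).
float sum_right(float a, float b, float c) {
    return a + (b + c);
}
```

With IEEE-754 single precision, `sum_left(1.0e8f, -1.0e8f, 1.0f)` gives 1 while `sum_right(1.0e8f, -1.0e8f, 1.0f)` gives 0, because 1 is smaller than the spacing between neighbouring floats near 1e8 and is lost when it is added to -1e8 first. A difference of this kind, accumulated over many propagation steps, would be enough to send two otherwise identical calculations down different paths.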

So yeah, I'm very much getting the sense that an aborter, based on the number of iterations, is the way to go here...

beomki-yeo commented 5 months ago

I am well aware that it clearly happens. The context is the following: the solution should come from fixing the bug in the CKF or KF, not from adding a hacky workaround to the source code.

krasznaa commented 5 months ago

I am well aware that it clearly happens. The context is the following: the solution should come from fixing the bug in the CKF or KF, not from adding a hacky workaround to the source code.

I'm very happy to leave you to it. :thinking: Would you like some additional help with getting the cells that produce this behaviour? (I could put the files one-by-one onto EOS for instance.)

beomki-yeo commented 5 months ago

What I suspect is a floating-point issue. That the CKF for whatever reason propagates one particle just differently enough using "fast math" that the KF doesn't reproduce it.

Can the GPU or CUDA generate different results from exactly the same sequence of calculations?

Would you like some additional help with getting the cells that produce this behaviour?

I would appreciate it.

krasznaa commented 5 months ago

Though adding an "aborter" is still something that we should consider. Since this is likely not going to be the last bug in our code. And making the code print a warning and then continue, would be much preferable to getting into endless loops on weird events.

For context: We are just trying to process O(100) events here. In the foreseeable future I hope that we'll be able to go up to processing millions. And at that point we'll need reasonable output about errors, and not just endlessly looping jobs. Since that's exactly what we had with the NSW reconstruction as well... :frowning:
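
On the caller's side, "print a warning and then continue" could look roughly like the sketch below. It assumes that each fit reports a per-track outcome, which is an assumption and not something the current code is shown to do in this thread; `fit_status` and `report_aborted` are hypothetical names, and the message format is only illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical per-track outcome of a fit.
enum class fit_status { ok, aborted };

// Prints one warning per aborted track and returns how many there were,
// so that the job can carry on with the remaining tracks and events.
std::size_t report_aborted(const std::vector<fit_status>& statuses) {
    std::size_t n_aborted = 0;
    for (std::size_t i = 0; i < statuses.size(); ++i) {
        if (statuses[i] == fit_status::aborted) {
            std::fprintf(stderr, "WARNING: fit of track %zu aborted\n", i);
            ++n_aborted;
        }
    }
    return n_aborted;
}
```

Returning the count rather than throwing keeps a single odd track from stopping a job that is meant to process many events, while still leaving a trace in the log.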

beomki-yeo commented 5 months ago

And making the code print a warning and then continue,

Sounds legit.

krasznaa commented 5 months ago

You can find the one problematic event here: https://cernbox.cern.ch/s/jQ3TYzcLX0cAgQz

A "standard setup" of the main branch (with the latest data file downloaded for the ODD geometry files that I've been using), with this one event, should reproduce this endless loop, using the kind of commands that I've posted repeatedly in this issue.

Though I have to admit, I didn't even try it on a different NVIDIA GPU yet, just the one in my desktop. :thinking: So it's not out of the question that a slightly different GPU may not even reproduce the issue.