McStasMcXtrace / McCode

The home of the McStas (neutrons) and McXtrace (x-rays) Monte-Carlo ray-tracing instrument simulation codes.
https://github.com/McStasMcXtrace/McCode/wiki
GNU General Public License v3.0
77 stars 54 forks source link

Interpolation-lib issue on GPU with kdtree + kdtree vs. regular differences #1563

Closed willend closed 1 week ago

willend commented 7 months ago

Something seems off between interp GPU/CPU regular/kdtree - needs further investigation. A first step is introducing both kdtree and regular runs on both CPU and GPU.

CPU:

Screenshot 2024-02-09 at 08 54 18 Screenshot 2024-02-09 at 08 54 32

GPU:

Screenshot 2024-02-09 at 08 53 54 Screenshot 2024-02-09 at 08 54 07
* %Example: -n1e4 MF=flipfield.dat Detector: pol_6_I=1.0
* %Example: -n1e4 MF=constfield.dat Detector: polx_6_I=0.05
*
* %Parameters
* MF:              [string] Field definition file
* zdepth:               [m] Z-dimension of field
* interpol_method: [string] Choice of interpolation method "kdtree" (default on CPU) / "regular" (default on GPU)
sq-meng commented 7 months ago

edit: below is for flipfield.dat, still looking at constfield.dat... I looked at the test, the test field has Z steps at -0.5, -0.25, -0.025, 0.025, 0.25, 0.5 meters, which is not constant_step. The code should have rejected regular interpolation and used kdtree at this point, which it didn't. https://github.com/McStasMcXtrace/McCode/blob/be6ee82680373ea8f69bdade8913bc1b99d8be6d/common/lib/share/interpolation-lib.c#L411-L421 I think there are two problems:

  1. Every table line gets checked against interpolator->step[dim] instead of every unique value. This should have caused the code to fall back to kdtree even on regular data, however:
  2. Line 420, the fallback only triggers if interpolator->method is NULL or 0, which isn't the case if 'regular' is provided. If the data was forced to run on regular mode, the code would have assumed every z step to be 0.25 m wide, which causes the field to be no longer symmetric about z=0 and thus the wavelength dependent precession, instead of both halves cancelling out.
sq-meng commented 7 months ago

constfield.dat only has 12 data points, for regular interpolation there should be 3 x 3 x 3 = 27 points. Unpopulated grid cells remain at their as-allocated state and are invalid, causing invalid results. The two methods agree after filling out the missing values as such:

constfield.dat
# 12
-0.1 0.0 -0.5   0.0 0.0 -6.8e-05
0.1 0.0 -0.5    0.0 0.0 -6.8e-05
0.0 -0.1 -0.5   0.0 0.0 -6.8e-05
0.0 0.1 -0.5    0.0 0.0 -6.8e-05
-0.1 0.0 0.0   0.0 0.0 -6.8e-05
0.1 0.0 0.0    0.0 0.0 -6.8e-05
0.0 -0.1 0.0   0.0 0.0 -6.8e-05
0.0 0.1 0.0    0.0 0.0 -6.8e-05
-0.1 0.0 0.5   0.0 0.0 -6.8e-05
0.1 0.0 0.5    0.0 0.0 -6.8e-05
0.0 -0.1 0.5   0.0 0.0 -6.8e-05
0.0 0.1 0.5    0.0 0.0 -6.8e-05

0.0 0.0 -0.5 0.0 0.0 -6.8e-05
-0.1 -0.1 -0.5 0.0 0.0 -6.8e-05
-0.1 0.1 -0.5 0.0 0.0 -6.8e-05
0.1 -0.1 -0.5 0.0 0.0 -6.8e-05
0.1 0.1 -0.5 0.0 0.0 -6.8e-05

0.0 0.0 0.5 0.0 0.0 -6.8e-05
-0.1 -0.1 0.5 0.0 0.0 -6.8e-05
-0.1 0.1 0.5 0.0 0.0 -6.8e-05
0.1 -0.1 0.5 0.0 0.0 -6.8e-05
0.1 0.1 0.5 0.0 0.0 -6.8e-05

0.0 0.0 -0.0 0.0 0.0 -6.8e-05
-0.1 -0.1 -0.0 0.0 0.0 -6.8e-05
-0.1 0.1 -0.0 0.0 0.0 -6.8e-05
0.1 -0.1 -0.0 0.0 0.0 -6.8e-05
0.1 0.1 -0.0 0.0 0.0 -6.8e-05
willend commented 7 months ago

It all makes sense and I believe you are absolutely right in your analysis, and a supplemental simulation dataset point in the same direction:

In http://tmp.mcstas.org/oneoff_2024-02-09 CPU and GPU seem to nicely agree. Tests run are now these, explicitly setting method to 'kdtree' and 'regular':

* %Example: -n1e4 interpol_method=kdtree MF=flipfield.dat Detector: pol_6_I=1.0
* %Example: -n1e4 interpol_method=kdtree MF=constfield.dat Detector: polx_6_I=0.05
* %Example: -n1e4 interpol_method=regular MF=flipfield.dat Detector: pol_6_I=1.0
* %Example: -n1e4 interpol_method=regular MF=constfield.dat Detector: polx_6_I=0.05

(Full dataset can be downloaded here http://tmp.mcstas.org/oneoff_2024-02-09.tgz ~ 2Mb for 4 tests on each "system")

As I remember, the "default" interpolator is in fact set to "something" on both CPU (kdtree) and GPU (regular) which goes against the logic of the above code that only switches if "NULL" was set.

Edit:


Yup, indeed the component has :

  if (!strcmp(interpol_method,"default")) {
    #ifdef OPENACC
    sprintf(interpol_method,"regular");
    #else
    sprintf(interpol_method,"kdtree");
    #endif
  }

I should probably change that to "assume no default" / let things work adaptively, but again I seem to remember some specific issues in the GPU port on this point... How about you @ebknudsen ?

willend commented 7 months ago

CPU run overview-plots, same order as in the instrument file:

Screenshot 2024-02-09 at 10 40 59 Screenshot 2024-02-09 at 10 41 07 Screenshot 2024-02-09 at 10 41 16 Screenshot 2024-02-09 at 10 41 23

GPU run overview-plots, same order as in the instrument file:

Screenshot 2024-02-09 at 10 41 37 Screenshot 2024-02-09 at 10 41 47 Screenshot 2024-02-09 at 10 41 55 Screenshot 2024-02-09 at 10 42 02
willend commented 7 months ago

I am now collecting old datasets from historical tests to check if the flipping actually ever worked on GPU...

(Unfortunately we've lost the nightly test system since ~ January 1st, but I believe it is safe to assume that perhaps some minor detail in the recent edits is not thread-safe... :-) I will get around to digging... But likely not until Feb 19th or so, still recovering from flu + kid is off on winter break next week.)


Edit: diff between generated code from then and now - differences are extremely small... Can't immediately spot what the issue is, but I am sure that once my brain is less "cloudy" I will nail it. :-)

4,5c4,5
<  * Instrument: /home/nexmap/pkwi/TESTS/2023-12-20/McStas_8GPU_5e6/Test_Pol_Tabled/Test_Pol_Tabled.instr (Test_Pol_Tabled)
<  * Date:       Wed Dec 20 02:58:58 2023
---
>  * Instrument: /home/nexmap/pkwi/WORK/oneoff_2024-02-09/20240209_0911_58/3.x-dev/Test_Pol_Tabled/Test_Pol_Tabled.instr (Test_Pol_Tabled)
>  * Date:       Fri Feb  9 09:12:01 2024
13c13
< #define MCCODE_STRING "McStas 3.x-dev - Dec. 19, 2023"
---
> #define MCCODE_STRING "McStas 3.x-dev - Feb. 08, 2024"
53d52
<   long long _loopid; /* inner-loop event ID */
324c323
< #  define MCCODE_STRING "McStas 3.x-dev - Dec. 19, 2023"
---
> #  define MCCODE_STRING "McStas 3.x-dev - Feb. 08, 2024"
328c327
< #  define MCCODE_DATE "Dec. 19, 2023"
---
> #  define MCCODE_DATE "Feb. 08, 2024"
6238c6237
< char  instrument_source[] = "/home/nexmap/pkwi/TESTS/2023-12-20/McStas_8GPU_5e6/Test_Pol_Tabled/Test_Pol_Tabled.instr";
---
> char  instrument_source[] = "/home/nexmap/pkwi/WORK/oneoff_2024-02-09/20240209_0911_58/3.x-dev/Test_Pol_Tabled/Test_Pol_Tabled.instr";
6240c6239
< char  instrument_code[]   = "Instrument Test_Pol_Tabled source code /home/nexmap/pkwi/TESTS/2023-12-20/McStas_8GPU_5e6/Test_Pol_Tabled/Test_Pol_Tabled.instr is not embedded in this executable.\n  Use --source option when running McStas.\n";
---
> char  instrument_code[]   = "Instrument Test_Pol_Tabled source code /home/nexmap/pkwi/WORK/oneoff_2024-02-09/20240209_0911_58/3.x-dev/Test_Pol_Tabled/Test_Pol_Tabled.instr is not embedded in this executable.\n  Use --source option when running McStas.\n";
8147c8146
<   int treek = tree->point->v[k];
---
>   double treek = tree->point->v[k];
8183c8182,8193
<   return ( *(double*)a - *(double*)b );
---
>   if (*(double*)a > *(double*)b)
>   {
>     return 1;
>   }
>   else if (*(double*)a < *(double*)b)
>   {
>     return -1;
>   }
>   else
>   {
>     return 0;
>   }
8307a8318
>     interpolator->bin[dim] = 1;
8344a8356
>     free(vector);
8383c8395
<           indices[axis] = floor((x - interpolator->min[axis])/interpolator->step[axis]);
---
>           indices[axis] = round((x - interpolator->min[axis])/interpolator->step[axis]);
8484c8496
<       indices[axis] = (space[axis]-interpolator->min[axis])/interpolator->step[axis];
---
>       indices[axis] = round((space[axis]-interpolator->min[axis])/interpolator->step[axis]);
10115c10127
<     fprintf(stdout, "%d ", (int)(ncount*100.0/mcget_ncount())); fflush(stdout);
---
>     fprintf(stdout, "%llu ", (unsigned long long)(ncount*100.0/mcget_ncount())); fflush(stdout);
10688c10700
<   #define ABSORB0 do { DEBUG_ABSORB(); MAGNET_OFF; ABSORBED++; } while(0)
---
>   #define ABSORB0 do { DEBUG_ABSORB(); MAGNET_OFF; ABSORBED++; return(ABSORBED);} while(0)
11001c11013
<       particleN._uid = particleN._loopid = pidx;
---
>       particleN._uid = pidx;
11074c11086
<       _particle->_uid = _particle->_loopid = pidx;
---
>       _particle->_uid = pidx;
sq-meng commented 7 months ago

@willend The new usage of round() could change results. In the constant field case, x and y data are given at -0.1, 0.0 and 0.1, and the polarization detector is from -0.05 to 0.05 in both x and y axes. If round() is used, all counted neutrons would fall inside the [-0.05, 0.05] bin, which is not populated in the original 12-line field file. But if the numbers are not rounded, half of the neutrons would fall into populated bins.

willend commented 7 months ago

@sq-meng something is still iffy somewhere: Using kdtree does not yield the same results as regular in the case of the constant field - neither on CPU or GPU.

With the new file in place things look similar on GPU and CPU with "regular" interpolation - but with also with a different spin oscillation periodicity than earlier.

Regular interpolation, CPU: Screenshot 2024-02-09 at 12 51 04

Regular interpolation, GPU: Screenshot 2024-02-09 at 12 51 23

kdtree, CPU: Screenshot 2024-02-09 at 12 51 43

kdtree, GPU: Screenshot 2024-02-09 at 12 51 57


Edit: AND in fact, reverting changes to the qsort comparator function seems to partially rectify this?

Patched, CPU, regular: Screenshot 2024-02-09 at 12 58 55

Patched, CPU, kdtree: Screenshot 2024-02-09 at 12 59 09

Patched, GPU, regular: Screenshot 2024-02-09 at 12 56 28

GPU + kdtree simply doesn't work, and I suspect it might have never really...

willend commented 7 months ago

So in conclusion, at the moment:

Not judging what the "output should exactly be", given the simple constant-field datafile it should at least give the same irrespective of interpolation method - at least if we for now exclude the somewhat "experimental" GPU platform... 🚧 (BUT: We should probably go back with pen+paper and look at a "hand-held" propagation of neutrons of the relevant wavelengths through the relevant field to decide from a physics point of view...)

I think this indicates that either the comparator-function needs a revert / another overhaul - or at least something further needs doing on the kdtree side, also on CPU... Agreed @sq-meng ?

sq-meng commented 7 months ago

@willend I agree - something is still iffy. And to make things worse, I was not able to get the longer period oscillations on my machine with either kdtree or regularπŸ˜‚. This definitely still needs more work.

sq-meng commented 7 months ago

@willend Could you check if your constant field file has 27 lines? My original reply containing the field file missed last 5 lines and I edited to add these soon after. But if you got the unedited version, the z=0.0 plane would have remained unpopulated, explaining the longer period.

willend commented 7 months ago

Right! I was missing the last block! I will rerun and see where this takes us on my side...

willend commented 7 months ago

And you were right this updates the situation to:

CPU regular: Screenshot 2024-02-09 at 14 20 28

CPU kdtree: Screenshot 2024-02-09 at 14 20 40

GPU regular: Screenshot 2024-02-09 at 14 20 26

Flipping the "CPU default" to "NULL" should make it adaptive - and keeping the GPU side to "regular" unless explicitly chosen otherwise makes sense for now.

willend commented 7 months ago

One of the typical things that could give this type of difference is if some variable initially in the COMPONENT scope (i.e. the Table object) gets modified during TRACE - thus leading to memory/data overlap between the GPU threads... But I didn't spot that up to now...

willend commented 7 months ago

Actual a VERY obvious candidate is the "R_SWAP"... macro.

I am afraid every neutron will either need its own copy of the interpolator - or a function with one or more #pragma acc atomic claused need writing... (Then the global interpolator gets updated via "locks" - only access by one thread at a time...)

sq-meng commented 7 months ago

Can't comment on that as I have zero knowledge on GPU programming... But I reckon the tons of branches in kd-tree won't perform on GPU anyway. Glad we at least have the result mismatch situation closed!

willend commented 7 months ago

I think my choice will in fact be to effectively disable "kdtree" on GPU. Exit with a warning if anything other than "default" / "regular" is chosen.

The other stuff will either become a) very memory demanding, i.e. an "interpolator" copy pr. GPU thread or b) Slow due to use of #pragma acc atomics (i.e. mutex'es that lock memory on acces to allow one single thread at a time...)

willend commented 7 months ago

Heh, interestingly the size of flipfield.dat is not what it claims to be...

wc -l flipfield.dat 
151 flipfield.dat
(mcstas-3.x-dev/miniconda3) data $ head -2 flipfield.dat 
# 300
-0.1 -0.1 -0.5 0.0 0.0 1.0e-05

But I am not sure the # 300 header has any practical meaning in this particular case....

willend commented 7 months ago

Ha! The remaining issue with the flipfield.dat file is apparently one of "irregular" array dimensions! Apparently there is an underlying assumption not only that the bins along each of the dimensions are uniform length, but also that Nx == Ny == Nz:

The file is 5x5x6 naturally which (forcing regular - this occurs both on CPU and GPU) gives this output in the flip-case: Screenshot 2024-02-09 at 18 26 00

I have now written up a short Matlab (yes I know, better do it in Python ;-)) script to resample a field file: remesh.m.txt

Using this I was able to resample to e.g. 6x6x6: remeshed_6x6x6_flipfield.dat.txt

Which now performs correctly in both the CPU and GPU cases with regular interpolation!

CPU: Screenshot 2024-02-09 at 18 41 27

GPU: Screenshot 2024-02-09 at 18 39 22

I will simply replace the file with the remeshed one (and as I assumed, the header is not used for anything...)

I believe all that remains is either documenting these 'fragile' aspects or even exit fatal in the case the file properties are not "actually, fully regular" when requesting this method (meaning when forcing regular on CPU or "always" on GPU...

willend commented 7 months ago

@sq-meng thanks for the ping-pong! :) I believe we have arrived at a good solution, so marking "ready for release". (This is my hack to help when I write a CHANGELOG on releasing.)

Please let me know if you find anything further that is off - and have a nice weekend!

willend commented 7 months ago

@farhi this stuff is likely also important in several places in McXtrace, do try it out when you feel like it. In essence:

willend commented 7 months ago

https://github.com/McStasMcXtrace/McCode/blob/main/mcstas-comps/examples/Tests_polarization/Test_Pol_Tabled/ now has a couple of remeshing-tools and the input files alongside the instrument.

sq-meng commented 7 months ago

@willend Looks mostly good to me! I don't believe that regular interpolation requires Nx == Ny == Nz though: Nsteps and step sizes can safely vary between axes, and the only requirement is that step size within each axis is consistent. The original flipfield.dat has z-axis steps at -0.5, -0.25, -0.025, 0.025, 0.25, 0.5, the middle 2 numbers break the 0.25 step size. Attached is a 5x5x10 flip field which works fine here: 5-5-10_flipfield.zip

willend commented 7 months ago

Ah, of course! I have simply overlooked the non-consistent axes bins...
(.... But of course resampling to a regular 6x6x6 worked :-) )

I will remodel the part now insisting on Nx == Ny == Nz to instead become a warning/error if regular and inconsistent axis.(I feel leaving a solution in place that interpolates wrongly is bad customer service. )

sq-meng commented 7 months ago

Sure. Silently producing slightly wrong results is the worst behavior for scientific software so a warning against unexpected data is necessary.

willend commented 7 months ago

Ended up with an actual exit(-1), which I think is for the better (here run with the original flipfield.dat):

(mcstas-3.x-dev/miniconda3) Test_Pol_Tabled-hack $ mcrun Test_Pol_Tabled.instr  MF=flipfield.dat interpol_method=regular -n1e4 
[Test_Pol_Tabled] Initialize
INFO: Set_pol(setpol): Setting polarization to (0.000000, 1.000000, 0.000000)
Opening input file 'flipfield.dat' (Table_Read_Offset)
interpolator_load: Axis 0: step=0.05, unique values=5, from file 'flipfield.dat'.
interpolator_load: Axis 1: step=0.05, unique values=5, from file 'flipfield.dat'.

!! interpolation-lib ERROR: !!
   You are running the 'regular' interpolation scheme with a file of
   non-consistent axis 'binning' along one or more axes.
   This combination is not possible.
   Please either resample the file to a regular grid or run with 'kdtree'
   (NB: kdtree is available on CPU only)

Using default reverts automatically to kdtree and lets the simulation run.

willend commented 7 months ago

BTW I also spotted yet another annoyance in the regular -> kdtree switch case logic: https://github.com/McStasMcXtrace/McCode/blob/195d18aa0bed5b06958c5923a58149aafda44ee7/common/lib/share/interpolation-lib.c#L434-L462

An if (this_step was actually missing - so even on actual regular grids, the non-unique axis-positions from the "meshing" would trigger a switch to kdtree :-)

All in all a good solution in place for all cases by now I believe. I thank you for contributing to this little exercise of bringing a useful but admittedly unpolished gem into much better order!