OPM / opm-simulators

OPM Flow and experimental simulators, including components such as well models etc.
http://www.opm-project.org
GNU General Public License v3.0

[OpenCL] Number of linear iterations fluctuate between multiple run of same model #2755

Open blattms opened 4 years ago

blattms commented 4 years ago

Running opm-tests/model1/BASE1_MSW_HFA multiple times, the number of linear iterations is not the same for every run.

GitPaean commented 4 years ago

I tried both parallel and serial runs, but I could not reproduce your problem.

Other people have also complained about some random behavior; unfortunately, I could not reproduce that either.

Maybe some component in the linear solution setup introduces randomness into the solution procedure?

blattms commented 4 years ago

I probably should have stated this in the text as well (not just as a tag in the title).

The problem is only in the OpenCL code (if you explicitly ask for it with --gpu-mode=opencl).

GitPaean commented 4 years ago

Okay. I did not notice the OpenCL in the title.

I was thinking it would be good to be able to reproduce the randomness reported by some colleagues and find out why it happens.

blattms commented 4 years ago

Reproducing is always highly appreciated. Thanks.

Maybe @ducbueno wants to do that?

blattms commented 4 years ago

This problem persists now that both standard wells and multi-segment wells are implemented. I get a fluctuating number of iterations for SPE9 with current master (standard wells only) and with #2821 (re-adds multi-segment wells). It happens both with --matrix-add-well-contributions=true and false. For cusparse there are no fluctuations. I guess we need to check the ILU and BiCGStab implementations.

ducbueno commented 4 years ago

I believe the fluctuations are due to the OpenCL BiCGStab implementation always using the zero vector as the initial guess. There may be some other issues with the ILU, but I haven't checked that yet.

Tongdongq commented 4 years ago

The cusparseSolver also uses a zero vector as the initial guess, via cudaMemsetAsync(d_x, 0, sizeof(double) * N, stream); in update_system_on_gpu() and copy_system_to_gpu().
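For comparison, a zero initial guess on the OpenCL side would look roughly like the sketch below (OpenCL C++ bindings; the function and variable names are placeholders, not the actual backend code). Since a constant zero start vector is fully deterministic, it cannot by itself explain run-to-run differences.

```cpp
#include <CL/cl2.hpp>
#include <cstddef>

// Sketch only: zero the initial guess x on the device, analogous to the
// cudaMemsetAsync call used by the cusparse backend.
void zeroInitialGuess(cl::CommandQueue& queue, cl::Buffer& d_x, std::size_t N)
{
    const double zero = 0.0;
    // Fill the first N doubles of d_x with 0.0.
    queue.enqueueFillBuffer(d_x, zero, 0, sizeof(double) * N);
}
```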

Tongdongq commented 4 years ago

@blattms @ducbueno I just remembered that the default OpenCL coloring strategy is set to GRAPH_COLORING, as opposed to LEVEL_SCHEDULING, in openclSolverBackend.cpp. This is generally faster, since it respects dependencies less strictly. That results in more parallelism, but also in more linear iterations (up to 2x is normal). What's more, the graph coloring strategy has random output. We could patch this by replacing https://github.com/OPM/opm-simulators/blob/ac3004da9deaf8841eade3c4394f5f5c6cffb95b/opm/simulators/linalg/bda/Reorder.cpp#L60 with std::mt19937 gen(constantSeed); This should solve the randomness issue, but the number of iterations will still be significantly higher than with LEVEL_SCHEDULING (cusparse and Dune also use level scheduling).
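For illustration, a minimal sketch of the proposed fix (the helper and the seed constant are placeholders; the real change would be on the linked line of Reorder.cpp):

```cpp
#include <random>

// Sketch only: seeding the Mersenne Twister with a constant instead of
// std::random_device makes the graph-colouring tie-breaking reproducible,
// so repeated runs see the same colouring and the same iteration counts.
std::mt19937 makeColoringRng(bool reproducible)
{
    constexpr unsigned int constantSeed = 0x5daefdedu;  // any fixed value works
    if (reproducible) {
        return std::mt19937(constantSeed);
    }
    std::random_device rd;   // current behaviour: a different colouring every run
    return std::mt19937(rd());
}
```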

blattms commented 4 years ago

Thanks for the clarification/explanation. I am not sure which way forward is best here. If we want to change it, someone should test it.

At least we should document somewhere prominent that there is randomness and that the iteration numbers will differ between runs. Maybe in the code and in a file doc/READMES_GPU.txt? Other suggestions are welcome.

ducbueno commented 4 years ago

I can run some tests with the level scheduling reordering and also with the fixed random seed on the graph coloring.

It may take a little while before I report back, since today is a holiday in Brazil and I'll be away from the computer.

Tongdongq commented 4 years ago

We could remove the randomness altogether, but then finding a suitable coloring is not guaranteed. If it fails, it should retry with a different seed. I'll run some tests too.
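A rough sketch of the retry idea (the tryColoring callable stands in for the real routine in Reorder.cpp; names and signature are made up):

```cpp
#include <functional>
#include <optional>
#include <random>
#include <stdexcept>
#include <vector>

// Sketch only: deterministic first attempt, then deterministic fallback seeds,
// so the result stays reproducible while a failed colouring is still retried.
std::vector<int> colorWithRetries(
    const std::function<std::optional<std::vector<int>>(std::mt19937&)>& tryColoring,
    unsigned int firstSeed,
    int maxAttempts)
{
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        std::mt19937 gen(firstSeed + static_cast<unsigned int>(attempt));
        if (auto colors = tryColoring(gen)) {
            return *colors;   // valid colouring found with this seed
        }
    }
    throw std::runtime_error("graph colouring failed for all seeds");
}
```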

Tongdongq commented 4 years ago

With masters of 12 Oct 2020, 09:00. System: AMD Ryzen 5 2400G, NVIDIA GTX 1050 Ti, CentOS 7, gcc 7.3.1, OpenCL 1.2, CUDA 11.1.70, driver 455.23.05.

Results for opm-tests below. LS: LEVEL_SCHEDULING, GC: GRAPH_COLORING, GC seed: 0x5daefded. The number of Linear Iterations for GC is constant now (with my local edit).

| model1/BASE1_MSW_HFA | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 0.13 | 1.23 | 0.35 | 0.35 |
| num Linearizations | 21 | 21 | 21 | 21 |
| num Newton Iterations | 14 | 14 | 14 | 14 |
| num Linear Iterations | 58 | 53 | 56 | 137 |

| model1/BASE1_MSW_HFA STDW | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 0.06 | 0.55 | 0.47 | 0.64 |
| num Linearizations | 21 | 21 | 61 | 62 |
| num Newton Iterations | 14 | 14 | 54 | 55 |
| num Linear Iterations | 56 | 47 | 192 | 511 |

| model1/BASE1_MSW_HFA STDW in matrix | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 0.06 | 0.52 | 0.22 | 0.29 |
| num Linearizations | 21 | 21 | 21 | 21 |
| num Newton Iterations | 14 | 14 | 14 | 14 |
| num Linear Iterations | 54 | 45 | 54 | 132 |

| model1/BASE2_MSW_HFA | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 0.26 | 1.05 | 5.8 | 7.28 |
| num Linearizations | 53 | 53 | 390 | 380 |
| num Newton Iterations | 38 | 38 | 369 | 359 |
| num Linear Iterations | 273 | 253 | 4605 | 7794 |

| model1/BASE2_MSW_HFA STDW | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 0.20 | 0.76 | 2.81 | 3.41 |
| num Linearizations | 50 | 51 | 373 | 347 |
| num Newton Iterations | 35 | 36 | 353 | 328 |
| num Linear Iterations | 253 | 234 | 2476 | 4560 |

| model1/BASE2_MSW_HFA STDW in matrix | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 0.20 | 1.28 | 0.54 | 0.57 |
| num Linearizations | 55 | 56 | 50 | 53 |
| num Newton Iterations | 40 | 41 | 35 | 38 |
| num Linear Iterations | 250 | 247 | 274 | 654 |

| model1/BASE3_MSW_HFA | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 0.12 | 0.64 | 0.34 | 0.40 |
| num Linearizations | 24 | 24 | 24 | 24 |
| num Newton Iterations | 16 | 16 | 16 | 16 |
| num Linear Iterations | 100 | 91 | 95 | 233 |


| model2/0_BASE_MODEL2.DATA | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 6.97 | 11.75 | 23.29 | 19.03 |
| num Linearizations | 486 | 463 | 601 | 597 |
| num Newton Iterations | 433 | 410 | 545 | 541 |
| num Linear Iterations | 5767 | 5227 | 9117 | 15218 |

| model2/0_BASE_MODEL2.DATA STDW | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 6.68 | 11.15 | 21.88 | 19.19 |
| num Linearizations | 486 | 463 | 601 | 597 |
| num Newton Iterations | 433 | 410 | 545 | 541 |
| num Linear Iterations | 5767 | 5227 | 9117 | 15218 |

| model2/0_BASE_MODEL2.DATA STDW in matrix | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 6.41 | 11.09 | 14.88 | 13.24 |
| num Linearizations | 501 | 458 | 463 | 478 |
| num Newton Iterations | 447 | 405 | 410 | 424 |
| num Linear Iterations | 5669 | 5040 | 5917 | 10636 |

For norne/NORNE_ATW2013_1A_STDW.DATA, opencl LS gets stuck at Report step 21, opencl GC at Report step 17, and cusparse at Report step 24. Dune is fine and takes 6 minutes. For norne/NORNE_ATW2013_1A_MSW.DATA, opencl LS gets stuck at Report step 60 after 80 minutes, opencl GC at Report step 39, and cusparse at Report step 62 after 60 minutes. Dune finishes but takes 82 minutes with 26860 Linearizations and 139483 Linear Iterations. That is much higher than I expected. Apparently it had to shut down 7 wells.

blattms commented 4 years ago

What does "get stuck" mean? chops the time step until it gives ups and throws an exception?

Tongdongq commented 4 years ago

Some runs I killed myself; some quit without a thrown exception after a few subsequent messages like

Problem: Solver convergence failure - Iteration limit reached
Timestep chopped to x.xxx days

I'll rerun some tests to be more precise. There are some wells that are shut down because they fail to converge. Is this expected for Dune?

blattms commented 4 years ago

Thanks for the numbers. Would you add the total time to the tables, please?

To sum this up:

- For the big discrepancies, the number of Newton steps increases drastically, too. Is that due to more time step chopping?
- @Tongdongq, as you probably have the best overview: would it be possible to summarize the known differences between the cusparse and OpenCL implementations? Maybe there is some striking difference.
- These are all multisegment well problems. Do we see the same behavior for standard wells (e.g. with the --use-multisegment-well=true parameter for the same models), too? If that is the case, how does it additionally behave with --use-multisegment-well=true --matrix-add-well-contributions=true? If we see it there, we could write out a problematic linear system, try to solve it with cusparse, and see how it behaves.

Tongdongq commented 4 years ago

Only the GRAPH_COLORING gives randomness (if we don't choose a constant seed). The increased number of Newton steps suggests to me that the openclSolver does not provide the same 'quality' of solution, which means the outer loop has to do more iterations to reach its required level of convergence.

One of the biggest differences is the reordering. The ILU decomposition in cusparse is handled by the library. For OpenCL it is done manually on the CPU and includes reordering of the rows. Although when I added the openclSolver to the masters of May, it did converge in reasonable time.
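To make the reordering difference concrete, here is a minimal level-scheduling sketch for a lower-triangular factor in CSR format (illustrative only, not the OPM implementation):

```cpp
#include <algorithm>
#include <vector>

// Sketch only: row i of a triangular solve can only be processed after all
// rows j < i it depends on, so its level is one more than the maximum level
// of those rows.  All rows within one level can then be solved in parallel.
std::vector<int> levelSchedule(const std::vector<int>& rowPtr,
                               const std::vector<int>& colIdx)
{
    const int n = static_cast<int>(rowPtr.size()) - 1;
    std::vector<int> level(n, 0);
    for (int i = 0; i < n; ++i) {
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k) {
            const int j = colIdx[k];
            if (j < i) {   // strictly-lower entry, i.e. a dependency
                level[i] = std::max(level[i], level[j] + 1);
            }
        }
    }
    return level;          // rows are then grouped (reordered) by level
}
```

Graph colouring respects these dependencies less strictly, which yields fewer, larger parallel groups but a less exact preconditioner apply, consistent with the roughly 2x higher iteration counts reported for GC above.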

Do you mean --use-multisegment-well=false to run those models with standard wells?

blattms commented 4 years ago

Do you mean --use-multisegment-well=false to run those models with standard wells?

Yes, that was what I meant.

ducbueno commented 4 years ago

Just reporting some results I got running the tests on my machine (with an Intel HD Graphics 620).

| model1/BASE1_MSW_HFA | Dune | opencl LS | opencl GC |
| --- | --- | --- | --- |
| Total time (s) | 0.94 | 1.45 | 1.59 |
| num linearizations | 21 | 21 | 21 |
| num newton | 14 | 14 | 14 |
| num linear iterations | 58 | 56 | 132 |

| model1/BASE2_MSW_HFA | Dune | opencl LS | opencl GC |
| --- | --- | --- | --- |
| Total time (s) | 2.88 | 3.31 | 4.25 |
| num linearizations | 54 | 51 | 51 |
| num newton | 39 | 36 | 36 |
| num linear iterations | 269 | 268 | 597 |

| model1/BASE3_MSW_HFA | Dune | opencl LS | opencl GC |
| --- | --- | --- | --- |
| Total time (s) | 1.09 | 1.59 | 1.91 |
| num linearizations | 24 | 24 | 24 |
| num newton | 16 | 16 | 16 |
| num linear iterations | 100 | 95 | 222 |

| model2/0_BASE_MODEL2 | Dune | opencl LS | opencl GC |
| --- | --- | --- | --- |
| Total time (s) | 298.16 | 119.5 | 100.89 |
| num linearizations | 486 | 418 | 479 |
| num newton | 433 | 366 | 426 |
| num linear iterations | 5767 | 6680 | 10241 |

I didn't get crazy linear iteration numbers with BASE2_MSW_HFA. I'm also running Norne, and it doesn't seem to break on my system (with OpenCL) either. I'll upload my Norne results soon.

blattms commented 4 years ago

Great.

Hence it might depend on the hardware / OpenCL version etc. @Tongdongq, what hardware was that?

Please add the information reported via "Platform version" and "CL_DEVICE_VERSION".

Tongdongq commented 4 years ago

It looks like standard wells applied separately cause problems on my machine. When putting them in the matrix, the number of iterations for OpenCL (LS/GC) is normal. This also happened in a quick test with model1/BASE3_MSW_HFA.

ducbueno commented 4 years ago

Norne results with Intel HD Graphics 620.

| norne/NORNE_ATW2013 | Dune | opencl LS | opencl GC |
| --- | --- | --- | --- |
| Total time (s) | 20297.07 | 6065.88 | 6919.48 |
| num linearizations | 1806 | 1838 | 2362 |
| num newton | 1470 | 1502 | 2021 |
| num linear iterations | 22509 | 23582 | 45148 |

I still haven't had time to simulate Norne with the multisegment wells. As soon as I have something, I'll post it.

atgeirr commented 4 years ago

I find those timings for Norne strange: on my computer I can run Norne in less than 10 minutes in a serial run, and down to about 195 seconds with 8 processes (one thread per process). This seems a lot slower, also for the Dune version, which I assume is the normal, default CPU-based solver?

blattms commented 4 years ago

Well, run times might depend on the machine used (memory speed, etc). Maybe not everybody has a powerful machine at home. I am running this on my machine and will report back (with OpenCL and graph coloring the iterations are roughly in the same ballpark, but on my system it is 10x faster).

bska commented 4 years ago

Well, run times might depend on the machine used (memory speed, etc). Maybe not everybody has a powerful machine at home.

Obviously, but even on my 2014 vintage laptop I'm able to run a sequential simulation of the base NORNE_ATW2013 case in about 800 seconds. That's roughly 25x faster than what's being reported here.

blattms commented 4 years ago

That is true, of course. Maybe the build is not optimized. @ducbueno, can you check how you built OPM? grep CXX_FLAGS CMakeCache.txt; grep BUILD_TYPE CMakeCache.txt

ducbueno commented 4 years ago

Result from grep CXX_FLAGS CMakeCache.txt:

CMAKE_CXX_FLAGS:STRING=-pipe -Wall -Wextra -Wshadow  -pthread -fopenmp
CMAKE_CXX_FLAGS_DEBUG:STRING=-g -O0 -DDEBUG
CMAKE_CXX_FLAGS_MINSIZEREL:STRING=-Os -DNDEBUG -O3 -mtune=native
CMAKE_CXX_FLAGS_RELEASE:STRING=-O3 -DNDEBUG -mtune=native
CMAKE_CXX_FLAGS_RELWITHDEBINFO:STRING=-O2 -g -DNDEBUG -O3 -mtune=native
OpenMP_CXX_FLAGS:STRING=-fopenmp
//ADVANCED property for variable: CMAKE_CXX_FLAGS
CMAKE_CXX_FLAGS-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CXX_FLAGS_DEBUG
CMAKE_CXX_FLAGS_DEBUG-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CXX_FLAGS_MINSIZEREL
CMAKE_CXX_FLAGS_MINSIZEREL-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CXX_FLAGS_RELEASE
CMAKE_CXX_FLAGS_RELEASE-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CXX_FLAGS_RELWITHDEBINFO
CMAKE_CXX_FLAGS_RELWITHDEBINFO-ADVANCED:INTERNAL=1
//ADVANCED property for variable: OpenMP_CXX_FLAGS
OpenMP_CXX_FLAGS-ADVANCED:INTERNAL=1

Result from grep BUILD_TYPE CMakeCache.txt:

CMAKE_BUILD_TYPE:STRING=Debug

blattms commented 4 years ago

Ok, that is a debug build: because CMAKE_BUILD_TYPE:STRING=Debug is set, the CMAKE_CXX_FLAGS_DEBUG:STRING=-g -O0 debug flags will be used. You should use CMAKE_BUILD_TYPE=Release for benchmarks.

blattms commented 4 years ago

@ducbueno I sent you a script for easier building via email. HTH.

blattms commented 4 years ago

Here are my numbers (AMD Threadripper 950X, 16 cores, 3.6 GHz; GPU1: GeForce GTX 1060 6GB; GPU2: AMD Radeon XFX RX580 8GB).

| norne/NORNE_ATW2013 | CPU | opencl GC GPU2 | opencl GPU1 | CUDA GPU1 |
| --- | --- | --- | --- | --- |
| Total time (s) | 524.18 | 697 | 8992 | 515 |
| num linearizations | 1807 | 2250 | 310704 | 1848 |
| num newton | 1471 | 1913 | 29726 | 1512 |
| num linear iterations | 22545 | 43895 | 298887 | 22292 |

Unfortunately, using the NVIDIA GPU with opencl I see lots of time step chopping. I have added numbers for it, but they are outrageous.

Here are numbers for runs with --matrix-add-well-contributions=true:

| norne/NORNE_ATW2013 | CPU | opencl GC GPU2 | opencl GPU1 | CUDA GPU1 |
| --- | --- | --- | --- | --- |
| Total time (s) | 528 | 701 | 727 | 501 |
| num linearizations | 1777 | 2162 | 2198 | 1769 |
| num newton | 1443 | 1826 | 1861 | 1435 |
| num linear iterations | 21879 | 41870 | 41255 | 21092 |

blattms commented 4 years ago

OpenCL on my NVIDIA GPU smells quite fishy.

blattms commented 4 years ago

The fishiness goes away if I run with --matrix-add-well-contributions=true. Something might be wrong with the reordering when applying the wells?

blattms commented 4 years ago

Edited my previous comment to add numbers for --matrix-add-well-contributions=true.

Tongdongq commented 4 years ago

I also used --matrix-add-well-contributions=true for these:

| norne/NORNE_ATW2013 | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 475.94 | 531.70 | 748.76 | 669.25 |
| num Linearizations | 1793 | 1780 | 1844 | 2145 |
| num Newton Iterations | 1458 | 1445 | 1507 | 1808 |
| num Linear Iterations | 21991 | 20912 | 22626 | 41281 |

Tongdongq commented 4 years ago

I took the masters of 2020-9-4 9:00, after https://github.com/OPM/opm-simulators/pull/2762 was merged, and found that OpenCL flow was not converging normally for norne/NORNE_ATW2013 with separate standard wells. I could not check https://github.com/OPM/opm-simulators/pull/2816 without also having https://github.com/OPM/opm-simulators/pull/2821 in there. Does anyone have a good date to use with git checkout `git rev-list -n 1 --before="$DATE" master`? I did run with 2020-10-2 9:00, which includes both PRs, and see the same problems. I also did not see any exception thrown for Norne.

Tongdongq commented 4 years ago

I retested NORNE_ATW2013_1A_STDW with a higher maximum message count. Dune is fine and takes 6 minutes. cusparse with well contributions in the matrix also takes 6 minutes. cusparse with separate well contributions takes more than 1 hour and 15x more linear solves. opencl LS with separate well contributions takes even longer, with even more linear solves and iterations. I suspect the WellContributionsOCLContainer also introduced a bug for cusparse, since that still uses the old WellContributions object. Further testing reveals that cusparse for NORNE_ATW2013_1A_STDW has not been working with separate well contributions since the PR in March. This behavior is not seen for the normal NORNE_ATW; is there any difference that could explain this?

Tongdongq commented 4 years ago

I tested NORNE_ATW2013_1A_STDW and noticed that Dune and cusparse have the same number of linear solves for the first 2 Report steps, but opencl LS already differs in the first Time step. After a linear solve, StandardWellimpl.hpp:getWellConvergence() is called. The values in resWell[] are the same for Dune and cusparse, but different for opencl. resWell_[] is probably calculated in assembleWellEqWithoutIteration().

blattms commented 4 years ago

Thanks for this investigation. AFAIK the well equations are calculated from the cell values/intensive quantities in the simulator (still, I am not an expert on the well code). That would mean that the result of the linear solve might be (quite) different.

One major difference is that we use reordering for OpenCL. Maybe we should have an option to skip that? That would allow testing without reordering, to further rule out possibilities.

Another option would be to write out a linear system from cusparse, read it in with OpenCL, solve it, and compare the results. The question is which system to take (probably one from a later stage in the simulation).
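One possible way to do that dump/compare step, sketched with dune-istl's MatrixMarket helpers (file names and function names are made up; where exactly to hook this into the BDA bridge is left open):

```cpp
#include <fstream>
#include <string>

#include <dune/istl/matrixmarket.hh>

// Sketch only: write out the system the cusparse backend just solved, so the
// same A and b can be re-read and handed to the opencl backend for a
// one-to-one comparison of the two GPU solvers.
template <class Matrix, class Vector>
void dumpLinearSystem(const Matrix& A, const Vector& b, const std::string& tag)
{
    Dune::storeMatrixMarket(A, tag + "_A.mm");
    Dune::storeMatrixMarket(b, tag + "_b.mm");
}

template <class Matrix, class Vector>
void readLinearSystem(Matrix& A, Vector& b, const std::string& tag)
{
    std::ifstream fA(tag + "_A.mm");
    Dune::readMatrixMarket(A, fA);
    std::ifstream fb(tag + "_b.mm");
    Dune::readMatrixMarket(b, fb);
}
```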

Concerning the previous question (the difference between NORNE_ATW2013 and NORNE_ATW2013_1A_STDW): I am probably not competent enough for a decent answer (and my attempt to give one now might turn out to be quite embarrassing). But it appears that the standard Norne deck is a history matching case, while NORNE_ATW2013_1A_STDW is a prediction case where the maximum flow from the wells seems limited. Somebody more knowledgeable should comment on this.

Tongdongq commented 4 years ago

I made a new branch here. Not using reordering launches 244431 kernels for 1 ILU apply (as opposed to 2167 or 2*19), which increases the runtime extremely. Usage: --opencl-ilu-reorder=none

ducbueno commented 4 years ago

I made a new branch here where I was able to completely remove the WellContributionsOCLContainer class; the well data is now written to the GPU in the same way as in CUDA (that is, in "chunks" and before the GPU solver is called). On my Intel integrated graphics the code works flawlessly, and on NVIDIA it chops the time steps in the same way it did with the WellContributionsOCLContainer class.

Tongdongq commented 3 years ago

(Plot: WBHP-E-3AH.) This compares NORNE_ATW2013_1A_STDW opencl LS with Dune. Dune was run on a server, opencl LS on my machine; the opencl LS run took 1 h with 17377 Linearizations and 207774 Linear Iterations, while Dune took 458 s with 1007 Linearizations and 19935 Linear Iterations.

Tongdongq commented 3 years ago

https://github.com/OPM/opm-simulators/pull/3089#issuecomment-793081786

Actually, I might have one (or make a fool out of myself again): I looked at the kernel for standard well application. To me it seems like we are missing some local memory synchronization when we do the local reduction on localSum in openclKernels.cpp#L433-L442. At least to me it seems like we are reading from memory locations that other threads have written to, but there is no guarantee that the writes happen in any particular order, and hence we might read values before they have been written. We need to rewrite the code such that there are barrier(CLK_LOCAL_MEM_FENCE); calls before the summations, and we also need to make sure that all workers of a workgroup actually reach these barriers.

Maybe on AMD devices the SIMD is wide enough that valsPerBlock values are computed at once using vectorization, and for my NVIDIA GPU they are not. That might explain the problems that I saw.

I did some quick tests adding barriers at various places and rewriting the kernel slightly to make sure every thread hits the barriers; I did not see any difference in linear convergence. I also tested our non-public csolver, a simple single-threaded CPU linear solver: it was slower than Dune, but produced the same linear convergence. Another possible issue is applying the wells simultaneously; perhaps there are data hazards there. This could occur if multiple wells write to the same rows of the matrix.

Keep in mind that cusparseSolver has the same issue.
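For reference, here is a minimal sketch of the barrier placement discussed above, written as an embedded OpenCL C string the way openclKernels.cpp stores its kernels. It is a generic work-group reduction, not the actual stdwell apply kernel, and it assumes a power-of-two work-group size:

```cpp
// Sketch only: every work-item writes its own localSum slot, then all of them
// hit the barrier before anyone reads a neighbour's slot; the barrier inside
// the loop is reached by all work-items on every iteration.
static const char* localReductionSketch = R"CLC(
__kernel void local_sum_sketch(__global const double *vals,
                               __local double *localSum,
                               __global double *result)
{
    const unsigned int lid = get_local_id(0);

    localSum[lid] = vals[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);                 // all slots written

    for (unsigned int s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s) {
            localSum[lid] += localSum[lid + s];   // read a slot another item wrote
        }
        barrier(CLK_LOCAL_MEM_FENCE);             // outside the if: everyone reaches it
    }

    if (lid == 0) {
        result[get_group_id(0)] = localSum[0];
    }
}
)CLC";
```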

blattms commented 3 years ago

Please tell me the branch and I will test. You did not see the problems on your cards, but on my system the NVIDIA card had problems, and so did the CPU as an OpenCL device with POCL.

Tongdongq commented 3 years ago

https://github.com/Tongdongq/opm-simulators/tree/add-memory-barrier-opencl-stdwell-apply