blattms opened this issue 4 years ago
I tried both parallel and serial, and I could not reproduce your problem.
Other people have also complained about some random behavior; unfortunately, I could not reproduce that either.
Maybe some component in the linear solution setup introduces randomness into the solution procedure?
I probably should have stated this in the text as well (not just as a tag in the title).
The problem is only in the OpenCL code (if you explicitly ask for it with --gpu-mode=opencl).
Okay, I did not notice the OpenCL in the title.
I was thinking it would be good to be able to reproduce the randomness reported by some colleagues and find out why it happens.
Reproducing is always highly appreciated. Thanks.
Maybe @ducbueno wants to do that?
This problem persists now that both standard wells and multisegment wells are implemented. I get a fluctuating number of iterations for SPE9 with current master (standard wells only) and with #2821 (re-adds multisegment wells). It happens both with --matrix-add-well-contributions=true
and with false. For cusparse there are no fluctuations. I guess we need to check the ILU and BiCGStab implementations.
I believe the fluctuations are due to the OpenCL BiCGStab implementation always using the zero vector as initial guess. There may be some other issues with the ILU, but I haven't checked that yet.
The cusparseSolver also uses a zero vector as initial guess, with cudaMemsetAsync(d_x, 0, sizeof(double) * N, stream); in update_system_on_gpu() and copy_system_to_gpu().
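For context, a minimal sketch of what this looks like on the host side, together with a hypothetical warm start that reuses the previous solution as the initial guess. The names d_x, h_x, N, and stream are assumed to exist, and the warm-start branch is not what the current code does:

```cpp
// Sketch only: zero initial guess (current behavior as described above)
// versus a hypothetical warm start from the previous solution.
#include <cuda_runtime.h>

void set_initial_guess(double* d_x, const double* h_x, int N,
                       cudaStream_t stream, bool warm_start)
{
    if (warm_start) {
        // Hypothetical: copy the previous solution to the device as initial guess.
        cudaMemcpyAsync(d_x, h_x, sizeof(double) * N,
                        cudaMemcpyHostToDevice, stream);
    } else {
        // Current behavior: start BiCGStab from the zero vector.
        cudaMemsetAsync(d_x, 0, sizeof(double) * N, stream);
    }
}
```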
@blattms @ducbueno
I just remembered that the default OpenCL coloring strategy is set to GRAPH_COLORING, as opposed to LEVEL_SCHEDULING, in openclSolverBackend.cpp.
This is generally faster, since it respects fewer dependencies. That results in more parallelism, but also in more linear iterations (up to 2x is normal). What's more, the graph coloring strategy has random output.
We could patch this by replacing
https://github.com/OPM/opm-simulators/blob/ac3004da9deaf8841eade3c4394f5f5c6cffb95b/opm/simulators/linalg/bda/Reorder.cpp#L60
with std::mt19937 gen(constantSeed);
This should solve the randomness issue, but the number of iterations will still be significantly higher than with LEVEL_SCHEDULING (cusparse and dune also use level scheduling).
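For illustration, a minimal sketch of that patch, assuming the current code seeds the generator from std::random_device (the exact code at that line may differ):

```cpp
#include <random>

// Before (assumed): seeding from std::random_device, which makes the
// coloring non-deterministic.
//   std::random_device rd;
//   std::mt19937 gen(rd());

// After: a fixed seed makes the graph coloring, and hence the iteration
// counts, reproducible from run to run.
constexpr unsigned int constantSeed = 0x5daefdedu; // seed used in the test results below
std::mt19937 gen(constantSeed);
```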
Thanks for the clarification/explanation. I am not sure which way forward is the best here. If we want to change it, someone should test it.
At least we should document in some prominent place that there is randomness and that the iteration numbers will differ. Maybe in the code and in a file doc/READMES_GPU.txt? Other suggestions are welcome.
I can run some tests with the level scheduling reordering and also with the fixed random seed on the graph coloring.
It may take a little while for me to report back, since today is a holiday in Brazil and I'll be away from the computer.
We could remove the randomness altogether, but then it is not guaranteed to find a suitable coloring; if it fails, it should retry with a different seed. I'll run some tests too.
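A rough sketch of that retry idea; colorGraph, its signature, and the seed list are hypothetical and not the actual OPM API:

```cpp
#include <random>
#include <vector>

// Hypothetical coloring routine: returns true if it found a coloring that
// uses at most maxColors colors. The real OPM routine has a different
// signature; this only illustrates the "retry with another seed" idea.
bool colorGraph(std::mt19937& gen, std::vector<int>& colors, int maxColors);

// Try a fixed list of seeds in order, so the result is deterministic across
// runs, but fall back to the next seed if one fails to produce a coloring.
bool colorGraphDeterministic(std::vector<int>& colors, int maxColors)
{
    const unsigned int seeds[] = {0x5daefdedu, 1u, 2u, 3u};
    for (unsigned int seed : seeds) {
        std::mt19937 gen(seed);
        if (colorGraph(gen, colors, maxColors)) {
            return true;
        }
    }
    return false; // no suitable coloring found with any of the tried seeds
}
```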
With the masters of 12 Oct 2020, 09:00.
Hardware/software: AMD Ryzen 5 2400G, NVIDIA GTX 1050Ti, CentOS 7, gcc 7.3.1, OpenCL 1.2, CUDA 11.1.70, driver 455.23.05.
opm-tests results below. LS: LEVEL_SCHEDULING, GC: GRAPH_COLORING, seed for GC: 0x5daefded. The number of Linear Iterations for GC is constant now (with my local edit).
model1/BASE1_MSW_HFA | Dune | cusparse | opencl LS | opencl GC |
---|---|---|---|---|
Total time (s) | 0.13 | 1.23 | 0.35 | 0.35 |
num Linearizations | 21 | 21 | 21 | 21 |
num Newton Iterations | 14 | 14 | 14 | 14 |
num Linear Iterations | 58 | 53 | 56 | 137 |
model1/BASE1_MSW_HFA STDW | Dune | cusparse | opencl LS | opencl GC |
---|---|---|---|---|
Total time (s) | 0.06 | 0.55 | 0.47 | 0.64 |
num Linearizations | 21 | 21 | 61 | 62 |
num Newton Iterations | 14 | 14 | 54 | 55 |
num Linear Iterations | 56 | 47 | 192 | 511 |
model1/BASE1_MSW_HFA STDW in matrix | Dune | cusparse | opencl LS | opencl GC |
---|---|---|---|---|
Total time (s) | 0.06 | 0.52 | 0.22 | 0.29 |
num Linearizations | 21 | 21 | 21 | 21 |
num Newton Iterations | 14 | 14 | 14 | 14 |
num Linear Iterations | 54 | 45 | 54 | 132 |
model1/BASE2_MSW_HFA | Dune | cusparse | opencl LS | opencl GC |
---|---|---|---|---|
Total time (s) | 0.26 | 1.05 | 5.8 | 7.28 |
num Linearizations | 53 | 53 | 390 | 380 |
num Newton Iterations | 38 | 38 | 369 | 359 |
num Linear Iterations | 273 | 253 | 4605 | 7794 |
model1/BASE2_MSW_HFA STDW | Dune | cusparse | opencl LS | opencl GC |
---|---|---|---|---|
Total time (s) | 0.20 | 0.76 | 2.81 | 3.41 |
num Linearizations | 50 | 51 | 373 | 347 |
num Newton Iterations | 35 | 36 | 353 | 328 |
num Linear Iterations | 253 | 234 | 2476 | 4560 |
model1/BASE2_MSW_HFA STDW in matrix | Dune | cusparse | opencl LS | opencl GC |
---|---|---|---|---|
Total time (s) | 0.20 | 1.28 | 0.54 | 0.57 |
num Linearizations | 55 | 56 | 50 | 53 |
num Newton Iterations | 40 | 41 | 35 | 38 |
num Linear Iterations | 250 | 247 | 274 | 654 |
model1/BASE3_MSW_HFA | Dune | cusparse | opencl LS | opencl GC |
---|---|---|---|---|
Total time (s) | 0.12 | 0.64 | 0.34 | 0.40 |
num Linearizations | 24 | 24 | 24 | 24 |
num Newton Iterations | 16 | 16 | 16 | 16 |
num Linear Iterations | 100 | 91 | 95 | 233 |
model2/0_BASE_MODEL2.DATA | Dune | cusparse | opencl LS | opencl GC |
---|---|---|---|---|
Total time (s) | 6.97 | 11.75 | 23.29 | 19.03 |
num Linearizations | 486 | 463 | 601 | 597 |
num Newton Iterations | 433 | 410 | 545 | 541 |
num Linear Iterations | 5767 | 5227 | 9117 | 15218 |
model2/0_BASE_MODEL2.DATA STDW | Dune | cusparse | opencl LS | opencl GC |
---|---|---|---|---|
Total time (s) | 6.68 | 11.15 | 21.88 | 19.19 |
num Linearizations | 486 | 463 | 601 | 597 |
num Newton Iterations | 433 | 410 | 545 | 541 |
num Linear Iterations | 5767 | 5227 | 9117 | 15218 |
model2/0_BASE_MODEL2.DATA STDW in matrix | Dune | cusparse | opencl LS | opencl GC |
---|---|---|---|---|
Total time (s) | 6.41 | 11.09 | 14.88 | 13.24 |
num Linearizations | 501 | 458 | 463 | 478 |
num Newton Iterations | 447 | 405 | 410 | 424 |
num Linear Iterations | 5669 | 5040 | 5917 | 10636 |
For norne/NORNE_ATW2013_1A_STDW.DATA, opencl LS gets stuck at Report step 21, opencl GC at Report step 17, and cusparse at Report step 24. Dune is fine and takes 6 minutes.
For norne/NORNE_ATW2013_1A_MSW.DATA, opencl LS gets stuck at Report step 60 after 80 minutes, opencl GC at Report step 39, and cusparse at Report step 62 after 60 minutes. Dune finishes but takes 82 minutes with 26860 Linearizations and 139483 Linear Iterations. That's much higher than I expected. Apparently it had to shut down 7 wells.
What does "get stuck" mean? Does it chop the time step until it gives up and throws an exception?
Some runs I actually killed myself; some quit without a thrown exception after a few subsequent messages of
Problem: Solver convergence failure - Iteration limit reached
Timestep chopped to x.xxx days
I'll rerun some tests to be more precise. Some wells are shut down because they cannot be converged. Is this expected for Dune?
Thanks for the numbers. Would you add the total time to the tables, please?
To sum this up:
For the big discrepancies, the number of newton steps increases drastically, too. Is that due to more time step chopping?
@Tongdongq, as you probably have the best overview, would it be possible to summarize the known differences between the cusparse and OpenCL implementations? Maybe there is some striking difference?
These are all multisegment well problems. Do we see the same behavior for standard wells (e.g. with the --use-multisegment-well=true parameter for the same models), too? If that is the case, how does it additionally behave with --use-multisegment-well=true --matrix-add-well-contributions=true? If we see it there, we could write out a problematic linear system, try to solve that with cusparse, and see how it behaves.
Only the GRAPH_COLORING will give randomness (if we don't choose a constant seed). The increased number of Newton steps seems to me like the openclSolver does not provide the same 'quality' in the solution, which means the outer loop has to do more iterations to reach its required levels of convergence.
One of the biggest differences is the reordering. The ILU decomposition in cusparse is handled by the library. For opencl it is done manually on the CPU and includes reordering of the rows. Although when I added the openclSolver to the masters of May, it did converge in reasonable time.
Do you mean --use-multisegment-well=false to run those models with standard wells?
> Do you mean --use-multisegment-well=false to run those models with standard wells?

Yes, that was what I meant.
Just reporting some results I got running the tests on my machine (with an Intel HD Graphics 620).
model1/BASE1_MSW_HFA | Dune | opencl LS | opencl GC |
---|---|---|---|
Total time (s) | 0.94 | 1.45 | 1.59 |
num linearizations | 21 | 21 | 21 |
num newton | 14 | 14 | 14 |
num linear iterations | 58 | 56 | 132 |
model1/BASE2_MSW_HFA | Dune | opencl LS | opencl GC |
---|---|---|---|
Total time (s) | 2.88 | 3.31 | 4.25 |
num linearizations | 54 | 51 | 51 |
num newton | 39 | 36 | 36 |
num linear iterations | 269 | 268 | 597 |
model1/BASE3_MSW_HFA | Dune | opencl LS | opencl GC |
---|---|---|---|
Total time (s) | 1.09 | 1.59 | 1.91 |
num linearizations | 24 | 24 | 24 |
num newton | 16 | 16 | 16 |
num linear iterations | 100 | 95 | 222 |
model2/0_BASE_MODEL2 | Dune | opencl LS | opencl GC |
---|---|---|---|
Total time (s) | 298.16 | 119.5 | 100.89 |
num linearizations | 486 | 418 | 479 |
num newton | 433 | 366 | 426 |
num linear iterations | 5767 | 6680 | 10241 |
I didn't get crazy linear iteration numbers with BASE2_MSW_HFA. I'm also running Norne, and it doesn't seem to break on my system (with opencl) either. I'll upload my Norne results soon.
Great.
Hence it might depend on the hardware / OpenCL version, etc. @Tongdongq, what hardware was that?
Please add the information reported via "Platform version" and "CL_DEVICE_VERSION".
It looks like standard wells applied separately cause problems on my machine. When putting them in the matrix, the number of iterations for opencl (LS/GC) is normal. This also happened for a quick test with model1/BASE3_MSW_HFA.
Norne results with Intel HD Graphics 620.
norne/NORNE_ATW2013 | Dune | opencl LS | opencl GC |
---|---|---|---|
Total time (s) | 20297.07 | 6065.88 | 6919.48 |
num linearizations | 1806 | 1838 | 2362 |
num newton | 1470 | 1502 | 2021 |
num linear iterations | 22509 | 23582 | 45148 |
I still haven't had time to simulate Norne with the multisegment wells. As soon as I have something, I'll post it.
I find those timings for Norne strange: on my computer I can run Norne in less than 10 minutes in a serial run, down to about 195 seconds with 8 processes (one thread per process). This seems a lot slower, also for the Dune version, which I assume is the normal, default CPU-based solver?
Well, run times might depend on the machine used (memory speed, etc.). Maybe not everybody has a powerful machine at home. I am running this on my machine and will report back (with opencl and graph coloring the iterations are roughly in the same ballpark, but on my system it is 10x faster).
> Well, run times might depend on the machine used (memory speed, etc). Maybe not everybody has a powerful machine at home.
Obviously, but even on my 2014-vintage laptop I'm able to run a sequential simulation of the base NORNE_ATW2013 case in about 800 seconds. That's roughly ~~2.5x~~ 25x faster than what's being reported here.
That is true, of course. Maybe the build is not optimized.
@ducbueno, can you check how you built OPM? grep CXX_FLAGS CMakeCache.txt; grep BUILD_TYPE CMakeCache.txt
Result from grep CXX_FLAGS CMakeCache.txt:
CMAKE_CXX_FLAGS:STRING=-pipe -Wall -Wextra -Wshadow -pthread -fopenmp
CMAKE_CXX_FLAGS_DEBUG:STRING=-g -O0 -DDEBUG
CMAKE_CXX_FLAGS_MINSIZEREL:STRING=-Os -DNDEBUG -O3 -mtune=native
CMAKE_CXX_FLAGS_RELEASE:STRING=-O3 -DNDEBUG -mtune=native
CMAKE_CXX_FLAGS_RELWITHDEBINFO:STRING=-O2 -g -DNDEBUG -O3 -mtune=native
OpenMP_CXX_FLAGS:STRING=-fopenmp
//ADVANCED property for variable: CMAKE_CXX_FLAGS
CMAKE_CXX_FLAGS-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CXX_FLAGS_DEBUG
CMAKE_CXX_FLAGS_DEBUG-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CXX_FLAGS_MINSIZEREL
CMAKE_CXX_FLAGS_MINSIZEREL-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CXX_FLAGS_RELEASE
CMAKE_CXX_FLAGS_RELEASE-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CXX_FLAGS_RELWITHDEBINFO
CMAKE_CXX_FLAGS_RELWITHDEBINFO-ADVANCED:INTERNAL=1
//ADVANCED property for variable: OpenMP_CXX_FLAGS
OpenMP_CXX_FLAGS-ADVANCED:INTERNAL=1
Result from grep BUILD_TYPE CMakeCache.txt:
CMAKE_BUILD_TYPE:STRING=Debug
Ok, that is a debug build: because of CMAKE_BUILD_TYPE:STRING=Debug, the CMAKE_CXX_FLAGS_DEBUG:STRING=-g -O0 debug flags will be used. You should use CMAKE_BUILD_TYPE=Release for benchmarks.
@ducbueno I did send you a script for easier building via email. HTH
Here are my numbers (AMD threadripper 950X 16-Core 3.6 GHz, GPU1: GeForce GTX 1060 6GB, GPU2: AMD Radeon XFX RX580 8GB)
norne/NORNE_ATW2013 | CPU | opencl GC GPU2 | opencl GPU1 | CUDA GPU1 |
---|---|---|---|---|
Total time (s) | 524.18 | 697 | 8992 | 515 |
num linearizations | 1807 | 2250 | 310704 | 1848 |
num newton | 1471 | 1913 | 29726 | 1512 |
num linear iterations | 22545 | 43895 | 298887 | 22292 |
Unfortunately, using the NVIDIA GPU with opencl I see lots of time step chopping. I have added numbers for it, but they are outrageous.
Here are numbers for runs with --matrix-add-well-contributions=true:
norne/NORNE_ATW2013 | CPU | opencl GC GPU2 | opencl GPU1 | CUDA GPU1 |
---|---|---|---|---|
Total time (s) | 528 | 701 | 727 | 501 |
num linearizations | 1777 | 2162 | 2198 | 1769 |
num newton | 1443 | 1826 | 1861 | 1435 |
num linear iterations | 21879 | 41870 | 41255 | 21092 |
OpenCL on my NVIDIA GPU smells quite fishy.
The fishiness goes away if I run with --matrix-add-well-contributions=true. Something might be wrong with the reordering when applying the wells?
Edited my previous comment to add numbers for --matrix-add-well-contributions=true.
I also used --matrix-add-well-contributions=true for these:
norne/NORNE_ATW2013 | Dune | cusparse | opencl LS | opencl GC |
---|---|---|---|---|
Total time (s) | 475.94 | 531.70 | 748.76 | 669.25 |
num Linearizations | 1793 | 1780 | 1844 | 2145 |
num Newton Iterations | 1458 | 1445 | 1507 | 1808 |
num Linear Iterations | 21991 | 20912 | 22626 | 41281 |
I took the masters of 2020-9-4 9:00, after https://github.com/OPM/opm-simulators/pull/2762 was merged, and found that opencl flow was not converging normally for norne/NORNE_ATW2013 with separate standard wells.
I could not check https://github.com/OPM/opm-simulators/pull/2816 without also having https://github.com/OPM/opm-simulators/pull/2821 in there. Does anyone have a good date to use with git checkout `git rev-list -n 1 --before="$DATE" master`?
I did run with 2020-10-2 9:00, which includes both PRs, and I see the same problems.
I also did not see any exception thrown for Norne.
I retested NORNE_ATW2013_1A_STDW with a higher maximum message count.
Dune is fine and takes 6 minutes. cusparse with wellcontributions in the matrix also takes 6 minutes. cusparse with separate wellcontributions takes more than 1 hour and needs 15x more linear solves. opencl LS with separate wellcontributions takes even longer, with even more linear solves and iterations.
I suspect the WellContributionsOCLContainer change also introduced a bug for cusparse, since that still uses the old WellContributions object.
Further testing reveals that cusparse for NORNE_ATW2013_1A_STDW has not been working with separate wellcontributions since the PR in March. This behavior is not seen for the normal NORNE_ATW; is there any difference that could explain this?
I tested NORNE_ATW2013_1A_STDW and noticed that Dune and cusparse have the same number of linear solves for the first 2 Report steps, but opencl LS already differs in the first Time step. After a linear solve, StandardWellimpl.hpp:getWellConvergence() is called. The values in resWell[] are the same for Dune and cusparse, but different for opencl. resWell_[] is probably calculated in assembleWellEqWithoutIteration().
Thanks for this investigation. AFAIK the well equations are calculated from the cell values/intensive quantities in the simulator (though I am not an expert on the well code), which would mean that the result of the linear solve might be (quite) different.
One major difference is that we use reordering for openCL. Maybe we should have an option to skip that? That would allow testing it without reordering to further rule out possibilities.
Another option would be to write out a linear system from cusparse, read that in with opencl, solve it, and compare the results. The question is which system to take (probably one from a later stage in the simulation).
Concerning the previous question (the difference between NORNE_ATW2013 and NORNE_ATW2013_1A_STDW), I am probably not competent enough for a decent answer (and my attempt to do so now might turn out to be quite embarrassing). But it appears that the standard Norne is a history matching case, while NORNE_ATW2013_1A_STDW is a prediction case where the maximum flow from the wells seems limited. Somebody more knowledgeable should comment on this.
I made a new branch here. Not using reordering launches 244431 kernels for 1 ILU apply (as opposed to 2167 or 2*19), which increases the runtime extremely.
Usage: --opencl-ilu-reorder=none
I made a new branch here where I was able to completely remove the WellContributionsOCLContainer class; the well data is written to the GPU in the same way as in CUDA (that is, in "chunks" and before the GPU solver is called). On my Intel integrated graphics the code works flawlessly, and on NVIDIA it chops the time steps in the same way it did with the WellContributionsOCLContainer class.
This compares NORNE_ATW2013_1A_STDW opencl LS with Dune. Dune was run on a server, opencl LS on my machine; the opencl LS run took 1 h, with 17377 Linearizations and 207774 Linear Iterations. Dune took 458 s, with 1007 Linearizations and 19935 Linear Iterations.
https://github.com/OPM/opm-simulators/pull/3089#issuecomment-793081786
Actually, I might have one (or make a fool of myself again): I looked at the kernel for standard well application. To me it seems like we are missing some local memory synchronization when doing the local reduction on localSum in openclKernels.cpp#L433-L442. At least to me it seems like we are reading from memory locations that other threads have written to, but there is no guarantee about the order of those writes, and hence we might try to read values before they have been written. We need to rewrite the code such that there is a barrier(CLK_LOCAL_MEM_FENCE); before the summations, and we also need to make sure that all work-items of a work-group actually reach these barriers.
Maybe on AMD devices the SIMD is wide enough that valsPerBlock values are computed at once using vectorization, while on my NVIDIA GPU they are not. That might explain the problems that I saw.
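For reference, a minimal sketch of the local-reduction-with-barriers pattern described above, written as an OpenCL C kernel embedded in a C++ raw string. It is not the actual OPM kernel; the kernel name, the localSum argument, and the power-of-two work-group assumption are illustrative.

```cpp
// Sketch only: a work-group-local sum with the barriers discussed above.
const char* localSumKernelSketch = R"(
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void local_sum(__global const double *in,
                        __global double *out,
                        __local double *localSum)
{
    const unsigned int lid = get_local_id(0);

    // Each work-item writes its own value into local memory.
    localSum[lid] = in[get_global_id(0)];

    // All writes to localSum must be visible before anyone reads them.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction; assumes the work-group size is a power of two.
    for (unsigned int s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s) {
            localSum[lid] += localSum[lid + s];
        }
        // The barrier is outside the if, so every work-item reaches it.
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0) {
        out[get_group_id(0)] = localSum[0];
    }
}
)";
```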
I did some quick tests adding barriers at various places and rewriting the kernel slightly to make sure every thread hits the barriers. I did not see any difference in linear convergence. I also tested our non-public csolver, which is a simple, single-threaded CPU linear solver; it was slower than Dune but produced the same linear convergence. Another possible issue is applying the wells simultaneously; perhaps there are some data hazards. This could occur if multiple wells write to the same rows of the matrix.
Keep in mind that cusparseSolver has the same issue.
Please tell me the branch and I will test. You did not see the problems on your cards, but on my system both the NVIDIA card and the CPU as an OpenCL device (with POCL) had problems.
Running opm-tests/model1/BASE1_MSW_HFA multiple times, the number of linear iterations is not the same for every run.