OPM / opm-simulators

OPM Flow and experimental simulators, including components such as well models etc.
http://www.opm-project.org
GNU General Public License v3.0

[OpenCL] Number of linear iterations fluctuate between multiple run of same model #2755

Open blattms opened 4 years ago

blattms commented 4 years ago

Running opm-tests/model1/BASE1_MSW_HFA multiple times, the number of linear iterations is not the same for every run.

GitPaean commented 4 years ago

I tried both parallel and serial runs, but I could not reproduce your problem.

Other people have also complained about some random behavior; unfortunately, I could not reproduce that either.

Maybe some component in the linear solution setup introduces randomness into the solution procedure?

blattms commented 4 years ago

I probably should have stated this in the text as well (not just as a tag in the title).

The problem is only in the OpenCL code (if you explicitly ask for it with --gpu-mode=opencl).

GitPaean commented 4 years ago

Okay. I did not notice the OpenCL in the title.

I was thinking it would be good to be able to reproduce the randomness reported by some colleagues and find out why it happens.

blattms commented 4 years ago

Reproducing is always highly appreciated. Thanks.

Maybe @ducbueno wants to do that?

blattms commented 4 years ago

This problem persists now that both standard wells and multi-segment wells are implemented. I get a fluctuating number of iterations for SPE9 with current master (standard wells only) and with #2821 (re-adds multi-segment wells). It happens both with --matrix-add-well-contributions=true and false. For cusparse there are no fluctuations. I guess we need to check the ILU and BiCGStab implementations.

ducbueno commented 4 years ago

I believe the fluctuations are due to the OpenCL BiCGStab implementation always using the zero vector as the initial guess. There may be some other issues with the ILU, but I haven't checked that yet.

Tongdongq commented 4 years ago

The cusparseSolver also uses a zero vector as the initial guess, via cudaMemsetAsync(d_x, 0, sizeof(double) * N, stream); in update_system_on_gpu() and copy_system_to_gpu().
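For comparison, a zero initial guess on the OpenCL side would look roughly like the sketch below (OpenCL C++ bindings; the function and variable names are placeholders, not the actual backend code). Since a constant zero start vector is fully deterministic, it cannot by itself explain run-to-run differences.

```cpp
#include <CL/cl2.hpp>
#include <cstddef>

// Sketch only: zero the initial guess x on the device, analogous to the
// cudaMemsetAsync call used by the cusparse backend.
void zeroInitialGuess(cl::CommandQueue& queue, cl::Buffer& d_x, std::size_t N)
{
    const double zero = 0.0;
    // Fill the first N doubles of d_x with 0.0.
    queue.enqueueFillBuffer(d_x, zero, 0, sizeof(double) * N);
}
```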

Tongdongq commented 4 years ago

@blattms @ducbueno I just remembered that the default OpenCL coloring strategy is set to GRAPH_COLORING, as opposed to LEVEL_SCHEDULING, in openclSolverBackend.cpp. This is generally faster, since it respects dependencies less strictly. That results in more parallelism, but also in more linear iterations (up to 2x is normal). What's more, the graph coloring strategy has random output. We could patch this by replacing https://github.com/OPM/opm-simulators/blob/ac3004da9deaf8841eade3c4394f5f5c6cffb95b/opm/simulators/linalg/bda/Reorder.cpp#L60 with std::mt19937 gen(constantSeed); This should solve the randomness issue, but the number of iterations will still be significantly higher than with LEVEL_SCHEDULING (cusparse and Dune also use level scheduling).
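For illustration, a minimal sketch of the proposed fix (the helper and the seed constant are placeholders; the real change would be on the linked line of Reorder.cpp):

```cpp
#include <random>

// Sketch only: seeding the Mersenne Twister with a constant instead of
// std::random_device makes the graph-colouring tie-breaking reproducible,
// so repeated runs see the same colouring and the same iteration counts.
std::mt19937 makeColoringRng(bool reproducible)
{
    constexpr unsigned int constantSeed = 0x5daefdedu;  // any fixed value works
    if (reproducible) {
        return std::mt19937(constantSeed);
    }
    std::random_device rd;   // current behaviour: a different colouring every run
    return std::mt19937(rd());
}
```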

blattms commented 4 years ago

Thanks for the clarification/explanation. I am not sure which way forward is best here. If we want to change it, someone should test it.

At least we should document somewhere prominent that there is randomness and that the iteration numbers will differ between runs. Maybe in the code and in a file doc/READMES_GPU.txt? Other suggestions are welcome.

ducbueno commented 4 years ago

I can run some tests with the level scheduling reordering and also with the fixed random seed on the graph coloring.

It may take a little while before I report back, since today is a holiday in Brazil and I'll be away from the computer.

Tongdongq commented 4 years ago

We could remove the randomness altogether, but then finding a suitable coloring is not guaranteed. If it fails, it should retry with a different seed. I'll run some tests too.
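A rough sketch of the retry idea (the tryColoring callable stands in for the real routine in Reorder.cpp; names and signature are made up):

```cpp
#include <functional>
#include <optional>
#include <random>
#include <stdexcept>
#include <vector>

// Sketch only: deterministic first attempt, then deterministic fallback seeds,
// so the result stays reproducible while a failed colouring is still retried.
std::vector<int> colorWithRetries(
    const std::function<std::optional<std::vector<int>>(std::mt19937&)>& tryColoring,
    unsigned int firstSeed,
    int maxAttempts)
{
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        std::mt19937 gen(firstSeed + static_cast<unsigned int>(attempt));
        if (auto colors = tryColoring(gen)) {
            return *colors;   // valid colouring found with this seed
        }
    }
    throw std::runtime_error("graph colouring failed for all seeds");
}
```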

Tongdongq commented 4 years ago

With masters of 12 Oct 2020, 09:00. System: AMD Ryzen 5 2400G, NVIDIA GTX 1050 Ti, CentOS 7, gcc 7.3.1, OpenCL 1.2, CUDA 11.1.70, driver 455.23.05.

Results for opm-tests below. LS: LEVEL_SCHEDULING, GC: GRAPH_COLORING, GC seed: 0x5daefded. The number of Linear Iterations for GC is constant now (with my local edit).

| model1/BASE1_MSW_HFA | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 0.13 | 1.23 | 0.35 | 0.35 |
| num Linearizations | 21 | 21 | 21 | 21 |
| num Newton Iterations | 14 | 14 | 14 | 14 |
| num Linear Iterations | 58 | 53 | 56 | 137 |

| model1/BASE1_MSW_HFA STDW | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 0.06 | 0.55 | 0.47 | 0.64 |
| num Linearizations | 21 | 21 | 61 | 62 |
| num Newton Iterations | 14 | 14 | 54 | 55 |
| num Linear Iterations | 56 | 47 | 192 | 511 |

| model1/BASE1_MSW_HFA STDW in matrix | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 0.06 | 0.52 | 0.22 | 0.29 |
| num Linearizations | 21 | 21 | 21 | 21 |
| num Newton Iterations | 14 | 14 | 14 | 14 |
| num Linear Iterations | 54 | 45 | 54 | 132 |

| model1/BASE2_MSW_HFA | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 0.26 | 1.05 | 5.8 | 7.28 |
| num Linearizations | 53 | 53 | 390 | 380 |
| num Newton Iterations | 38 | 38 | 369 | 359 |
| num Linear Iterations | 273 | 253 | 4605 | 7794 |

| model1/BASE2_MSW_HFA STDW | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 0.20 | 0.76 | 2.81 | 3.41 |
| num Linearizations | 50 | 51 | 373 | 347 |
| num Newton Iterations | 35 | 36 | 353 | 328 |
| num Linear Iterations | 253 | 234 | 2476 | 4560 |

| model1/BASE2_MSW_HFA STDW in matrix | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 0.20 | 1.28 | 0.54 | 0.57 |
| num Linearizations | 55 | 56 | 50 | 53 |
| num Newton Iterations | 40 | 41 | 35 | 38 |
| num Linear Iterations | 250 | 247 | 274 | 654 |

| model1/BASE3_MSW_HFA | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 0.12 | 0.64 | 0.34 | 0.40 |
| num Linearizations | 24 | 24 | 24 | 24 |
| num Newton Iterations | 16 | 16 | 16 | 16 |
| num Linear Iterations | 100 | 91 | 95 | 233 |


| model2/0_BASE_MODEL2.DATA | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 6.97 | 11.75 | 23.29 | 19.03 |
| num Linearizations | 486 | 463 | 601 | 597 |
| num Newton Iterations | 433 | 410 | 545 | 541 |
| num Linear Iterations | 5767 | 5227 | 9117 | 15218 |

| model2/0_BASE_MODEL2.DATA STDW | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 6.68 | 11.15 | 21.88 | 19.19 |
| num Linearizations | 486 | 463 | 601 | 597 |
| num Newton Iterations | 433 | 410 | 545 | 541 |
| num Linear Iterations | 5767 | 5227 | 9117 | 15218 |

| model2/0_BASE_MODEL2.DATA STDW in matrix | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 6.41 | 11.09 | 14.88 | 13.24 |
| num Linearizations | 501 | 458 | 463 | 478 |
| num Newton Iterations | 447 | 405 | 410 | 424 |
| num Linear Iterations | 5669 | 5040 | 5917 | 10636 |

For norne/NORNE_ATW2013_1A_STDW.DATA, opencl LS gets stuck at Report step 21, opencl GC at Report step 17, and cusparse at Report step 24. Dune is fine and takes 6 minutes. For norne/NORNE_ATW2013_1A_MSW.DATA, opencl LS gets stuck at Report step 60 after 80 minutes, opencl GC at Report step 39, and cusparse at Report step 62 after 60 minutes. Dune finishes but takes 82 minutes with 26860 Linearizations and 139483 Linear Iterations. That is much higher than I expected. Apparently it had to shut down 7 wells.

blattms commented 4 years ago

What does "get stuck" mean? chops the time step until it gives ups and throws an exception?

Tongdongq commented 4 years ago

Some runs I killed myself; some quit without a thrown exception after a few subsequent messages like

Problem: Solver convergence failure - Iteration limit reached
Timestep chopped to x.xxx days

I'll rerun some tests to be more precise. There are some wells that are shut down because they fail to converge. Is this expected for Dune?

blattms commented 4 years ago

Thanks for the numbers. Would you add the total time to the tables, please?

To sum this up:

- For the big discrepancies, the number of Newton steps increases drastically, too. Is that due to more time step chopping?
- @Tongdongq, as you probably have the best overview: would it be possible to summarize the known differences between the cusparse and OpenCL implementations? Maybe there is some striking difference.
- These are all multisegment well problems. Do we see the same behavior for standard wells (e.g. with the --use-multisegment-well=true parameter for the same models), too? If that is the case, how does it additionally behave with --use-multisegment-well=true --matrix-add-well-contributions=true? If we see it there, we could write out a problematic linear system, try to solve it with cusparse, and see how it behaves.

Tongdongq commented 4 years ago

Only the GRAPH_COLORING gives randomness (if we don't choose a constant seed). The increased number of Newton steps suggests to me that the openclSolver does not provide the same 'quality' of solution, which means the outer loop has to do more iterations to reach its required level of convergence.

One of the biggest differences is the reordering. The ILU decomposition in cusparse is handled by the library. For OpenCL it is done manually on the CPU and includes reordering of the rows. Although when I added the openclSolver to the masters of May, it did converge in reasonable time.
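To make the reordering difference concrete, here is a minimal level-scheduling sketch for a lower-triangular factor in CSR format (illustrative only, not the OPM implementation):

```cpp
#include <algorithm>
#include <vector>

// Sketch only: row i of a triangular solve can only be processed after all
// rows j < i it depends on, so its level is one more than the maximum level
// of those rows.  All rows within one level can then be solved in parallel.
std::vector<int> levelSchedule(const std::vector<int>& rowPtr,
                               const std::vector<int>& colIdx)
{
    const int n = static_cast<int>(rowPtr.size()) - 1;
    std::vector<int> level(n, 0);
    for (int i = 0; i < n; ++i) {
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k) {
            const int j = colIdx[k];
            if (j < i) {   // strictly-lower entry, i.e. a dependency
                level[i] = std::max(level[i], level[j] + 1);
            }
        }
    }
    return level;          // rows are then grouped (reordered) by level
}
```

Graph colouring respects these dependencies less strictly, which yields fewer, larger parallel groups but a less exact preconditioner apply, consistent with the roughly 2x higher iteration counts reported for GC above.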

Do you mean --use-multisegment-well=false to run those models with standard wells?

blattms commented 4 years ago

Do you mean --use-multisegment-well=false to run those models with standard wells?

Yes, that was what I meant.

ducbueno commented 4 years ago

Just reporting some results I got running the tests on my machine (with an Intel HD Graphics 620).

| model1/BASE1_MSW_HFA | Dune | opencl LS | opencl GC |
| --- | --- | --- | --- |
| Total time (s) | 0.94 | 1.45 | 1.59 |
| num linearizations | 21 | 21 | 21 |
| num newton | 14 | 14 | 14 |
| num linear iterations | 58 | 56 | 132 |

| model1/BASE2_MSW_HFA | Dune | opencl LS | opencl GC |
| --- | --- | --- | --- |
| Total time (s) | 2.88 | 3.31 | 4.25 |
| num linearizations | 54 | 51 | 51 |
| num newton | 39 | 36 | 36 |
| num linear iterations | 269 | 268 | 597 |

| model1/BASE3_MSW_HFA | Dune | opencl LS | opencl GC |
| --- | --- | --- | --- |
| Total time (s) | 1.09 | 1.59 | 1.91 |
| num linearizations | 24 | 24 | 24 |
| num newton | 16 | 16 | 16 |
| num linear iterations | 100 | 95 | 222 |

| model2/0_BASE_MODEL2 | Dune | opencl LS | opencl GC |
| --- | --- | --- | --- |
| Total time (s) | 298.16 | 119.5 | 100.89 |
| num linearizations | 486 | 418 | 479 |
| num newton | 433 | 366 | 426 |
| num linear iterations | 5767 | 6680 | 10241 |

I didn't get crazy linear iteration numbers with BASE2_MSW_HFA. I'm also running Norne, and it doesn't seem to break on my system (with OpenCL) either. I'll upload my Norne results soon.

blattms commented 4 years ago

Great.

Hence it might depend on the hardware / OpenCL version etc. @Tongdongq, what hardware was that?

Please add the information reported via "Platform version" and "CL_DEVICE_VERSION".

Tongdongq commented 4 years ago

It looks like standard wells applied separately cause problems on my machine. When putting them in the matrix, the number of iterations for OpenCL (LS/GC) is normal. This also happened in a quick test with model1/BASE3_MSW_HFA.

ducbueno commented 4 years ago

Norne results with Intel HD Graphics 620.

| norne/NORNE_ATW2013 | Dune | opencl LS | opencl GC |
| --- | --- | --- | --- |
| Total time (s) | 20297.07 | 6065.88 | 6919.48 |
| num linearizations | 1806 | 1838 | 2362 |
| num newton | 1470 | 1502 | 2021 |
| num linear iterations | 22509 | 23582 | 45148 |

I still haven't had time to simulate Norne with the multisegment wells. As soon as I have something, I'll post it.

atgeirr commented 4 years ago

I find those timings for Norne strange: on my computer I can run Norne in less than 10 minutes in a serial run, and down to about 195 seconds with 8 processes (one thread per process). This seems a lot slower, also for the Dune version, which I assume is the normal, default CPU-based solver?

blattms commented 4 years ago

Well, run times might depend on the machine used (memory speed, etc). Maybe not everybody has a powerful machine at home. I am running this on my machine and will report back (with OpenCL and graph coloring the iterations are roughly in the same ballpark, but on my system it is 10x faster).

bska commented 4 years ago

Well, run times might depend on the machine used (memory speed, etc). Maybe not everybody has a powerful machine at home.

Obviously, but even on my 2014 vintage laptop I'm able to run a sequential simulation of the base NORNE_ATW2013 case in about 800 seconds. That's roughly 25x faster than what's being reported here.

blattms commented 4 years ago

That is true, of course. Maybe the build is not optimized. @ducbueno, can you check how you built OPM? grep CXX_FLAGS CMakeCache.txt; grep BUILD_TYPE CMakeCache.txt

ducbueno commented 4 years ago

Result from grep CXX_FLAGS CMakeCache.txt:

CMAKE_CXX_FLAGS:STRING=-pipe -Wall -Wextra -Wshadow  -pthread -fopenmp
CMAKE_CXX_FLAGS_DEBUG:STRING=-g -O0 -DDEBUG
CMAKE_CXX_FLAGS_MINSIZEREL:STRING=-Os -DNDEBUG -O3 -mtune=native
CMAKE_CXX_FLAGS_RELEASE:STRING=-O3 -DNDEBUG -mtune=native
CMAKE_CXX_FLAGS_RELWITHDEBINFO:STRING=-O2 -g -DNDEBUG -O3 -mtune=native
OpenMP_CXX_FLAGS:STRING=-fopenmp
//ADVANCED property for variable: CMAKE_CXX_FLAGS
CMAKE_CXX_FLAGS-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CXX_FLAGS_DEBUG
CMAKE_CXX_FLAGS_DEBUG-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CXX_FLAGS_MINSIZEREL
CMAKE_CXX_FLAGS_MINSIZEREL-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CXX_FLAGS_RELEASE
CMAKE_CXX_FLAGS_RELEASE-ADVANCED:INTERNAL=1
//ADVANCED property for variable: CMAKE_CXX_FLAGS_RELWITHDEBINFO
CMAKE_CXX_FLAGS_RELWITHDEBINFO-ADVANCED:INTERNAL=1
//ADVANCED property for variable: OpenMP_CXX_FLAGS
OpenMP_CXX_FLAGS-ADVANCED:INTERNAL=1

Result from grep BUILD_TYPE CMakeCache.txt:

CMAKE_BUILD_TYPE:STRING=Debug

blattms commented 4 years ago

Ok, that is a debug build: because CMAKE_BUILD_TYPE:STRING=Debug is set, the CMAKE_CXX_FLAGS_DEBUG:STRING=-g -O0 debug flags will be used. You should use CMAKE_BUILD_TYPE=Release for benchmarks.

blattms commented 4 years ago

@ducbueno I sent you a script for easier building via email. HTH.

blattms commented 4 years ago

Here are my numbers (AMD Threadripper 950X, 16 cores, 3.6 GHz; GPU1: GeForce GTX 1060 6GB; GPU2: AMD Radeon XFX RX580 8GB).

| norne/NORNE_ATW2013 | CPU | opencl GC GPU2 | opencl GPU1 | CUDA GPU1 |
| --- | --- | --- | --- | --- |
| Total time (s) | 524.18 | 697 | 8992 | 515 |
| num linearizations | 1807 | 2250 | 310704 | 1848 |
| num newton | 1471 | 1913 | 29726 | 1512 |
| num linear iterations | 22545 | 43895 | 298887 | 22292 |

Unfortunately, using the NVIDIA GPU with opencl I see lots of time step chopping. I have added numbers for it, but they are outrageous.

Here are numbers for runs with --matrix-add-well-contributions=true:

| norne/NORNE_ATW2013 | CPU | opencl GC GPU2 | opencl GPU1 | CUDA GPU1 |
| --- | --- | --- | --- | --- |
| Total time (s) | 528 | 701 | 727 | 501 |
| num linearizations | 1777 | 2162 | 2198 | 1769 |
| num newton | 1443 | 1826 | 1861 | 1435 |
| num linear iterations | 21879 | 41870 | 41255 | 21092 |

blattms commented 4 years ago

OpenCL on my NVIDIA GPU smells quite fishy.

blattms commented 4 years ago

The fishiness goes away if I run with --matrix-add-well-contributions=true. Something might be wrong with the reordering when applying the wells?

blattms commented 4 years ago

Edited my previous comment to add numbers for --matrix-add-well-contributions=true.

Tongdongq commented 4 years ago

I also used --matrix-add-well-contributions=true for these:

| norne/NORNE_ATW2013 | Dune | cusparse | opencl LS | opencl GC |
| --- | --- | --- | --- | --- |
| Total time (s) | 475.94 | 531.70 | 748.76 | 669.25 |
| num Linearizations | 1793 | 1780 | 1844 | 2145 |
| num Newton Iterations | 1458 | 1445 | 1507 | 1808 |
| num Linear Iterations | 21991 | 20912 | 22626 | 41281 |

Tongdongq commented 4 years ago

I took the masters of 2020-9-4 9:00, after https://github.com/OPM/opm-simulators/pull/2762 was merged, and found that OpenCL flow was not converging normally for norne/NORNE_ATW2013 with separate standard wells. I could not check https://github.com/OPM/opm-simulators/pull/2816 without also having https://github.com/OPM/opm-simulators/pull/2821 in there. Does anyone have a good date to use with git checkout `git rev-list -n 1 --before="$DATE" master`? I did run with 2020-10-2 9:00, which includes both PRs, and see the same problems. I also did not see any exception thrown for Norne.

Tongdongq commented 4 years ago

I retested NORNE_ATW2013_1A_STDW with a higher maximum message count. Dune is fine and takes 6 minutes. cusparse with well contributions in the matrix also takes 6 minutes. cusparse with separate well contributions takes more than 1 hour and 15x more linear solves. opencl LS with separate well contributions takes even longer, with even more linear solves and iterations. I suspect the WellContributionsOCLContainer also introduced a bug for cusparse, since that still uses the old WellContributions object. Further testing reveals that cusparse for NORNE_ATW2013_1A_STDW has not been working with separate well contributions since the PR in March. This behavior is not seen for the normal NORNE_ATW; is there any difference that could explain this?

Tongdongq commented 4 years ago

I tested NORNE_ATW2013_1A_STDW and noticed that Dune and cusparse have the same number of linear solves for the first 2 Report steps, but opencl LS already differs in the first Time step. After a linear solve, StandardWellimpl.hpp:getWellConvergence() is called. The values in resWell[] are the same for Dune and cusparse, but different for opencl. resWell_[] is probably calculated in assembleWellEqWithoutIteration().

blattms commented 4 years ago

Thanks for this investigation. AFAIK the well equations are calculated from the cell values/intensive quantities in the simulator (still, I am not an expert on the well code). That would mean that the result of the linear solve might be (quite) different.

One major difference is that we use reordering for OpenCL. Maybe we should have an option to skip that? That would allow testing without reordering, to further rule out possibilities.

Another option would be to write out a linear system from cusparse, read it in with OpenCL, solve it, and compare the results. The question is which system to take (probably one from a later stage in the simulation).
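One possible way to do that dump/compare step, sketched with dune-istl's MatrixMarket helpers (file names and function names are made up; where exactly to hook this into the BDA bridge is left open):

```cpp
#include <fstream>
#include <string>

#include <dune/istl/matrixmarket.hh>

// Sketch only: write out the system the cusparse backend just solved, so the
// same A and b can be re-read and handed to the opencl backend for a
// one-to-one comparison of the two GPU solvers.
template <class Matrix, class Vector>
void dumpLinearSystem(const Matrix& A, const Vector& b, const std::string& tag)
{
    Dune::storeMatrixMarket(A, tag + "_A.mm");
    Dune::storeMatrixMarket(b, tag + "_b.mm");
}

template <class Matrix, class Vector>
void readLinearSystem(Matrix& A, Vector& b, const std::string& tag)
{
    std::ifstream fA(tag + "_A.mm");
    Dune::readMatrixMarket(A, fA);
    std::ifstream fb(tag + "_b.mm");
    Dune::readMatrixMarket(b, fb);
}
```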

Concerning the previous question (the difference between NORNE_ATW2013 and NORNE_ATW2013_1A_STDW): I am probably not competent enough for a decent answer (and my attempt to give one now might turn out to be quite embarrassing). But it appears that the standard Norne deck is a history matching case, while NORNE_ATW2013_1A_STDW is a prediction case where the maximum flow from the wells seems limited. Somebody more knowledgeable should comment on this.

Tongdongq commented 4 years ago

I made a new branch here. Not using reordering launches 244431 kernels for 1 ILU apply (as opposed to 2167 or 2*19), which increases the runtime extremely. Usage: --opencl-ilu-reorder=none

ducbueno commented 4 years ago

I made a new branch here where I was able to completely remove the WellContributionsOCLContainer class; the well data is now written to the GPU in the same way as in CUDA (that is, in "chunks" and before the GPU solver is called). On my Intel integrated graphics the code works flawlessly, and on NVIDIA it chops the time steps in the same way it did with the WellContributionsOCLContainer class.

Tongdongq commented 3 years ago

(Plot: WBHP-E-3AH.) This compares NORNE_ATW2013_1A_STDW opencl LS with Dune. Dune was run on a server, opencl LS on my machine; the opencl LS run took 1 h with 17377 Linearizations and 207774 Linear Iterations, while Dune took 458 s with 1007 Linearizations and 19935 Linear Iterations.

Tongdongq commented 3 years ago

https://github.com/OPM/opm-simulators/pull/3089#issuecomment-793081786

Actually, I might have one (or make a fool out of myself again): I looked at the kernel for standard well application. To me it seems like we are missing some local memory synchronization when we do the local reduction on localSum in openclKernels.cpp#L433-L442. At least to me it seems like we are reading from memory locations that other threads have written to, but there is no guarantee that the writes happen in any particular order, and hence we might read values before they have been written. We need to rewrite the code such that there are barrier(CLK_LOCAL_MEM_FENCE); calls before the summations, and we also need to make sure that all workers of a workgroup actually reach these barriers.

Maybe on AMD devices the SIMD is wide enough that valsPerBlock values are computed at once using vectorization, and for my NVIDIA GPU they are not. That might explain the problems that I saw.

I did some quick tests adding barriers at various places and rewriting the kernel slightly to make sure every thread hits the barriers; I did not see any difference in linear convergence. I also tested our non-public csolver, a simple single-threaded CPU linear solver: it was slower than Dune, but produced the same linear convergence. Another possible issue is applying the wells simultaneously; perhaps there are data hazards there. This could occur if multiple wells write to the same rows of the matrix.

Keep in mind that cusparseSolver has the same issue.
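For reference, here is a minimal sketch of the barrier placement discussed above, written as an embedded OpenCL C string the way openclKernels.cpp stores its kernels. It is a generic work-group reduction, not the actual stdwell apply kernel, and it assumes a power-of-two work-group size:

```cpp
// Sketch only: every work-item writes its own localSum slot, then all of them
// hit the barrier before anyone reads a neighbour's slot; the barrier inside
// the loop is reached by all work-items on every iteration.
static const char* localReductionSketch = R"CLC(
__kernel void local_sum_sketch(__global const double *vals,
                               __local double *localSum,
                               __global double *result)
{
    const unsigned int lid = get_local_id(0);

    localSum[lid] = vals[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);                 // all slots written

    for (unsigned int s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s) {
            localSum[lid] += localSum[lid + s];   // read a slot another item wrote
        }
        barrier(CLK_LOCAL_MEM_FENCE);             // outside the if: everyone reaches it
    }

    if (lid == 0) {
        result[get_group_id(0)] = localSum[0];
    }
}
)CLC";
```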

blattms commented 3 years ago

Please tell me the branch and I will test. You did not see the problems on your cards, but on my system the NVIDIA card had problems, and so did the CPU as an OpenCL device with POCL.

Tongdongq commented 3 years ago

https://github.com/Tongdongq/opm-simulators/tree/add-memory-barrier-opencl-stdwell-apply