diku-dk / bfast

GPU Implementation for BFAST

Performance benchmark between the different backends? #26

Open 12rambau opened 3 years ago

12rambau commented 3 years ago

I am using this implementation of BFAST in the Bfast-gpu module of the SEPAL platform, and I wanted to check how the different backends behave on a GPU machine.

I did my tests on a G4 machine (4 GPUs, 16 GB of RAM) on a 5-year time series over the Juaboso province (1 370 km² at 30x30 m resolution):

I was extremely surprised to see the Python backend running faster even on a GPU machine, so I also tested it on the test data included in the repository (https://github.com/diku-dk/bfast/blob/master/examples/peru_small_opencl.py):

I'm showing the mean of the 3 best runs out of 10.

My question is simple: what is the point of the OpenCL implementation if it's not faster than the Python one on GPU machines? Is this normal?

mortvest commented 3 years ago

This is definitely not normal. The OpenCL backend should be much faster than the Python backend. What GPU are you using? What version of the code are you using?

mortvest commented 3 years ago

I have just tried running the version from the master branch on my laptop and on a GPU machine with an NVIDIA A100. On both machines, the OpenCL version is at least three times slower than the Python one on peru_small (it takes a while to initialize the GPU environment; the speedup should be much bigger on larger datasets). I have also just tried running it on a G4 machine (I apparently still have access to SEPAL): peru_small_opencl ran in 30 seconds with OpenCL.

12rambau commented 3 years ago

OK, so I don't get exactly the same results because I transformed your file into a notebook to simplify the display of my results, but the order of magnitude between OpenCL and Python remains the same (x3).

If I understand you correctly, this means that the windowing process I'm doing here https://github.com/12rambau/bfast_gpu/blob/2f4540bd14c2af70720f24229833fb43834155e8/component/scripts/process.py#L127 is actually slowing things down (since I create as many 512x512 windows as I can, I increase the number of initializations the program needs to run)?

mortvest commented 3 years ago

That is probably the case. Furthermore, I don't know whether the parallelization approach you are using will work on multiple GPUs, since BFASTMonitor will always choose the OpenCL device with device_id=0 unless told otherwise.

I would create one instance of BFASTMonitor for each GPU (you can use device_id to choose the GPU) and then run fit() on each window, as sketched below. Alternatively, you can use bigger windows and use n_chunks to let BFASTMonitor split each window into smaller chunks. It depends on the size of the datasets you are using and the available RAM.
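
Roughly something like this, as a minimal sketch; I am assuming here that the constructor and fit() arguments shown (start_monitor, freq, backend, device_id, nan_value) match the bfast version you have installed, and that your windows are (time, height, width) cubes sharing one dates list:

from datetime import datetime
from bfast import BFASTMonitor

N_GPUS = 4  # e.g. a G4 instance with 4 GPUs

# One monitor per GPU; device_id pins each instance to one OpenCL device.
models = [
    BFASTMonitor(
        start_monitor=datetime(2019, 1, 1),  # placeholder monitoring start
        freq=365,
        backend="opencl",
        device_id=gpu,
        verbose=0,
    )
    for gpu in range(N_GPUS)
]

results = []
for i, window in enumerate(windows):   # windows: list of (time, height, width) cubes
    model = models[i % N_GPUS]         # round-robin over the GPUs
    model.fit(window, dates, nan_value=-32768)
    results.append(model.breaks)       # per-pixel break indices after fit()

Note that this loop is still sequential: to actually keep all four GPUs busy, each model would have to run its fit() in a separate process.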

12rambau commented 3 years ago

If you have any clue on how to launch processes on specific cores from within Python, I would be extremely interested. As far as I know, Python's thread and process pools leave the choice of core to the OS.

I tried to adapt the code on my side (https://github.com/12rambau/bfast_gpu/blob/cceed75a63713f7419ef2dcc4b309899d78b108f/component/scripts/process.py#L65). As you suggested, I used a single BFASTMonitor object and ran fit() on each window. I still don't see any improvement in terms of computation speed.

Could you share a working example that demonstrates the time improvement when using the OpenCL implementation?

mortvest commented 3 years ago

If you have any clue on how to launch processes on specific cores from within Python, I would be extremely interested. As far as I know, Python's thread and process pools leave the choice of core to the OS.

I was talking about working on multiple GPUs, not CPU cores

Could you share a working example that demonstrates the time improvement when using the OpenCL implementation?

You can take peru_small_opencl, run it as it is, and then set backend to python and run it again. Note that peru_small_opencl.py and peru_small_python.py do not use the same dataset, hence their performance should not be compared.
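
As a rough sketch of that comparison, assuming a hypothetical load_peru_small() helper that returns the data cube and its acquisition dates, and that the constructor arguments below match the installed bfast version:

import time
from datetime import datetime
from bfast import BFASTMonitor

data, dates = load_peru_small()  # hypothetical loader for the peru_small cube

for backend in ("opencl", "python"):
    model = BFASTMonitor(
        start_monitor=datetime(2010, 1, 1),  # placeholder monitoring start
        freq=365,
        backend=backend,
        device_id=0,
        verbose=0,
    )
    t0 = time.time()
    model.fit(data, dates, nan_value=-32768)
    print(f"{backend}: {time.time() - t0:.1f}s")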

12rambau commented 3 years ago

I was talking about working on multiple GPUs, not CPU cores

Sorry if my question wasn't clear; I obviously have difficulties with the OpenCL backend, so yes, I'm talking about GPU cores.

mortvest commented 3 years ago

I just looked at your code. How many tiles do you have in your dataset?

You can also try running the code without the multiprocessing part; it can sometimes drastically affect performance if something is not set up correctly.

12rambau commented 3 years ago

The number of tiles depends entirely on the area of the time-series AOI (it could be anything from 1 to 100), but the tiles are processed sequentially. If you're talking about the sub-tiles, their number is between 4 and 8. I also tried doing everything sequentially and it was drastically slower, so I think I'm not far from something that works (multi-threading at least speeds up the data reading, pre-processing and cropping).

12rambau commented 3 years ago

You can take peru_small_opencl, run it as it is, and then set backend to python and run it again. Note that peru_small_opencl.py and peru_small_python.py do not use the same dataset, hence their performance should not be compared.

I compared the two files in Visual Studio and it appears that the only difference between them is the backend on line 58. The other difference you might point out is the slicing of the data on line 64 of the Python version, but that line is commented out.

This means:

1/ that I can compare the two results
2/ that I don't have a working example showing that the OpenCL backend is faster than the Python one

Could you change the OpenCL one, or give me the URL of the files here?

fnands commented 2 years ago

Hey, so to resurrect this:

I ran into the same issue when running on a new machine.

Same dataset and same code as before, but suddenly the OpenCL version took ~2 hours instead of the ~3 minutes it took previously.

The issue came down to the fact that platform_id defaults to 0, and for some reason this new machine had a simulator as platform 0, so of course simulating a GPU on the CPU is going to be crazy slow. See below:

$ clinfo --list
Platform #0: Oclgrind
 `-- Device #0: Oclgrind Simulator
Platform #1: NVIDIA CUDA
 `-- Device #0: Tesla T4

The fix was trivial: just set platform_id = 1.
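
Something along these lines (a sketch; the exact constructor arguments depend on the bfast version):

from datetime import datetime
from bfast import BFASTMonitor

model = BFASTMonitor(
    start_monitor=datetime(2019, 1, 1),  # placeholder monitoring start
    backend="opencl",
    platform_id=1,   # NVIDIA CUDA platform (#1 in the clinfo listing above)
    device_id=0,     # Tesla T4
)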