12rambau opened this issue 3 years ago
This is definitely not normal. The OpenCL backend should be much faster than the Python backend. What GPU are you using? What version of the code are you using?
I have just tried running the version from the master branch on my laptop and on a GPU machine with an NVIDIA A100. On both machines, the OpenCL version is at least three times faster than the Python backend on peru_small (it takes a while to initialize the GPU environment, so the speedup should be much bigger on larger datasets). I have also just tried running it on a G4 machine (I apparently still have access to SEPAL); peru_small_opencl ran in 30 seconds with OpenCL.
OK, so I don't get exactly the same results, because I turned your file into a notebook to simplify displaying my results. The order of magnitude between OpenCL and Python remains the same (~3x).
If I understand correctly, it means that the windowing I'm doing here https://github.com/12rambau/bfast_gpu/blob/2f4540bd14c2af70720f24229833fb43834155e8/component/scripts/process.py#L127 is actually slowing down the process (since I create as many 512x512 windows as I can, I increase the number of initializations the program actually needs to run)?
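For reference, the windowing at that line boils down to something like this (a minimal sketch; `iter_windows` is a hypothetical helper, not part of bfast or bfast_gpu):

```python
# Sketch of tiling a raster into 512x512 windows, as described above.
# Each window would trigger its own GPU setup/fit, which is the
# suspected overhead.
def iter_windows(height, width, size=512):
    """Yield (row_slice, col_slice) pairs covering a height x width raster."""
    for r0 in range(0, height, size):
        for c0 in range(0, width, size):
            yield (slice(r0, min(r0 + size, height)),
                   slice(c0, min(c0 + size, width)))

# A 1100 x 600 raster is covered by 3 x 2 = 6 windows; edge windows
# are smaller than 512 on each side.
windows = list(iter_windows(1100, 600))
```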
It is probably the case. Furthermore, I don't know if the parallelization approach that you are using will work on multiple GPUs, since BFASTMonitor would always choose the OpenCL device with device_id=0, unless told otherwise.
I would create one instance of BFASTMonitor for each GPU (you can use device_id to choose GPU), then run fit() on each window. Alternatively, you can use bigger windows, and use n_chunks, to let BFASTMonitor split the window into smaller chunks. Depends on the size of the datasets that you are using and the available RAM.
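Sketched out, that suggestion could look like the following (an illustration only: it assumes bfast and a working OpenCL setup, so the import is guarded; the parameter names follow the bfast API, but the concrete values are placeholders):

```python
try:
    from bfast import BFASTMonitor  # needs bfast + an OpenCL runtime
except ImportError:
    BFASTMonitor = None

def round_robin(n_windows, n_gpus):
    """Which GPU index each window is assigned to."""
    return [i % n_gpus for i in range(n_windows)]

def fit_all(windows, dates, start_monitor, n_gpus):
    """One BFASTMonitor per GPU (selected via device_id), then fit()
    each window on its assigned device. With a single big window,
    fit(..., n_chunks=...) would instead let BFASTMonitor split it
    into smaller chunks internally."""
    models = [
        BFASTMonitor(start_monitor, freq=365, backend="opencl", device_id=gpu)
        for gpu in range(n_gpus)
    ]
    for win, gpu in zip(windows, round_robin(len(windows), n_gpus)):
        # nan_value here is a placeholder; use your dataset's no-data value
        models[gpu].fit(win, dates, nan_value=-32768)
```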
If you have any clue on how to launch multiple processes on specific cores within Python, I would be extremely interested. As far as I know, Python's multiprocessing pool leaves the choice of core to the OS.
I tried to adapt the code on my side (https://github.com/12rambau/bfast_gpu/blob/cceed75a63713f7419ef2dcc4b309899d78b108f/component/scripts/process.py#L65). As you suggested, I used a single BFASTMonitor object and ran fit() on each window. I still don't see any improvement in terms of computation speed.
Could you share a working example that demonstrates the time improvement when using the OpenCL implementation?
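On the specific-core question above: on Linux, a process can pin itself to a core via os.sched_setaffinity, so each multiprocessing worker can choose its core instead of leaving it to the OS (a sketch; this is a Linux-only API, and it only controls CPU cores, not GPU devices):

```python
import multiprocessing as mp
import os

def pin_to_core(core):
    """Restrict the calling process to a single CPU core (Linux-only API)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {core})

def worker(core):
    pin_to_core(core)
    # ... heavy work now runs on the chosen core ...
    return sorted(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else None

if __name__ == "__main__":
    # One process per core, pinned explicitly instead of letting the OS decide.
    with mp.Pool(processes=2) as pool:
        print(pool.map(worker, [0, 1]))
```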
> If you have any clue on how to launch multiple processes on specific cores within Python, I would be extremely interested. As far as I know, Python's multiprocessing pool leaves the choice of core to the OS.
I was talking about working on multiple GPUs, not CPU cores
> Could you share a working example that demonstrates the time improvement when using the OpenCL implementation?
You can take peru_small_opencl.py, run it as it is, and then set the backend to python and run it again. Note that peru_small_opencl.py and peru_small_python.py do not use the same dataset, hence their performance should not be compared.
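A minimal way to make that comparison reproducible is to time the fit under each backend with the same timer (a sketch; the BFASTMonitor lines in the comment are illustrative, not run here):

```python
import time

def timed(fn, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call."""
    t0 = time.perf_counter()
    out = fn(*args, **kwargs)
    return out, time.perf_counter() - t0

# Illustrative usage, mirroring the suggestion above:
#   for backend in ("opencl", "python"):
#       model = BFASTMonitor(start_monitor, backend=backend)
#       _, secs = timed(model.fit, data, dates)
#       print(backend, secs)
```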
> I was talking about working on multiple GPUs, not CPU cores
Sorry if my question wasn't clear; I obviously have difficulties with the OpenCL backend, so yes, I'm talking about GPU cores.
I just looked at your code. How many tiles do you have in your dataset?
You can also try running the code without the multiprocessing part. It can sometimes drastically affect the performance, if something is not set up correctly.
The number of tiles depends entirely on the surface of the TS AOI (it could be anything from 1 to 100), but tiles are processed sequentially. If you're talking about the sub-tiles, their number is between 4 and 8. I tried doing it sequentially as well and it was drastically slower, so I think I'm not far from something that works (multi-threading at least cuts down the time spent on data reading, pre-processing and cropping).
> You can take peru_small_opencl.py, run it as it is, and then set the backend to python and run it again. Note that peru_small_opencl.py and peru_small_python.py do not use the same dataset, hence their performance should not be compared.
I compared the two files in Visual Studio, and it appears that the only difference between them is the backend (line 58). The only other difference you might point out is at line 64 of the Python version, the slicing of the data, but that line is commented out.
It means:
1/ that I can compare the 2 results
2/ that I don't have a working example showing that the OpenCL backend is faster than the Python one
Could you change the OpenCL one, or give me the URLs of the files here?
Hey, so to resurrect this:
I ran into the same issue when running on a new machine.
Same dataset and same code as before, but suddenly the OpenCL version took ~2 hours instead of 3 minutes as it had before.
The issue came down to the fact that platform_id defaults to 0, but for some reason this new machine had a simulator as platform 0, and of course simulating a GPU on a CPU will be crazy slow.
See below:
```
$ clinfo --list
Platform #0: Oclgrind
 `-- Device #0: Oclgrind Simulator
Platform #1: NVIDIA CUDA
 `-- Device #0: Tesla T4
```
The fix was trivial: just set platform_id = 1.
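To avoid hard-coding the index, the platform can also be chosen by name (a sketch; `pick_platform` is a hypothetical helper, and the pyopencl part is guarded because it needs an OpenCL runtime):

```python
def pick_platform(names, avoid=("oclgrind", "simulator")):
    """Index of the first platform whose name does not look like a
    simulator; falls back to 0 if none qualifies."""
    for i, name in enumerate(names):
        if not any(bad in name.lower() for bad in avoid):
            return i
    return 0

# With the clinfo listing above, ["Oclgrind", "NVIDIA CUDA"] -> 1,
# which is the platform_id to pass on.

try:
    import pyopencl as cl
    platform_id = pick_platform([p.name for p in cl.get_platforms()])
except Exception:  # pyopencl missing or no OpenCL platforms available
    platform_id = pick_platform(["Oclgrind", "NVIDIA CUDA"])
```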
I am using this implementation of BFAST in the Bfast-gpu module of the SEPAL platform and I wanted to check how it behaves on a GPU machine.
I did my tests on a G4 machine (4 GPUs, 16 GB of RAM) on a 5-year TS over the Juaboso province (1 370 km² at 30x30 m resolution):
I was extremely surprised to see the Python backend running faster even on a GPU machine, so I tested it on the test data included in the repository (https://github.com/diku-dk/bfast/blob/master/examples/peru_small_opencl.py):
I'm showing you the mean of the 3 best runs out of 10.
My question is simple: what is the point of the OpenCL implementation if it's not faster than the Python one on GPU machines? Is this normal?