Kotlin / kotlindl

High-level Deep Learning Framework written in Kotlin and inspired by Keras
Apache License 2.0

KotlinDL on i5 CPU is 3x faster than Python on RTX2080 #263

Open wuhanstudio opened 3 years ago

wuhanstudio commented 3 years ago

Hi,

Actually, this is not an issue, it's something interesting and intriguing about the performance of KotlinDL.

I trained a dense neural network using both KotlinDL and Python3 on the mnist dataset, and the test result is intriguing:

KotlinDL on an i5 CPU (without AVX) is 3x faster than Python on an RTX2080 GPU.

https://github.com/wuhanstudio/kotlindl-mnist

I'm wondering whether KotlinDL is really this fast or whether there is something wrong with my settings ;)

(Or whether Python is really this slow...)

[demo image]
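
For reference, the dense model in this kind of benchmark looks roughly like the standard KotlinDL MNIST example below. This is only a sketch (the exact layer sizes and hyperparameters are in the linked repo), and the import paths assume a 2021-era KotlinDL release; they moved slightly in later versions.

```kotlin
import org.jetbrains.kotlinx.dl.api.core.Sequential
import org.jetbrains.kotlinx.dl.api.core.activation.Activations
import org.jetbrains.kotlinx.dl.api.core.layer.core.Dense
import org.jetbrains.kotlinx.dl.api.core.layer.core.Input
import org.jetbrains.kotlinx.dl.api.core.layer.reshaping.Flatten
import org.jetbrains.kotlinx.dl.api.core.loss.Losses
import org.jetbrains.kotlinx.dl.api.core.metric.Metrics
import org.jetbrains.kotlinx.dl.api.core.optimizer.Adam
import org.jetbrains.kotlinx.dl.dataset.mnist

fun main() {
    // mnist() downloads the dataset once and keeps it as an OnHeapDataset (JVM heap memory).
    val (train, test) = mnist()

    Sequential.of(
        Input(28, 28, 1),
        Flatten(),
        Dense(256, Activations.Relu),
        Dense(128, Activations.Relu),
        Dense(10, Activations.Linear)
    ).use { model ->
        model.compile(
            optimizer = Adam(),
            loss = Losses.SOFT_MAX_CROSS_ENTROPY_WITH_LOGITS,
            metric = Metrics.ACCURACY
        )
        model.fit(dataset = train, epochs = 10, batchSize = 100)
        val accuracy = model.evaluate(dataset = test, batchSize = 100).metrics[Metrics.ACCURACY]
        println("Test accuracy: $accuracy")
    }
}
```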

zaleslaw commented 3 years ago

First of all, thanks for spending time on this benchmark! It gives us food for thought.

In both cases, it doesn't really depend on Kotlin or Python; it depends on the runtime used by the program (KotlinDL uses the TF 1.15 runtime, while Keras uses TF 2.6 or 2.5, which, surprise, could be slower in some cases) and on where the dataset is stored (I suppose that Python Keras reads from disk again and again, while KotlinDL uses OnHeapDataset, which caches the whole MNIST dataset in JVM heap memory).

Also, KotlinDL uses Floats by default for calculations, which could be 2x faster than using Doubles, for example, and so on...
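
As a toy illustration of the Float-vs-Double point (plain Kotlin, nothing KotlinDL-specific): the same elementwise pass touches half as much memory with Floats, which is often where the speedup comes from.

```kotlin
import kotlin.system.measureNanoTime

fun main() {
    val n = 10_000_000
    val floats = FloatArray(n) { it.toFloat() }
    val doubles = DoubleArray(n) { it.toDouble() }

    // Same elementwise work over a Float array and a Double array.
    var fSum = 0.0f
    val floatNs = measureNanoTime { for (x in floats) fSum += x * 0.5f }

    var dSum = 0.0
    val doubleNs = measureNanoTime { for (x in doubles) dSum += x * 0.5 }

    println("Float pass:  ${floatNs / 1_000_000} ms")
    println("Double pass: ${doubleNs / 1_000_000} ms")
}
```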

Probably the model is not big enough to get a speed boost on the GPU...

But I am happy that training from scratch using KotlinDL is much more efficient than using modern Keras...

Give me some time to profile it correctly.

wuhanstudio commented 3 years ago

Thanks for your reply.

I used the same runtime (TensorFlow 1.15 + CUDA 10.0 + cuDNN 7.6.5) for both Python and Kotlin. Indeed, perhaps the model is not large enough to make full use of a GPU, so the data transfer between CPU and GPU adds extra overhead.

I'll further test this with Python on CPU and KotlinDL on CPU, so that they have the same runtime and hardware.

It seems KotlinDL is still the fastest on CPU.

Besides, the training process somehow converges much faster using KotlinDL, and usually yields a model with higher accuracy.

zaleslaw commented 2 years ago

Hi @wuhanstudio, could I suggest you check the following scenario, based on a deeper VGG network and a bigger dataset, CIFAR-10 (it is also an embedded dataset in Keras)?

The basic code for training VGG on CIFAR-10 can be found here

In this case, KotlinDL does not cache the whole dataset in JVM memory and will act the same way as Keras.

Let's do this research together and see the results; I hope to include your feedback/benchmark in a presentation or blog post about KotlinDL.

wuhanstudio commented 2 years ago

Hi,

I just updated my repo to include the test for VGG11 on CIFAR-10.

Kotlin (~250s): https://github.com/wuhanstudio/kotlindl-mnist/blob/master/src/main/kotlin/cirar10/vgg11.kt

Python (~125s): https://github.com/wuhanstudio/kotlindl-mnist/blob/master/python/vgg11.py

This time, the JVM version is about 2x slower than Keras on the same CPU.

zaleslaw commented 2 years ago

Could you measure the same for 10, 50, and 100 epochs instead of the default 3, on CPU and on the same GPU?

Do you know how to run KotlinDL on GPU?

KotlinDL uses JNI internally to call the TensorFlow C API; it could add some overhead, but not that much...
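
For what it's worth, a minimal harness for timing runs at different epoch counts could look roughly like this. It is only a sketch that reuses the dense MNIST model from above as a stand-in; the actual VGG-11/CIFAR-10 code lives in the linked repo, and the import paths assume a 2021-era KotlinDL release.

```kotlin
import kotlin.system.measureTimeMillis
import org.jetbrains.kotlinx.dl.api.core.Sequential
import org.jetbrains.kotlinx.dl.api.core.layer.core.Dense
import org.jetbrains.kotlinx.dl.api.core.layer.core.Input
import org.jetbrains.kotlinx.dl.api.core.layer.reshaping.Flatten
import org.jetbrains.kotlinx.dl.api.core.loss.Losses
import org.jetbrains.kotlinx.dl.api.core.metric.Metrics
import org.jetbrains.kotlinx.dl.api.core.optimizer.Adam
import org.jetbrains.kotlinx.dl.dataset.mnist

fun main() {
    val (train, test) = mnist()

    for (epochs in listOf(10, 50, 100)) {
        // Build a fresh model for each run so every measurement starts from scratch.
        Sequential.of(
            Input(28, 28, 1),
            Flatten(),
            Dense(256),
            Dense(128),
            Dense(10)
        ).use { model ->
            model.compile(
                optimizer = Adam(),
                loss = Losses.SOFT_MAX_CROSS_ENTROPY_WITH_LOGITS,
                metric = Metrics.ACCURACY
            )
            val seconds = measureTimeMillis {
                model.fit(dataset = train, epochs = epochs, batchSize = 100)
            } / 1000.0
            val accuracy = model.evaluate(dataset = test, batchSize = 100).metrics[Metrics.ACCURACY]
            println("epochs=$epochs time=${seconds}s accuracy=$accuracy")
        }
    }
}
```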

wuhanstudio commented 2 years ago

Hi, could you give some advice on using KotlinDL with Anaconda?

Is there any way to use the CUDA and cuDNN installed in an Anaconda environment by changing environment variables?

The GPU server in our lab uses CUDA 11.0 and cuDNN 8.x, but KotlinDL requires CUDA 10.0 + cuDNN 7.6.3. Here's the log showing that it fails to find the cuDNN installed via Anaconda:

2021-11-14 11:42:22.529784: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2021-11-14 11:42:22.555399: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3699850000 Hz
2021-11-14 11:42:22.556232: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f361d9202b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-11-14 11:42:22.556263: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-11-14 11:42:22.557090: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-11-14 11:42:22.684207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1a:00.0
2021-11-14 11:42:22.684564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:68:00.0
2021-11-14 11:42:22.684668: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/ros/melodic/lib
2021-11-14 11:42:22.684725: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/ros/melodic/lib
2021-11-14 11:42:22.684767: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/ros/melodic/lib
2021-11-14 11:42:22.684800: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/ros/melodic/lib
2021-11-14 11:42:22.684836: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/ros/melodic/lib
2021-11-14 11:42:22.684873: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/ros/melodic/lib
2021-11-14 11:42:22.684908: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/ros/melodic/lib
2021-11-14 11:42:22.684914: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-11-14 11:42:22.684940: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-11-14 11:42:22.684948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 1
2021-11-14 11:42:22.684955: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N N
2021-11-14 11:42:22.684961: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 1:   N N

zaleslaw commented 2 years ago

Unfortunately, I don't know whether it is possible with Anaconda or not (highly likely it is not); in my case, I reinstalled the video card drivers manually, and it works fine.

wuhanstudio commented 2 years ago

Hi, these are my test results on CPU in seconds:

| Intel(R) Core(TM) i9-10900X CPU @ 3.70GHz | 10 epochs | 50 epochs | 100 epochs |
| --- | --- | --- | --- |
| Kotlin (CPU) | 266.936 | 1318.638 | 2785.489 |
| Python (CPU) | 299.029 | 1977.897 | 3719.283 |

Unfortunately, the two RTX2080Ti cards ran out of memory while training this model.

zaleslaw commented 2 years ago

Great data, @wuhanstudio! Looks very good, thanks for the great work!