Closed krischer closed 5 years ago
I understand your concerns about the performance and I'll try to give you the best explanation I can. The GPU version is only roughly 6 times faster than the CPU version mainly because of the size of the dataset. The software is intended to work on larger batches of data than the ones used here as example tests. In addition, to be honest, I didn't consider comparing my software with either a parallel version or an optimized compiled version.
Although this comment might not be the best answer to your question, I've also found here the following issue with compiling a CUDA program with the -O3 option, but I don't know whether this is the case or not.
Sadly, I cannot make my GPU version faster for this task, since this implementation of Multi-Dimensional Dynamic Time Warping is the best I could write. Anyhow, whoever uses the software will still benefit from it, especially when the dataset contains more than one thousand time series.
I'll try the other performance flags for the CUDA compilation once you deem the code ready for another review round! Can you provide a large enough data set to properly test the software by then?
I do think that the GPU version should be quite a bit faster (especially given that the machine I tested on has a very fast GPU and only a mid-range CPU). If that is not the case: Maybe there are some easy to figure out bottlenecks? Did you try this: https://developer.nvidia.com/nvidia-visual-profiler It's really very powerful.
I've uploaded two large datasets at the following link. The information about these datasets is summarized below:
| Name | # variables | # classes | Train set size | Test set size | Dataset size | Time series length |
|---|---|---|---|---|---|---|
| UCI_CHAR | 9 | 6 | 7352 | 2947 | 10299 | 561 |
| Motor imagery in ECoG recordings | 64 | 2 | 278 | 100 | 378 | 3000 |
You can run the software on both CPU and GPU by using the following commands:
CPU:
./mdtwObj -t CLASSIFICATION -i CPU 3 0 -f <path_dataset>/DATA <path_dataset>/LABEL -k 10 0 -o 10299 561 -m 0 DTW -d 0
GPU:
./mdtwObj -t CLASSIFICATION -i GPU 3 <num_threads> 0 -f <path_dataset>/DATA <path_dataset>/LABEL -k 10 0 -o 10299 561 -m 0 DTW -d 0
I do know the NVIDIA Visual Profiler tool and I will try to use it to understand whether there's a bottleneck in my code. At first glance, I would say that a likely bottleneck lies in transferring the data between the host and the device.
I encounter similar results (and thus conclusions) as @krischer . In my case, the example referenced in the first comment runs in 120 seconds on the CPU (Xeon E3-1240 v3) and 9 seconds on the GPU (GTX 760).
Keeping in mind that the CPU version (`short_dtw_c()` shows inefficient multi-dimensional arrays with `malloc`s and `free`s that could definitely be optimized further), the claims of three orders of magnitude performance gains in the paper cannot be supported.
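To illustrate the point about per-row `malloc` and `free` calls, here is a minimal sketch of a multivariate DTW distance that uses a single contiguous allocation and two rolling rows. It is not the project's `short_dtw_c()`: the function name `mdtw_contiguous`, the row-major data layout, the squared Euclidean local cost, and the `-1.0` error return are all assumptions made for the example.

```c
#include <math.h>
#include <stdlib.h>

/* Hypothetical helper, NOT the project's short_dtw_c().
 * Multivariate DTW between series a (n time steps) and b (m time steps),
 * each with d variables, stored row-major: a[i * d + k] is variable k
 * at time step i. Returns -1.0 if the allocation fails (assumed convention). */
double mdtw_contiguous(const double *a, size_t n,
                       const double *b, size_t m, size_t d)
{
    /* One allocation holding two rolling rows of the cost matrix,
     * instead of one malloc per row of a full n x m matrix. */
    double *buf = malloc(2 * (m + 1) * sizeof *buf);
    if (buf == NULL)
        return -1.0;

    double *prev = buf;
    double *curr = buf + (m + 1);

    /* Boundary row: only the empty-vs-empty alignment costs nothing. */
    prev[0] = 0.0;
    for (size_t j = 1; j <= m; ++j)
        prev[j] = INFINITY;

    for (size_t i = 1; i <= n; ++i) {
        curr[0] = INFINITY;
        for (size_t j = 1; j <= m; ++j) {
            /* Local cost: squared Euclidean distance over the d variables. */
            double cost = 0.0;
            for (size_t k = 0; k < d; ++k) {
                double diff = a[(i - 1) * d + k] - b[(j - 1) * d + k];
                cost += diff * diff;
            }
            /* Cheapest of the three predecessor cells. */
            double best = prev[j - 1];
            if (prev[j] < best)
                best = prev[j];
            if (curr[j - 1] < best)
                best = curr[j - 1];
            curr[j] = cost + best;
        }
        /* Swap the rolling rows. */
        double *tmp = prev;
        prev = curr;
        curr = tmp;
    }

    double result = prev[m];
    free(buf);
    return result;
}
```

Because only two rows are kept, memory use grows with the length of one series rather than with the product of both lengths, and the inner loop walks memory sequentially.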
Part of a review at: https://github.com/openjournals/joss-reviews/issues/1049
I cannot reproduce the performance claims of the software.
I ran some performance tests on a single Nvidia Pascal Titan X GPU and the CPU version on a single core of an Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz. I first ran the following two tests:
Results are here: classification_CPU.txt, classification_GPU.txt
The GPU version is about 27 times faster than the single-threaded CPU version, which seems more realistic than the quoted three orders of magnitude. Also, this is only a single core (this particular CPU has 6 cores/12 threads), so the code would have to be parallelized for a fair comparison.
Simply compiling the code with `-O3` makes the CPU version only about 6 times slower than the GPU version (results are here: classification_CPU_O3.txt, classification_GPU_03.txt). The optimisation flags don’t make the GPU code run any faster. Given that this could be made even 6 times faster with parallelisation the claim that the GPU version is faster than the CPU version does not hold even without comparing to other implementations.
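As a rough illustration of what parallelising the CPU version could look like, here is a minimal sketch of a 1-nearest-neighbour classification loop with an OpenMP pragma. Every name in it (`classify_1nn`, `dist_fn`, `squared_euclidean`) is hypothetical and not taken from the repository, and the squared Euclidean distance is only a stand-in for the DTW distance. The sketch assumes each test series can be classified independently of the others, which is what makes the outer loop safe to split across cores.

```c
#include <float.h>
#include <stddef.h>

/* Hypothetical distance callback; a DTW distance would be plugged in here. */
typedef double (*dist_fn)(const double *x, const double *y, size_t len);

/* Stand-in distance used only to keep the example self-contained. */
double squared_euclidean(const double *x, const double *y, size_t len)
{
    double sum = 0.0;
    for (size_t i = 0; i < len; ++i) {
        double diff = x[i] - y[i];
        sum += diff * diff;
    }
    return sum;
}

/* Hypothetical 1-NN classifier: series are stored back to back,
 * each occupying len consecutive doubles. */
void classify_1nn(const double *train, const int *train_labels, size_t n_train,
                  const double *test, int *predicted, size_t n_test,
                  size_t len, dist_fn dist)
{
    /* Each iteration writes only predicted[t], so iterations are independent.
     * Without OpenMP enabled at compile time the pragma is ignored and the
     * loop simply runs serially. */
    #pragma omp parallel for schedule(dynamic)
    for (long t = 0; t < (long)n_test; ++t) {
        double best = DBL_MAX;
        int label = -1;  /* stays -1 if there are no training series */
        for (size_t r = 0; r < n_train; ++r) {
            double dd = dist(test + (size_t)t * len, train + r * len, len);
            if (dd < best) {
                best = dd;
                label = train_labels[r];
            }
        }
        predicted[t] = label;
    }
}
```

With GCC or Clang this would be built with `-O3 -fopenmp`. How close the speed-up gets to the number of cores depends on memory bandwidth and load balance, so it would need to be measured rather than assumed.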