DavideNardone / MTSS-Multivariate-Time-Series-Software

A GP-GPU/CPU Dynamic Time Warping (DTW) implementation for the analysis of Multivariate Time Series (MTS).
MIT License

Performance of GPU version #7

Closed: krischer closed this issue 5 years ago

krischer commented 5 years ago

I cannot reproduce the performance claims of the software.

I ran some performance tests of the GPU version on a single Nvidia Pascal Titan X GPU and of the CPU version on a single core of an Intel(R) Core(TM) i7-6850K @ 3.60GHz. I first ran the following two tests:

# Classification CPU
time ./mdtwObj -t CLASSIFICATION -i CPU 3 1 \
    -f data/classification/rm_1/X_MAT data/classification/rm_1/Y_MAT data/classification/rm_1/Z_MAT \
    -k 10 -o 1000 152 -m 0 DTW 2>&1 | tee classification_CPU.txt

# Classification GPU
time ./mdtwObj -t CLASSIFICATION -i GPU 3 512 1 \
    -f data/classification/rm_1/X_MAT data/classification/rm_1/Y_MAT data/classification/rm_1/Z_MAT \
    -k 10 -o 1000 152 -m 0 DTW -d 0 2>&1 | tee classification_GPU.txt

Results are here: classification_CPU.txt, classification_GPU.txt

The GPU version is about a factor of 27 times faster than the single-threaded CPU version, which seems more realistic than the quoted three orders of magnitude. Also, this compares against only a single core (this particular CPU has 6 cores/12 threads), so the CPU code would have to be parallelized for a fair comparison.

Simply compiling the code with -O3 makes the CPU version only about 6 times slower than the GPU version (results are here: classification_CPU_O3.txt, classification_GPU_O3.txt). The optimisation flags don't make the GPU code run any faster. Given that the CPU version could be made roughly another 6 times faster with parallelisation, the claim that the GPU version is faster than the CPU version does not hold, even without comparing to other implementations.
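For anyone who wants to reproduce the -O3 comparison, a build line along the following lines should work (the source file list is a placeholder, not the repository's actual Makefile rule). Note that nvcc's -O level applies to the host code; as far as I know, device code is already compiled by ptxas at its default optimization level, which would explain why the flag does not change the GPU timings.

# Placeholder build line, not the project's actual Makefile rule:
nvcc -O3 -o mdtwObj <sources>.cu   # -O3 optimizes the host (CPU) code paths;
                                   # ptxas already optimizes device code by default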

Part of a review at: https://github.com/openjournals/joss-reviews/issues/1049

DavideNardone commented 5 years ago

I understand your concerns about the performance and I'll try to give you the best explanation I can for this issue. The reason why the GPU version is only roughly 6 times faster than the CPU version here is mainly the size of the dataset: the software is intended to work on much larger batches of data than the small example tests used here. In addition, to be honest, I didn't consider the case of comparing my software with either a parallel CPU version or an optimized compiled version.

Although this comment might not be the best answer to your question, I've also found here the following issue with compiling a CUDA program with the -O3 option, but I don't know whether that is the case here or not.

Sadly, I cannot make my GPU version faster for this task, since this version of the Multi-Dimensional Dynamic Time Warping is the best I could write. Anyhow, whoever uses the software will still benefit from it, especially when the dataset consists of more than a thousand time series.

krischer commented 5 years ago

I'll try the other performance flags for the CUDA compilation once you deem the code ready for another review round! Can you provide a large enough data set to properly test the software by then?

I do think that the GPU version should be quite a bit faster (especially given that the machine I tested on has a very fast GPU and only a mid-range CPU). If that is not the case, maybe there are some easy-to-find bottlenecks? Did you try the NVIDIA Visual Profiler (https://developer.nvidia.com/nvidia-visual-profiler)? It's really very powerful.
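For a quick first look, something like the following should already help (a sketch only; the mdtwObj arguments simply repeat the GPU command from my first comment). nvprof's default summary splits GPU time between the kernels and the host/device memcpy operations, and a recorded profile can then be opened in the Visual Profiler:

# Default summary: per-kernel and per-memcpy share of GPU time.
nvprof ./mdtwObj -t CLASSIFICATION -i GPU 3 512 1 \
    -f data/classification/rm_1/X_MAT data/classification/rm_1/Y_MAT data/classification/rm_1/Z_MAT \
    -k 10 -o 1000 152 -m 0 DTW -d 0

# Record a profile that can be imported into the NVIDIA Visual Profiler (nvvp).
nvprof --output-profile mdtw_gpu.nvprof ./mdtwObj -t CLASSIFICATION -i GPU 3 512 1 \
    -f data/classification/rm_1/X_MAT data/classification/rm_1/Y_MAT data/classification/rm_1/Z_MAT \
    -k 10 -o 1000 152 -m 0 DTW -d 0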

DavideNardone commented 5 years ago

I've uploaded two big datasets at the following link. The information about these datasets is summarized below:

| Name | # variables | # classes | Train set size | Test set size | Dataset size | Time series length |
| --- | --- | --- | --- | --- | --- | --- |
| UCI_CHAR | 9 | 6 | 7352 | 2947 | 10299 | 561 |
| Motor imagery in ECoG recordings | 64 | 2 | 278 | 100 | 378 | 3000 |

You can run the software on both CPU and GPU by using the following commands:

CPU: ./mdtwObj -t CLASSIFICATION -i CPU 3 0 -f <path_dataset>/DATA <path_dataset>/LABEL -k 10 0 -o 10299 561 -m 0 DTW -d 0

GPU: ./mdtwObj -t CLASSIFICATION -i GPU 3 <num_threads> 0 -f <path_dataset>/DATA <path_dataset>/LABEL -k 10 0 -o 10299 561 -m 0 DTW -d 0

I do know the NVIDIA Visual Profiler tool and I will try to use it to understand whether there's a bottleneck in my code. At first glance, I'd say a likely bottleneck might lie in transferring the data between the host and the device.
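One way to check that from the command line (a sketch based on the GPU command above; the dataset path and thread count are placeholders): the per-operation GPU trace lists every [CUDA memcpy HtoD] with its size, duration and throughput, so it should be easy to see whether the transfers or the DTW kernels dominate the run time.

# Per-operation trace: every kernel launch and host<->device copy, with sizes and durations.
nvprof --print-gpu-trace ./mdtwObj -t CLASSIFICATION -i GPU 3 <num_threads> 0 \
    -f <path_dataset>/DATA <path_dataset>/LABEL -k 10 0 -o 10299 561 -m 0 DTW -d 0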

karlrupp commented 5 years ago

I encountered similar results (and thus reached the same conclusions) as @krischer. In my case, the example referenced in the first comment runs in 120 seconds on the CPU (Xeon E3-1240 v3) and 9 seconds on the GPU (GTX 760).

Keeping in mind that the CPU version runs on only a single core and is built without compiler optimizations by default, the claims of three orders of magnitude performance gains in the paper cannot be supported.

Part of a review at: https://github.com/openjournals/joss-reviews/issues/1049