DmitryUlyanov / Multicore-TSNE

Parallel t-SNE implementation with Python and Torch wrappers.

Results differ based on number of threads #22

Open claczny opened 7 years ago

claczny commented 7 years ago

This is a "forward" from https://github.com/RGLab/Rtsne.multicore/issues/7

Part 1

I observed that the results differ based on the number of threads specified.

In my application, which uses BH-SNE to create a 2D embedding followed by automated clustering with DBSCAN, I replaced the single-threaded Rtsne call with a call to your multi-threaded Rtsne.multicore. This was nice and easy thanks to the similarity of both interfaces.

However, when I run the application, the results differ ever so slightly, as indicated below (just the first few points each time):

Using 1 thread

-4.3473001944841 -9.88816236259427
-0.264536173449281 2.26121958696939
-11.8037471711157 -1.23420653192463
18.5043209507443 -13.4638139443446
1.51823629529208 -27.2209786228982
8.44296382274354 11.5004388863181
17.0385503073606 -19.5842234534257
-1.80122124653633 -35.1542911986375
-14.9339466535662 11.4724805072396
-16.7179891732902 10.300907221322

Using 2 threads

-4.33102494052646 -9.94346771160292
-0.300330796745644 2.47627128482164
-14.4865548712467 3.83169546954971
18.0266761572745 -13.3481838170748
1.55009711170931 -27.3536683521347
8.57133969496983 11.704078885386
16.8146752705904 -19.4804761345993
-1.67702875389705 -35.6116919363096
-16.328562693303 10.9834569354747
-17.9212513482976 10.1738069116024

Using 3 threads

-4.15202535615338 -9.91628914440292
-0.266922842312901 2.30165398545058
-12.0458514750223 -1.26327092092668
18.3116039523395 -13.4472311793933
1.8728867702686 -27.0478452540983
8.21259960134093 11.338018514761
16.938103908809 -19.4664656504238
-1.51129210868152 -35.5926372619633
-15.7107052664802 10.622091607029
-16.9275577907434 10.5760540704756

Using 4 threads

-4.40493207317474 -10.2542865145978
-0.240311071414228 2.34386945654285
-11.613066543124 -1.22167721092907
17.978213066292 -13.6367838896947
1.68103298346623 -27.3950001130062
8.48320430773571 11.5841961868582
16.5975194709815 -19.6467988772466
-1.21063128661383 -35.6738754692542
-16.2962040171112 11.6000609166704
-16.4988660902924 10.7927849813962

The results using the same number of threads seem to be consistent between different runs, though, which is good at least :)

Using 1 thread - a second run

-4.3473001944841 -9.88816236259427
-0.264536173449281 2.26121958696939
-11.8037471711157 -1.23420653192463
18.5043209507443 -13.4638139443446
1.51823629529208 -27.2209786228982
8.44296382274354 11.5004388863181
17.0385503073606 -19.5842234534257
-1.80122124653633 -35.1542911986375
-14.9339466535662 11.4724805072396
-16.7179891732902 10.300907221322

And for all the points, computing the MD5SUM:

cat ./one_threads/one.bin.embedding.tsv | awk '{print $1,$2}' | gmd5sum
2410c2539be68ffe1f52d1be0f04bfac  -
cat ./one_threads_old/one.bin.embedding.tsv | awk '{print $1,$2}' | gmd5sum
2410c2539be68ffe1f52d1be0f04bfac  -
cat ./two_threads/two.bin.embedding.tsv | awk '{print $1,$2}' | gmd5sum
1f7dd4212d74b162420c79e619b3b91b  -
cat ./three_threads/three.bin.embedding.tsv | awk '{print $1,$2}' | gmd5sum
f659b3527318c9545766fed14fc72daa  -
cat ./four_threads/four.bin.embedding.tsv | awk '{print $1,$2}' | gmd5sum
0e7425b7acf3438d047fb1550bbd069f  -

While the differences are hard to spot by eye in a 2D scatterplot, the automatic clustering is affected by them.
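
A checksum only says whether two outputs are byte-identical, not how far apart they are. As a rough sketch (written in Python; the helper name `max_abs_diff` is hypothetical and not part of any package mentioned here), the deviation between the first ten points listed above for 1 thread and 2 threads could be measured like this:

```python
# First ten embedding points as reported above (x, y).
one_thread = [
    (-4.3473001944841, -9.88816236259427),
    (-0.264536173449281, 2.26121958696939),
    (-11.8037471711157, -1.23420653192463),
    (18.5043209507443, -13.4638139443446),
    (1.51823629529208, -27.2209786228982),
    (8.44296382274354, 11.5004388863181),
    (17.0385503073606, -19.5842234534257),
    (-1.80122124653633, -35.1542911986375),
    (-14.9339466535662, 11.4724805072396),
    (-16.7179891732902, 10.300907221322),
]
two_threads = [
    (-4.33102494052646, -9.94346771160292),
    (-0.300330796745644, 2.47627128482164),
    (-14.4865548712467, 3.83169546954971),
    (18.0266761572745, -13.3481838170748),
    (1.55009711170931, -27.3536683521347),
    (8.57133969496983, 11.704078885386),
    (16.8146752705904, -19.4804761345993),
    (-1.67702875389705, -35.6116919363096),
    (-16.328562693303, 10.9834569354747),
    (-17.9212513482976, 10.1738069116024),
]


def max_abs_diff(a, b):
    """Return (largest absolute coordinate difference, index of that point).

    Hypothetical helper: points are compared row by row, so both lists
    are assumed to hold the same points in the same order.
    """
    worst, worst_idx = 0.0, -1
    for i, (p, q) in enumerate(zip(a, b)):
        d = max(abs(p[0] - q[0]), abs(p[1] - q[1]))
        if d > worst:
            worst, worst_idx = d, i
    return worst, worst_idx


diff, idx = max_abs_diff(one_thread, two_threads)
print(f"largest coordinate difference: {diff:.4f} at point index {idx}")
```

For these ten points the largest deviation is at the third point (index 2), which moves by several units, while most other points move by well under one unit. A magnitude like this is more informative for a distance-based clustering step such as DBSCAN than a changed checksum.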

Your input is greatly appreciated!

Part 2

I explored this further; here is a minimal working example:

library(Rtsne.multicore) # Load package
library(digest)
iris_unique <- unique(iris) # Remove duplicates
mat <- as.matrix(iris_unique[,1:4])
set.seed(42) # Sets seed for reproducibility
tsne_out1 <- Rtsne.multicore(mat, num_threads = 1) # Run TSNE
set.seed(42) # Sets seed for reproducibility
tsne_out1_2 <- Rtsne.multicore(mat, num_threads = 1) # Run TSNE
set.seed(42) # Sets seed for reproducibility
tsne_out2 <- Rtsne.multicore(mat, num_threads = 2) # Run TSNE
set.seed(42) # Sets seed for reproducibility
tsne_out2_2 <- Rtsne.multicore(mat, num_threads = 2) # Run TSNE
set.seed(42) # Sets seed for reproducibility
tsne_out3 <- Rtsne.multicore(mat, num_threads = 3) # Run TSNE
set.seed(42) # Sets seed for reproducibility
tsne_out3_2 <- Rtsne.multicore(mat, num_threads = 3) # Run TSNE
set.seed(42) # Sets seed for reproducibility
tsne_out4 <- Rtsne.multicore(mat, num_threads = 4) # Run TSNE
print(digest(tsne_out1))
print(digest(tsne_out1_2))
print(digest(tsne_out2))
print(digest(tsne_out2_2))
print(digest(tsne_out3))
print(digest(tsne_out3_2))
print(digest(tsne_out4))

And some demo output from RStudio:

> source('~/.active-rstudio-document')
[1] "6adbcd6eb0106f49c7ac0a99eae369fc"
[1] "6adbcd6eb0106f49c7ac0a99eae369fc"
[1] "6269caaf71aca51ca57e2ead7425a14f"
[1] "6269caaf71aca51ca57e2ead7425a14f"
[1] "82974082989bc301349e03f3d9ee5c5b"
[1] "a8c779d9a4f54f2c14d84b624ffe9da9"
[1] "ccc0b4af068a4c2005504c0b1493e256"
> source('~/.active-rstudio-document')
[1] "6adbcd6eb0106f49c7ac0a99eae369fc"
[1] "6adbcd6eb0106f49c7ac0a99eae369fc"
[1] "6269caaf71aca51ca57e2ead7425a14f"
[1] "6269caaf71aca51ca57e2ead7425a14f"
[1] "b3479248cefc9b979521e13b25418223"
[1] "07dd9ce0d52e0cb0d1332f8d4849675c"
[1] "8b3a73318d64dd07f96ecdc2e06251d5"

As you can see, the results are consistent between different runs using the same number of threads (here for 1 or 2 threads) yet differ when using different numbers of threads. Moreover, I am confused as to why the results for 3 threads and 4 threads are different between two runs, i.e., behave differently than 1 or 2 threads.
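
One general property that can produce a dependence on the number of threads is that floating-point addition is not associative: splitting a sum into per-thread partial sums changes the order of operations, and so can change the rounded result. Whether this is what actually happens inside Rtsne.multicore is an assumption and has not been confirmed here. The sketch below (Python; `chunked_sum` is a hypothetical helper that only mimics per-thread partial sums and uses no real threads) shows the effect in isolation:

```python
import math


def sequential_sum(values):
    """Plain left-to-right accumulation.

    An explicit loop is used instead of the built-in sum(), which in
    recent Python versions may apply compensated summation to floats.
    """
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, num_chunks):
    """Sum contiguous chunks separately, then add the partial sums.

    Hypothetical stand-in for a reduction in which each thread sums
    its own slice of the data.
    """
    size = math.ceil(len(values) / num_chunks)
    partials = [
        sequential_sum(values[i:i + size])
        for i in range(0, len(values), size)
    ]
    return sequential_sum(partials)


# Values chosen so that rounding is visible: 1.0 is below the spacing
# of doubles near 1e16 and is lost when added to it.
values = [1e16, 1.0, -1e16, 1.0]

one = chunked_sum(values, 1)  # 1.0
two = chunked_sum(values, 2)  # 0.0
print(one, two)

# The same effect with everyday numbers:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
```

In an iterative optimisation, small rounding differences of this kind could be amplified over many iterations. This sketch does not by itself explain why the 3-thread and 4-thread results also vary between runs with the same seed.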

This is quite puzzling to me and your input is highly appreciated!

Best,

Cedric