Open claczny opened 7 years ago
I explored this further, and here is a minimal working example:
library(Rtsne.multicore) # Load package
library(digest)
iris_unique <- unique(iris) # Remove duplicates
mat <- as.matrix(iris_unique[,1:4])
set.seed(42) # Sets seed for reproducibility
tsne_out1 <- Rtsne.multicore(mat, num_threads = 1) # Run TSNE
set.seed(42) # Sets seed for reproducibility
tsne_out1_2 <- Rtsne.multicore(mat, num_threads = 1) # Run TSNE
set.seed(42) # Sets seed for reproducibility
tsne_out2 <- Rtsne.multicore(mat, num_threads = 2) # Run TSNE
set.seed(42) # Sets seed for reproducibility
tsne_out2_2 <- Rtsne.multicore(mat, num_threads = 2) # Run TSNE
set.seed(42) # Sets seed for reproducibility
tsne_out3 <- Rtsne.multicore(mat, num_threads = 3) # Run TSNE
set.seed(42) # Sets seed for reproducibility
tsne_out3_2 <- Rtsne.multicore(mat, num_threads = 3) # Run TSNE
set.seed(42) # Sets seed for reproducibility
tsne_out4 <- Rtsne.multicore(mat, num_threads = 4) # Run TSNE
print(digest(tsne_out1))
print(digest(tsne_out1_2))
print(digest(tsne_out2))
print(digest(tsne_out2_2))
print(digest(tsne_out3))
print(digest(tsne_out3_2))
print(digest(tsne_out4))
And here is some demo output from RStudio, sourcing the script twice:
> source('~/.active-rstudio-document')
[1] "6adbcd6eb0106f49c7ac0a99eae369fc"
[1] "6adbcd6eb0106f49c7ac0a99eae369fc"
[1] "6269caaf71aca51ca57e2ead7425a14f"
[1] "6269caaf71aca51ca57e2ead7425a14f"
[1] "82974082989bc301349e03f3d9ee5c5b"
[1] "a8c779d9a4f54f2c14d84b624ffe9da9"
[1] "ccc0b4af068a4c2005504c0b1493e256"
> source('~/.active-rstudio-document')
[1] "6adbcd6eb0106f49c7ac0a99eae369fc"
[1] "6adbcd6eb0106f49c7ac0a99eae369fc"
[1] "6269caaf71aca51ca57e2ead7425a14f"
[1] "6269caaf71aca51ca57e2ead7425a14f"
[1] "b3479248cefc9b979521e13b25418223"
[1] "07dd9ce0d52e0cb0d1332f8d4849675c"
[1] "8b3a73318d64dd07f96ecdc2e06251d5"
As you can see, the results are consistent between different runs using the same number of threads (here for 1 or 2 threads), yet they differ when using different numbers of threads. Moreover, I am confused as to why the results for 3 threads and 4 threads differ between two runs with the same seed, i.e., why they behave differently from the 1-thread and 2-thread cases.
This is quite puzzling to me.
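A hash only tells you whether two results are byte-identical, not how far apart they are. The following is a minimal sketch of a numeric comparison. `compare_embeddings` is a hypothetical helper, not part of Rtsne.multicore, and it assumes the embedding coordinates are available as plain numeric matrices (for example the `Y` element of the returned list, as in Rtsne). The matrices below are stand-in data, not real t-SNE output.

```r
# Hypothetical helper (not part of Rtsne.multicore): summarise how much
# two embeddings of the same shape differ.
compare_embeddings <- function(a, b) {
  stopifnot(is.matrix(a), is.matrix(b), all(dim(a) == dim(b)))
  diffs <- abs(a - b)
  list(
    identical          = identical(a, b),           # exact, bit-for-bit match?
    max_abs_diff       = max(diffs),                # largest coordinate deviation
    n_points_differing = sum(rowSums(diffs > 0) > 0) # rows with any deviation
  )
}

# Stand-in data for illustration only: two 3 x 2 "embeddings" that differ
# in a single coordinate by a tiny amount.
a <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2)
b <- a
b[2, 1] <- b[2, 1] + 1e-9

res <- compare_embeddings(a, b)
print(res)
```

With the example above, this could be called as `compare_embeddings(tsne_out1$Y, tsne_out2$Y)`, again assuming the output list has a `Y` element. The size of `max_abs_diff` would help to tell rounding-level noise apart from genuinely different embeddings.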
Not sure, as we didn't implement the multicore support ourselves; we just wrapped the existing implementation. See https://github.com/jkrijthe/Rtsne/issues/16, the Rtsne package has integrated the same multicore support. I'd suggest checking whether that package shows the same issue and discussing it with the author.
Thanks for the answer.
Maybe you can tell me what I have missed there, but the Rtsne package does not (yet) seem to contain parallelisation support. Rather, there seems to be a "derivative" of it (https://github.com/rappdw/tsne), which appears to be Python-based and currently lacks a wrapper for convenient use in R.
I observed that the results differ based on the number of threads specified.
In my application, which uses BH-SNE to create a 2D embedding followed by automated clustering with DBSCAN, I have replaced the single-threaded Rtsne call with a call to your multi-threaded Rtsne.multicore. This was nice and easy thanks to the similarity of both interfaces.
However, when I run the application, the results differ ever so slightly, as indicated below (just the first couple of points each time):
Using 1 thread
Using 2 threads
Using 3 threads
Using 4 threads
The results using the same number of threads seem to be consistent between different runs, though, which is good at least :)
Using 1 thread - a second run
And for all the points, computing the MD5SUM:
While the differences are hard to spot by eye in a 2D scatterplot, the automatic clustering is affected by them.
Your input is greatly appreciated!
Best,
Cedric