criteo / Spark-RSVD

Randomized SVD of large sparse matrices on Spark
Apache License 2.0

1M features x 60M rows? #3

Open · Tagar opened this issue 5 years ago

Tagar commented 5 years ago

Would this library scale to 1M features x 60M rows? More details on the problem here:

https://stats.stackexchange.com/questions/355260/distributed-pca-or-an-equivalent

Thank you.

alois-bissuel commented 5 years ago

If I understand correctly, the sparsity of your data is around 10%. Our library is routinely used to decompose matrices of size 100M x 100M, though much sparser ones. I see no reason why the library should not work, but you will need to tweak the parameters a bit. Adjust the block size and the number of blocks per partition so that each partition of the sparse matrix and of the dense embeddings stays under 2 GB (for the definitions of block size and number of blocks per partition, see our article on Medium), and start with a tiny embedding size (100, for instance). A configuration sketch is shown below.
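For concreteness, here is a minimal sketch of such a configuration, modeled on the usage example in the repository README (the `RSVDConfig` fields, `BlockMatrix.fromMatrixEntries`, and `RSVD.run` are taken from there; the concrete parameter values are starting points to tune, not recommendations, and the `...` placeholders stand for your own Spark context and data):

```scala
import com.criteo.rsvd._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Start with a small embedding dimension and moderate block sizes, then
// tune blockSize / partition*InBlocks so each partition stays under 2 GB.
val config = RSVDConfig(
  embeddingDim = 100,           // the "tiny" embedding size suggested above
  oversample = 30,
  powerIter = 1,
  seed = 0,
  blockSize = 50000,            // rows/cols per square block
  partitionWidthInBlocks = 35,  // blocks per partition, horizontally
  partitionHeightInBlocks = 10, // blocks per partition, vertically
  computeLeftSingularVectors = true,
  computeRightSingularVectors = true
)

val matHeight = 60000000L // 60M rows
val matWidth = 1000000L   // 1M features

val sc: SparkContext = ...          // your Spark context
val entries: RDD[MatrixEntry] = ... // your sparse (i, j, value) entries

// Pack the entries into the blocked sparse-matrix format, then run the
// randomized SVD.
val mat = BlockMatrix.fromMatrixEntries(entries, matHeight = matHeight,
  matWidth = matWidth, config.blockSize,
  partitionHeightInBlocks = config.partitionHeightInBlocks,
  partitionWidthInBlocks = config.partitionWidthInBlocks)

val RsvdResults(leftSingularVectors, singularValues, rightSingularVectors) =
  RSVD.run(mat, config, sc)
```

The key knobs for your 1M x 60M case are `blockSize` and the two `partition*InBlocks` settings, which together determine the per-partition memory footprint.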

Tagar commented 5 years ago

Thanks a lot @alois-bissuel

We will definitely give this distributed Spark-RSVD library a try! Those tuning recommendations will be very helpful.