criteo / Spark-RSVD

Randomized SVD of large sparse matrices on Spark
Apache License 2.0

1M features x 60M rows? #3

Open · Tagar opened this issue 5 years ago

Tagar commented 5 years ago

Would this library scale to 1M features x 60M rows? More details on the problem here:

https://stats.stackexchange.com/questions/355260/distributed-pca-or-an-equivalent

Thank you.

alois-bissuel commented 5 years ago

If I understand correctly, the sparsity of your data is around 10%. Our library is routinely used to decompose matrices of size 100M x 100M, though much sparser ones. I see no reason why the library should not work, but you will need to tweak the parameters a bit. Adjust the block size and the number of blocks per partition so that each partition of the sparse matrix and of the dense embeddings stays under 2 GB (for the definitions of block size and number of blocks per partition, see our article on Medium), and start with a tiny embedding size (100, for instance). A configuration sketch is shown below.
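For concreteness, here is a minimal sketch of such a configuration, modeled on the usage example in the repository README (the `RSVDConfig` fields, `BlockMatrix.fromMatrixEntries`, and `RSVD.run` are taken from there; the concrete parameter values are starting points to tune, not recommendations, and the `...` placeholders stand for your own Spark context and data):

```scala
import com.criteo.rsvd._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Start with a small embedding dimension and moderate block sizes, then
// tune blockSize / partition*InBlocks so each partition stays under 2 GB.
val config = RSVDConfig(
  embeddingDim = 100,           // the "tiny" embedding size suggested above
  oversample = 30,
  powerIter = 1,
  seed = 0,
  blockSize = 50000,            // rows/cols per square block
  partitionWidthInBlocks = 35,  // blocks per partition, horizontally
  partitionHeightInBlocks = 10, // blocks per partition, vertically
  computeLeftSingularVectors = true,
  computeRightSingularVectors = true
)

val matHeight = 60000000L // 60M rows
val matWidth = 1000000L   // 1M features

val sc: SparkContext = ...          // your Spark context
val entries: RDD[MatrixEntry] = ... // your sparse (i, j, value) entries

// Pack the entries into the blocked sparse-matrix format, then run the
// randomized SVD.
val mat = BlockMatrix.fromMatrixEntries(entries, matHeight = matHeight,
  matWidth = matWidth, config.blockSize,
  partitionHeightInBlocks = config.partitionHeightInBlocks,
  partitionWidthInBlocks = config.partitionWidthInBlocks)

val RsvdResults(leftSingularVectors, singularValues, rightSingularVectors) =
  RSVD.run(mat, config, sc)
```

The key knobs for your 1M x 60M case are `blockSize` and the two `partition*InBlocks` settings, which together determine the per-partition memory footprint.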

Tagar commented 5 years ago

Thanks a lot @alois-bissuel

We will definitely give this distributed Spark-RSVD library a try! Those tuning recommendations will be very helpful.