criteo / Spark-RSVD

Randomized SVD of large sparse matrices on Spark
Apache License 2.0
77 stars 22 forks source link

norm of the discrepancy between the computed approximation UV ∗ and the matrix A being decomposed #4

Open Tagar opened 5 years ago

Tagar commented 5 years ago

Facebook did a great job on comparing their own distributed SVD implementation to stock Spark ML one.

http://tygert.com/spark.pdf

See for example this table -

image

Do you have an estimate for norm of the discrepancy between the computed approximation UV ∗ and the matrix A being decomposed?

Thank you.

Ivanopolo commented 5 years ago

Hi @Tagar! It is a very good question. We have unit-tests that validate the quality of reconstruction on toy matrices, but we do not do that for the full-scale matrice since it is not trivial (to compute it to the full matrix, you need O(m x n) operations which is usually not feasible unless you use some sort of subsampling). Internally, we use other metrics to validate the quality of extracted eigenvectors (basically, we try them for our task related to recommendation) and it satisfies us. But we would love to see some contributions related to this! Any external evaluation or a method to efficiently compute these metrics would be very much welcome. If you need any help/advice on how to better do that, just let me know.

tygert commented 5 years ago

@Ivanopolo, @Tagar is right that the accuracy of the randomized methods can be extremely finicky for large-scale matrices. http://tygert.com/spark.pdf uses the power method to ascertain the accuracy. Attaining good accuracy for nearly rank-deficient matrices (which are the rule rather than the exception in our experience at Facebook) requires care with the distributed QR. Algorithms which produce (in fact, guarantee) good accuracy are available at http://tygert.com/spark.pdf with implementations at http://github.com/hl475/svd and expository prototypes in Python at http://tygert.com/valid.tar.gz ... Thanks for the undeserved compliment, @Tagar, and for looking out for all of us in the open-source community! Facebook hopes these methods and the software (written at Yale by @hl475 before he joined Facebook) can be valuable to you. P.S. Some systems at Facebook retrofit existing codes for alternating least squares via the trick elaborated at http://tygert.com/als.pdf (admittedly inelegant, but accurate and convenient if the approach at http://tygert.com/spark.pdf would require too much work to implement).