clockwiser opened 7 years ago
The algorithm itself will not be very sensitive to the scale of the data, but the current implementation just uses the PCA function in sklearn. One can use better algorithms to compute the PCA (e.g., the randomized algorithm for PCA in sklearn).
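For concreteness, here is a minimal sketch of that approach, using sklearn's `TruncatedSVD` with the randomized solver to extract just the first component; the matrix `X` and its size are illustrative stand-ins, not code from this repo:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Stand-in for the stacked sentence embeddings: one row per sentence.
rng = np.random.default_rng(0)
X = rng.standard_normal((400_000, 300)).astype(np.float32)

# Randomized solver: cost per iteration is roughly linear in the number
# of rows when only one component is requested, so this stays cheap
# even for hundreds of thousands of sentences.
svd = TruncatedSVD(n_components=1, algorithm="randomized", n_iter=7, random_state=0)
svd.fit(X)
u = svd.components_[0]  # first singular vector (the common component)

# Remove the projection onto u from every sentence embedding.
X_sif = X - np.outer(X @ u, u)
```

Memory is dominated by `X` itself here; computing a single singular vector adds relatively little on top.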
Thank you. I tried it on big data and it seems to work.
A follow-up question:
Have you done any experiments to see how sensitive the results in the paper are to the number of sentences? Is there a rule of thumb as to how many sentences are needed to reveal c0, or is it really task/data-dependent? @YingyuLiang
@YingyuLiang The computational complexity of the sentence embeddings is fine, but how can I compute the first singular vector for millions of sentences? Can I sample some sentences and compute u from the sample? Or are there other methods?
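One way to sanity-check the sampling idea is to estimate u on a random subsample and compare it with the full-data singular vector. The sketch below is illustrative, not from the repo: the embeddings are synthetic with a planted common direction standing in for c0, and the sample size is an arbitrary choice.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def first_singular_vector(mat):
    # Top right-singular vector via randomized SVD (no mean-centering).
    svd = TruncatedSVD(n_components=1, n_iter=7, random_state=0)
    svd.fit(mat)
    return svd.components_[0]

rng = np.random.default_rng(0)
n, d = 200_000, 300

# Synthetic embeddings with a planted common direction, so the top
# singular vector is well separated, as it tends to be for real
# SIF-style weighted averages.
c0 = rng.standard_normal(d).astype(np.float32)
c0 /= np.linalg.norm(c0)
X = rng.standard_normal((n, d)).astype(np.float32)
X += 5.0 * rng.random(n, dtype=np.float32)[:, None] * c0

# Estimate u from a 20k-row subsample and from the full matrix.
idx = rng.choice(n, size=20_000, replace=False)
u_sample = first_singular_vector(X[idx])
u_full = first_singular_vector(X)

# Singular vectors are defined only up to sign, so compare |cosine|.
print(f"agreement: {abs(float(u_sample @ u_full)):.4f}")  # ~1.0 here
```

If the agreement stays near 1 across a few random subsamples of your real data, estimating u on a sample and projecting it out of all the embeddings should be safe.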
In the example, MSRpar2012 has only 750 sentences. The method works fine with a small volume of data, but for big data, say around 400,000 sentences, the PCA computation could become a real problem. Can SIF handle big data?