PrincetonML / SIF

sentence embedding by Smooth Inverse Frequency weighting scheme
MIT License

number of sentences #4

Open clockwiser opened 7 years ago

clockwiser commented 7 years ago

In the example, MSRpar2012 has only 750 lines of sentences. The theory works fine with a small volume of data, but for big data, for example 400,000 sentences, computing the PCA could be a serious bottleneck. Can SIF handle big data?

YingyuLiang commented 7 years ago

The algorithm itself is not very sensitive to the scale of the data, but the current implementation just uses the PCA function in sklearn. One can use better algorithms to compute the PCA (e.g., the randomized PCA algorithm in sklearn).
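A minimal sketch of what that swap could look like: sklearn's `PCA` accepts `svd_solver="randomized"`, which scales better than a full decomposition on large embedding matrices. The array `embeddings` and its size below are placeholders, not from the SIF code.

```python
# Sketch: common-component removal using sklearn's randomized SVD solver.
# `embeddings` is a synthetic stand-in for a large matrix of SIF-weighted
# sentence vectors (rows = sentences, columns = embedding dimensions).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((10_000, 50))  # placeholder sentence vectors

# svd_solver="randomized" avoids a full eigendecomposition
pca = PCA(n_components=1, svd_solver="randomized", random_state=0)
pca.fit(embeddings)
u = pca.components_  # first principal direction, shape (1, 50)

# SIF-style step: subtract each vector's projection onto u
embeddings_sif = embeddings - embeddings @ u.T @ u
```

After the subtraction, every row is orthogonal to `u`, which is the common-component removal the paper describes.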

clockwiser commented 7 years ago

Thank you. I tried it on big data and it seems to work.

yg37 commented 7 years ago

A follow-up question: Have you done any experiments to see how sensitive the results in the paper are to the number of sentences? Is there a rule of thumb as to how many sentences are needed to reveal c0 or is it really task/data-dependent? @YingyuLiang

qingyuanxingsi commented 6 years ago

@YingyuLiang The computational complexity of the sentence embedding step is fine, but how can I compute the first singular vector for millions of sentences? Can I sample some sentences and compute u from the sample? Or is there some other method?
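One way to sanity-check the sampling idea, sketched below with synthetic data (all names here are illustrative, not part of the SIF API): plant a common direction shared by all embeddings, estimate the top singular vector from a 1% sample, and verify it still recovers that direction well enough to remove it from the full matrix.

```python
# Sketch: estimate the top singular vector u from a random sample of rows,
# then remove that component from every embedding. Synthetic data with a
# planted common direction stands in for real sentence vectors.
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
n, d = 200_000, 50
common = rng.standard_normal(d)
common /= np.linalg.norm(common)
X = rng.standard_normal((n, d)) + 5.0 * common  # every row shares `common`

# Fit on a 1% sample instead of all n rows
idx = rng.choice(n, size=n // 100, replace=False)
svd = TruncatedSVD(n_components=1, n_iter=7, random_state=0)
svd.fit(X[idx])
u = svd.components_  # estimated top singular vector, shape (1, d)

X_sif = X - X @ u.T @ u  # remove the estimated common component
```

When the common component is strong, a small uniform sample is usually enough for the estimate to align closely with the true direction; how small you can go on real data is an empirical question.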