fh295 / SentenceRepresentation

the corpus unavailable #3

Open redreamality opened 7 years ago

redreamality commented 7 years ago

http://www.cs.toronto.edu/~mbweb/ seems to be down at the moment. Are there any mirrors?

agent-jay commented 7 years ago

I managed to get hold of the dataset after emailing the authors of the paper, and I received two files: books_large_p1.txt and books_large_p2.txt. The code, however, refers to a books_large_70m.txt. Is that just the result of concatenating the two files? I'm trying to reproduce the results of the paper...

fh295 commented 7 years ago

Aha yes, glad that you managed to find the corpus.

To get the 70m file, I concatenated the two files and took the first 70m lines of the concatenated file. I had intended to save the rest for some new evaluations, but in the end I never did. Let me know if you need further clarification!
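
For anyone reproducing this, the step described above could be sketched in Python roughly as follows. This is a minimal sketch, not the script that was originally used. It assumes that "70m" means 70,000,000 lines and that books_large_p1.txt comes before books_large_p2.txt in the concatenation; `build_subset` is a hypothetical helper name.

```python
from itertools import islice


def build_subset(parts, out_path, max_lines=70_000_000):
    """Concatenate `parts` in order, keeping only the first `max_lines` lines.

    Assumption: "70m" means 70,000,000 lines. Files are read and written
    in binary mode so the text encoding is passed through untouched.
    Returns the number of lines written.
    """
    written = 0
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Take only as many lines as are still needed.
                for line in islice(src, max_lines - written):
                    out.write(line)
                    written += 1
            if written >= max_lines:
                break
    return written


# Hypothetical usage, assuming p1 precedes p2:
# build_subset(["books_large_p1.txt", "books_large_p2.txt"],
#              "books_large_70m.txt")
```

The files are streamed line by line, so the whole corpus never has to fit in memory. If the two parts together hold fewer than `max_lines` lines, the output is simply the full concatenation.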

--

Felix Hill University of Cambridge fh295@cam.ac.uk

http://www.cl.cam.ac.uk/~fh295/

agent-jay commented 7 years ago

Gotcha. Thanks!

1024er commented 5 years ago

Could you please share the dataset with me? Thank you. wuxing@iie.ac.cn