fnielsen / dasem

Danish Semantic analysis
Apache License 2.0

How to get the /dasem_data/gutenberg/word2vec.pkl.gz #7

Closed bipsen closed 5 years ago

bipsen commented 5 years ago

I have downloaded the Gutenberg data as described in dasem/gutenberg.py. When I run the following, however, I get an error.

from dasem.gutenberg import Word2Vec
Word2Vec()

gives me:

FileNotFoundError: [Errno 2] No such file or directory: '/home/USER/dasem_data/gutenberg/word2vec.pkl.gz'

I am unsure how to get the word2vec.pkl.gz file. Can I download the model from somewhere, or is the idea that I train the model myself using gensim? Sorry for my inexperience with dasem, gensim, and word2vec in general. Thank you for your great work.
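(For context, here is a minimal sketch of what the constructor presumably attempts. The path comes from the traceback above; the gzip/pickle handling is an assumption based on the file extension, not dasem's actual code.)

import gzip
import os
import pickle

# Word2Vec() presumably tries to load a previously trained, pickled model;
# if no model has been trained yet, this open() raises FileNotFoundError.
path = os.path.expanduser('~/dasem_data/gutenberg/word2vec.pkl.gz')
with gzip.open(path, 'rb') as fid:
    model = pickle.load(fid)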

fnielsen commented 5 years ago

I am sorry. The error message certainly needs to be improved.

It is necessary to download the Gutenberg corpus yourself, and there are some restrictions on this: https://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages

You can see a bit of explanation with:

python -m dasem.gutenberg --help

You can download the corpus with:

python -m dasem.gutenberg download

The wget program is required for this step.
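(The Gutenberg page linked above describes robot access via wget mirroring, so something along these lines, restricted to Danish plain-text files, is presumably what the download command runs under the hood. The exact flags are an assumption; see dasem/gutenberg.py for the real invocation.)

# mirror Danish plain-text ebooks (flags are an assumption; see dasem/gutenberg.py)
wget -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=da"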

bipsen commented 5 years ago

Thank you for your quick reply! I'm sorry if I wasn't clear. I've already run those commands, and I have two folders in my dasem_data/gutenberg directory called aleph.gutenberg.org and www.gutenberg.org containing the ebooks. My problem is that I don't have the word2vec.pkl.gz file - how do I get that?

fnielsen commented 5 years ago

I see your problem. I have attempted to fix it in a new version of dasem. If you clone the new version, you should be able to get it to work with:

python -m dasem.gutenberg train-and-save-word2vec

and then:

python -m dasem.gutenberg most-similar kvinde
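Once trained, the model should also be usable from Python (a sketch, assuming the Word2Vec wrapper exposes a most_similar method matching the CLI subcommand above):

from dasem.gutenberg import Word2Vec

w2v = Word2Vec()                   # now finds ~/dasem_data/gutenberg/word2vec.pkl.gz
print(w2v.most_similar('kvinde'))  # words most similar to 'kvinde' ('woman')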

bipsen commented 5 years ago

Thank you, it all works now! Looking forward to using dasem.

fnielsen commented 5 years ago

The Gutenberg corpus is not that big, so a model trained on it alone is not that good. It may perform poorly on semantic tasks; see our paper "Open semantic analysis: The case of word level semantics in Danish": http://www2.compute.dtu.dk/pubdb/views/edoc_download.php/7029/pdf/imm7029.pdf

bipsen commented 5 years ago

Right, I realize that. I have read your paper. Is there a way to get the aggregate model trained on all five corpora? Or is there something else you would recommend? Once again, thank you so much for your help; it is much appreciated!

Edit: I have opened another issue to deal with the LCC/Wikipedia stuff, in case it might be helpful for other people in the future. See https://github.com/fnielsen/dasem/issues/8. Hope that's ok.

Edit2: I am still interested in hearing about a possible aggregate model, though!

fnielsen commented 5 years ago

"Is there a way to get the aggregate model trained all five corpora?"

The "Fullmonty" model is collecting the corpora.