bipsen closed this issue 5 years ago
I am sorry. The error message certainly needs to be improved.
It is necessary to download the Gutenberg corpus yourself, and there are some restrictions on this: https://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages
You can see a bit of explanation with
python -m dasem.gutenberg --help
You can download the corpus with:
python -m dasem.gutenberg download
The wget program is required for this step.
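Since the download step shells out to the external wget program, it may help to check that wget is actually available before running it. A minimal sketch (the check itself is an assumption about how to diagnose the setup, not part of dasem):

```python
import shutil

# The dasem download step relies on the external wget program,
# so verify it is on PATH before running the download command.
if shutil.which("wget") is None:
    print("wget not found -- install it before running "
          "'python -m dasem.gutenberg download'")
else:
    print("wget found at:", shutil.which("wget"))
```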
Thank you for your quick reply! I'm sorry if I wasn't clear. I've already run those commands, and I have two folders in my dasem_data/gutenberg directory, called aleph.gutenberg.org and www.gutenberg.org, containing the ebooks. My problem is that I don't have the word2vec.pkl.gz file - how do I get that?
I see your problem. I have attempted to fix it in a new version of dasem. If you clone the new version, you might be able to get it to work with:
python -m dasem.gutenberg train-and-save-word2vec
and then:
python -m dasem.gutenberg most-similar kvinde
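The `most-similar` subcommand presumably ranks vocabulary words by cosine similarity to the query word in the trained word2vec space. A self-contained toy sketch of that lookup (the vectors below are invented for illustration, not taken from the real model):

```python
import math

# Toy word vectors standing in for a trained word2vec model
# (the values are invented for illustration only).
vectors = {
    "kvinde": [0.9, 0.1, 0.2],
    "mand":   [0.8, 0.2, 0.1],
    "hus":    [0.1, 0.9, 0.7],
    "bil":    [0.0, 0.8, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(x * x for x in b)))
    return dot / norm

def most_similar(word, topn=3):
    """Return the topn other words ranked by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("kvinde"))  # "mand" ranks highest for these toy vectors
```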
Thank you, it all works now! Looking forward to using dasem.
The Gutenberg corpus is not that big, so not that good on its own. It may perform poorly on semantic tasks, see our paper "Open semantic analysis: The case of word level semantics in Danish" http://www2.compute.dtu.dk/pubdb/views/edoc_download.php/7029/pdf/imm7029.pdf
Right, I realize that. I have read your paper. Is there a way to get the aggregate model trained on all five corpora? Or is there something else you would recommend? Once again, thank you so much for your help, much appreciated!
Edit: I have opened another issue to deal with the LCC/Wikipedia stuff, in case it might be helpful for other people in the future. See https://github.com/fnielsen/dasem/issues/8. Hope that's ok. Edit2: I am still interested in hearing about a possible aggregate model, though!
"Is there a way to get the aggregate model trained all five corpora?"
The "Fullmonty" model is collecting the corpora.
I have downloaded the Gutenberg data as described in dasem/gutenberg.py. When I run the following, however, it gives me an error:
FileNotFoundError: [Errno 2] No such file or directory: '/home/USER/dasem_data/gutenberg/word2vec.pkl.gz'
I am unsure of how to get the word2vec.pkl.gz file. Can I download the model from somewhere, or is the idea that I train the model myself using gensim? Sorry for my inexperience with dasem, gensim, and word2vec in general. Thank you for your great work.
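A small guard like the following makes the missing-model case easier to diagnose; the `word2vec.pkl.gz` path is taken from the traceback above, and the suggested training command comes from later in this thread:

```python
from pathlib import Path

# Path from the traceback; expanduser() resolves "~" to the home directory.
model_path = Path("~/dasem_data/gutenberg/word2vec.pkl.gz").expanduser()

if not model_path.exists():
    # The model file is produced by training, not by the corpus download,
    # so point the user at the training step instead of failing opaquely.
    print("Model not found at", model_path)
    print("Run: python -m dasem.gutenberg train-and-save-word2vec")
else:
    print("Model found at", model_path)
```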