I "futurized" this to run on Python 3

pengowray commented 6 years ago

I've just used futurize and slightly modified the docs, so nothing too fancy. But it now runs on Python 3. It presumably still works on Python 2 too, but I haven't tested that. Feel free to merge if you want or leave it forked.

Commands used:

pip install future
futurize --stage1 -w word2vec-api/*.py
futurize --stage2 -w word2vec-api/*.py
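For anyone unfamiliar with futurize, here is a rough sketch of the style of code it aims for: a `__future__` import so `print()` behaves the same on both interpreters, and the `except ... as ...` form that both Python 2.6+ and Python 3 accept. The `parse_port` helper is hypothetical and is not taken from word2vec-api; it only illustrates the kind of rewrite involved.

```python
from __future__ import print_function  # no effect on Python 3; enables print() on Python 2


def parse_port(text, default=5000):
    """Hypothetical helper: parse a port number, falling back to a default."""
    try:
        return int(text)
    except ValueError as err:  # the old "except ValueError, err" form is Python 2 only
        print("Bad port %r (%s); using %d" % (text, err, default))
        return default
```

Calling `parse_port("8080")` returns the integer 8080, while `parse_port("abc")` prints a warning and returns the default.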

edit: and now a bunch of other fixes/changes to make it work, notably adding a --norm clobber setting detailed below.

pengowray commented 6 years ago

Memory leak?

I've additionally upgraded gensim and flask to the latest versions (787dae1), but since doing this it seems to sometimes use excessive RAM for /most_similar queries and I don't know why. There are reports of gensim using too much RAM when no "positive" argument is given, but I don't think that was the case here. Can anyone help? Previously the RAM usage was about the same as the size of the binary word2vec file, but now it uses more than double, reaching the limits of my test machine. Smaller word2vec files work fine.

pengowray commented 6 years ago

So it wasn't a memory leak, just a doubling of the memory required. I've added a "--norm clobber" option to mitigate the issue.

In the current version of gensim (3.6.0), the first call to model.most_similar() will generate unit-length normalized versions of every vector in the entire model, creating a lengthy delay the first time a /most_similar request is made and effectively doubling the memory required. I guess this didn't happen in the previous gensim used by this project (gensim 0.12.3).
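To illustrate why normalizing can double memory, here is a minimal sketch using plain Python lists. gensim itself works on NumPy arrays, and the function names below are made up for this example rather than taken from gensim or this PR. Building a second table of unit-length vectors keeps the originals but needs roughly twice the storage; overwriting the rows in place avoids that at the cost of losing the original vectors.

```python
import math


def unit(vec):
    """Return a unit-length (L2-normalized) copy of vec."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # leave an all-zero vector unchanged
    return [x / norm for x in vec]


def normalize_keep_both(vectors):
    """Build a second table of unit vectors; the originals stay untouched."""
    return [unit(v) for v in vectors]


def normalize_clobber(vectors):
    """Overwrite each row with its unit-length version; the originals are lost."""
    for i, v in enumerate(vectors):
        vectors[i] = unit(v)
    return vectors


def cosine(u, v):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(a * b for a, b in zip(u, v))
```

Once vectors are unit length, cosine similarity reduces to a dot product, which is why a similarity query wants the normalized table in the first place.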

I've added a --norm command line argument to let the user specify how to handle this issue:

--norm clobber: Replace loaded vectors with normalized versions. Saves a lot of memory if the original vectors aren't needed.
--norm both (default): Preserve the original vectors (but double the memory requirement).
--norm already: Treat model as already normalized.
--norm disable: Disable 'most_similar' queries and do not normalize vectors.
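As a sketch of how an option like this could be wired up with argparse (this is not the actual code from the PR; `build_parser` and `most_similar_enabled` are hypothetical names), the four modes map naturally onto a `choices` list with "both" as the default:

```python
import argparse

# The four modes described above.
NORM_MODES = ("clobber", "both", "already", "disable")


def build_parser():
    """Hypothetical parser exposing only the --norm option."""
    parser = argparse.ArgumentParser(description="word2vec API server (sketch)")
    parser.add_argument(
        "--norm",
        choices=NORM_MODES,
        default="both",
        help="how to handle unit-length normalization of the loaded vectors",
    )
    return parser


def most_similar_enabled(norm_mode):
    """Only the 'disable' mode turns off most_similar queries."""
    return norm_mode != "disable"
```

With `choices`, argparse rejects any value outside the four modes, so a typo on the command line fails at startup instead of surfacing later as odd query behavior.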

I've made "both" the default as it gives the least surprising behavior and keeps the API most consistent.

Here's a short explanation of gensim memory/load time issues, and a longer one, which I found useful, both by user gojomo.

I haven't added a way to specify a limit or mmap. I don't have a need for these and I've already added a good chunk of code to this project with this simple setting.

3Top / word2vec-api

I "futurized" this to run on Python 3 #26