ArnaudFerre / CONTES

CONcept-TErm System
Apache License 2.0

Inefficient recalculation of word embeddings format? #7

Open · ArnaudFerre opened this issue 5 years ago

ArnaudFerre commented 5 years ago

This line converts a binary model (from Gensim) into a text file. I think we need to adapt all the other functions (main_train.train(), word2term.wordVST2TermVST(), and thus all the word2term.py functions) to use the Gensim format directly.

https://github.com/ArnaudFerre/CONTES/blob/3eae963bbe3b1cf8c20b685d7beed115feb72be2/module_train/main_train.py#L172
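For illustration, here is a minimal sketch (not the actual CONTES API; the function names and the use of Gensim's KeyedVectors loader are assumptions) of how the downstream functions could consume the Gensim model directly instead of a text/JSON dump:

```python
# Hypothetical sketch: load the trained embeddings once with Gensim and pass
# the KeyedVectors object around, instead of writing the word vectors to a
# text/JSON file and re-reading them later.
from gensim.models import KeyedVectors

def load_word_vst(path, binary=True):
    # Reads a word2vec-format model; binary=True skips the text conversion.
    return KeyedVectors.load_word2vec_format(path, binary=binary)

def get_word_vector(word_vst, word):
    # KeyedVectors supports dict-like access, so a function such as
    # wordVST2TermVST() could use it in place of a {word: vector} dict.
    return word_vst[word] if word in word_vst else None
```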

rbossy commented 5 years ago

I suggest replacing the JSON format with the binary format in the experiments workflow first, without touching the code yet.

Then, if and only if there is no significant performance improvement, fix this issue. I suspect that it won't be necessary because the bottleneck is actually reading a big JSON file.
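To check that hypothesis before changing any code, a rough timing sketch along these lines could be used (the file names here are hypothetical placeholders, not paths from the repository):

```python
# Rough timing comparison: does reading the big JSON file or loading the
# binary Gensim model dominate the loading time?
import json
import time

from gensim.models import KeyedVectors

start = time.time()
with open("wordVST.json") as f:  # hypothetical JSON dump of the word vectors
    vst_json = json.load(f)
print("JSON load: %.1f s" % (time.time() - start))

start = time.time()
vst_bin = KeyedVectors.load_word2vec_format("model.bin", binary=True)
print("Binary load: %.1f s" % (time.time() - start))
```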

@ArnaudFerre

ArnaudFerre commented 4 years ago

Indeed, the operation seems relatively fast, but it uses a significant amount of RAM. It is perhaps not a priority, but here are some complementary solutions: