Open ArnaudFerre opened 5 years ago
I suggest first replacing the JSON format with the binary format in the experiments workflow, without touching the code yet.
Then, if and only if there is no significant performance improvement, fix this issue. I suspect that it won't be necessary, because the bottleneck is actually reading a big JSON file.
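A quick way to check this before changing anything is to time both loads on comparable data. The snippet below is only a rough sketch: the `vectors` dictionary, the file names, and the `time_load` helper are hypothetical, and `pickle` is used as a stand-in for "a binary format" (it is not the Gensim format discussed in this issue).

```python
import json
import pickle
import tempfile
import time
from pathlib import Path


def time_load(path, loader, mode):
    """Return (loaded object, seconds spent loading it)."""
    start = time.perf_counter()
    with open(path, mode) as handle:
        data = loader(handle)
    return data, time.perf_counter() - start


# Hypothetical stand-in for the word vectors: 2,000 words x 100 dimensions.
vectors = {f"word{i}": [i * 0.001 + j for j in range(100)] for i in range(2000)}

with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "vectors.json"
    bin_path = Path(tmp) / "vectors.bin"

    # Write the same data once as JSON text and once as a binary pickle.
    with open(json_path, "w") as handle:
        json.dump(vectors, handle)
    with open(bin_path, "wb") as handle:
        pickle.dump(vectors, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Time only the read side, since reading is the suspected bottleneck.
    from_json, json_seconds = time_load(json_path, json.load, "r")
    from_bin, bin_seconds = time_load(bin_path, pickle.load, "rb")

print(f"JSON load:   {json_seconds:.4f} s")
print(f"binary load: {bin_seconds:.4f} s")
```

The absolute numbers depend on the machine and the data size, so it is worth running this on a file of realistic size before drawing conclusions.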
@ArnaudFerre
Indeed, the operation seems relatively fast, but it uses a significant amount of RAM. It is perhaps not a priority, but here are some complementary solutions:
- `del` the bigger variables when they are no longer used (or check that we rewrite these files in the same file)
- `tolist`, which could be faster (https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.tolist.html).
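The two suggestions above (`del` and `tolist`) could look roughly like this. It is a minimal sketch, assuming the data is a NumPy array that ends up serialized as JSON; `matrix`, `rows_by_loop`, and `rows_by_tolist` are hypothetical names, not functions from the repository.

```python
import json

import numpy as np


def rows_by_loop(matrix):
    """Convert a 2-D array to nested lists one element at a time."""
    return [[float(value) for value in row] for row in matrix]


def rows_by_tolist(matrix):
    """Same conversion done in a single call by NumPy."""
    return matrix.tolist()


# Hypothetical stand-in for a large embedding matrix.
matrix = np.arange(12, dtype=np.float64).reshape(4, 3)

# Both approaches produce the same nested lists of Python floats.
same = rows_by_loop(matrix) == rows_by_tolist(matrix)

payload = json.dumps(rows_by_tolist(matrix))

# Remove the name once the array is no longer needed; the memory can be
# reclaimed as soon as no other reference to the array remains.
del matrix
```

Note that `del` only removes the name. If another variable still refers to the same array, the memory is not released until that reference is gone as well.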
Assigned to @ArnaudFerre
This line converts a binary model (from Gensim) to a text file. I think that we need to adapt all the other functions (main_train.train(), word2term.wordVST2TermVST(), and therefore all the word2term.py functions) to use the Gensim format.
https://github.com/ArnaudFerre/CONTES/blob/3eae963bbe3b1cf8c20b685d7beed115feb72be2/module_train/main_train.py#L172