sadedegel/bblock/cli/__main__.py
The gensim training interface needed updates to certain argument names.
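The PR does not list which argument names changed. As a rough illustration only, the sketch below assumes the renames are the ones from the gensim 3.x to 4.x migration (`size` to `vector_size`, `iter` to `epochs`). The helper `to_gensim4_kwargs` and the `RENAMED_KWARGS` table are hypothetical and are not part of sadedegel.

```python
# Assumed renames from the gensim 3.x -> 4.x migration; not confirmed by this PR.
RENAMED_KWARGS = {"size": "vector_size", "iter": "epochs"}


def to_gensim4_kwargs(kwargs):
    """Return a copy of kwargs with legacy argument names replaced by the newer ones."""
    updated = {}
    for name, value in kwargs.items():
        new_name = RENAMED_KWARGS.get(name, name)
        if new_name in updated:
            # Both the old and the new spelling were supplied; refuse to guess.
            raise ValueError(f"argument {new_name!r} given twice")
        updated[new_name] = value
    return updated
```

Under that assumption, the returned dictionary could be unpacked into the training call, for example `Word2Vec(sentences, **to_gensim4_kwargs(options))`.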
sadedegel/bblock/utils.py
tr_lower receives Token objects while the vocabulary is being built. Handle the Token case in both tr_lower and tr_upper.
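A minimal sketch of the idea, not the actual sadedegel code: both helpers first coerce their input to a string, so a Token-like object and a plain `str` take the same path. The `Token` class below is a hypothetical stand-in that only exposes its text through `__str__`, and `_as_text` is an illustrative helper name.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for a Token object that carries its surface form."""
    word: str

    def __str__(self):
        return self.word


def _as_text(value):
    # Accept either a plain string or any object whose str() is its surface form.
    return value if isinstance(value, str) else str(value)


def tr_lower(value):
    # Turkish casing: dotless I -> ı and dotted İ -> i, handled before str.lower().
    text = _as_text(value)
    return text.replace("I", "ı").replace("İ", "i").lower()


def tr_upper(value):
    # Turkish casing: dotted i -> İ and dotless ı -> I, handled before str.upper().
    text = _as_text(value)
    return text.replace("i", "İ").replace("ı", "I").upper()
```

The Turkish-specific replacements run before the built-in case conversion because `str.lower()` and `str.upper()` alone do not map `I`/`ı` and `İ`/`i` the way Turkish requires.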
sadedegel/bblock/vocabulary.py
Fix gensim-related argument and instance method names.
Fewer tokens have vectors than there are unique tokens, so the two sets are stored in separate groups in the h5py vocabulary dumps and are indexed independently. When the has_vector attribute of a Token instance was accessed, it queried the has_vector group with an index taken from the word group. As a result, the wrong boolean value was returned to the user.
I fixed this by keeping separate id2feat mappings (and their inverses) for all tokens and for tokens that have a vector.
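The fix described above can be sketched as follows. This is an illustration under assumptions, not the sadedegel implementation: the class name `VocabularySketch` and its attribute names are hypothetical, and plain Python lists stand in for the h5py groups.

```python
class VocabularySketch:
    """Keeps one index space for all tokens and a second one for tokens with vectors."""

    def __init__(self, words, vectors):
        # words: every unique token; vectors: dict mapping a subset of words to vectors.
        self.id2word = list(words)
        self.word2id = {w: i for i, w in enumerate(self.id2word)}

        # Second, smaller index space that covers only tokens with a vector.
        self.id2word_vec = [w for w in self.id2word if w in vectors]
        self.word2id_vec = {w: i for i, w in enumerate(self.id2word_vec)}
        self.vector_rows = [vectors[w] for w in self.id2word_vec]

    def has_vector(self, word):
        # Look the word up in the vector index space, never via its id in the full one.
        return word in self.word2id_vec

    def vector(self, word):
        # Row position comes from the vector index space, which may differ from word2id.
        return self.vector_rows[self.word2id_vec[word]]
```

With words `["a", "b", "c"]` and vectors only for `a` and `c`, the token `c` has id 2 among all tokens but row 1 among the vectors, which is the kind of mismatch that caused the wrong has_vector result.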
tests/token/test_vectors.py
Implement tests for indices and word-id matching for both mappers.
Implement a test for vector access.
README.md
Update with a description of vocabulary dumps and w2v training.