Feature/w2v fix

sadedegel/bblock/cli/__main__.py

sadedegel/bblock/utils.py

tr_lower receives Token objects while the vocabulary is being built. Handle Token inputs in the tr_lower and tr_upper methods.
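As a rough illustration of the change above, the sketch below coerces Token-like inputs to text before applying Turkish-aware casing. The Token class here is a hypothetical stand-in with an assumed word attribute, and the function bodies are assumptions for illustration, not the actual code in sadedegel/bblock/utils.py.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for the library's Token; only `word` is assumed."""
    word: str

    def __str__(self) -> str:
        return self.word


def _as_text(s) -> str:
    # Accept plain strings as-is and fall back to str() for Token-like objects.
    return s if isinstance(s, str) else str(s)


def tr_lower(s) -> str:
    # Map the Turkish dotless/dotted capital I explicitly before lowercasing,
    # because str.lower() alone does not follow Turkish casing rules.
    text = _as_text(s)
    return text.replace("I", "ı").replace("İ", "i").lower()


def tr_upper(s) -> str:
    # Mirror image of tr_lower: map the lowercase i variants before uppercasing.
    text = _as_text(s)
    return text.replace("ı", "I").replace("i", "İ").upper()
```

With this shape, tr_lower(Token("IŞIK")) and tr_lower("IŞIK") go through the same code path, so callers that build the vocabulary do not need to unwrap tokens first.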

sadedegel/bblock/vocabulary.py

Fix gensim-related argument and instance method names.
The number of tokens that have vectors is smaller than the total number of unique tokens. The two sets are stored in separate groups in the h5py vocabulary dumps. I noticed that when the has_vector attribute of a Token instance is accessed, it queries the has_vector group with an index taken from the word group. As a result, the wrong boolean value is returned to the user.
I fixed this by keeping separate id2feat mappings (and their inverses) for all tokens and for tokens that have vectors.
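The idea behind the fix can be sketched with plain Python containers in place of the h5py groups. All names and data below (all_tokens, vector_tokens, vec_feat2id, and so on) are hypothetical and chosen for illustration; they are not taken from sadedegel/bblock/vocabulary.py.

```python
# Every unique token gets an id in the "word" index space.
all_tokens = ["ev", "araba", "kitap", "kalem"]

# Only a subset has vectors; it lives in its own, shorter index space.
vector_tokens = ["araba", "kalem"]
vectors = [[0.1, 0.2], [0.3, 0.4]]  # row i belongs to vector_tokens[i]

# Separate id <-> feature mappings for each index space.
id2feat = dict(enumerate(all_tokens))
feat2id = {tok: i for i, tok in id2feat.items()}
vec_id2feat = dict(enumerate(vector_tokens))
vec_feat2id = {tok: i for i, tok in vec_id2feat.items()}


def has_vector(token: str) -> bool:
    # Membership is decided in the vector index space only.
    return token in vec_feat2id


def vector_of(token: str) -> list:
    # Look up the row with the vector-space id, never with the word-space id.
    if not has_vector(token):
        raise KeyError(f"no vector stored for {token!r}")
    return vectors[vec_feat2id[token]]
```

In this toy data, "kalem" has id 3 among all tokens but id 1 among tokens with vectors, so using the word-space id against the smaller group would read the wrong entry or run past its end.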

tests/token/test_vectors.py

README.md

GlobalMaksimum / sadedegel