Open q0o0p opened 4 years ago
Hi q0o0p,
Thank you for looking closely at the code. I must confess I wrote it quite some time ago, and it was not intended to be shared at first; I just realized afterwards that the project could be useful to some people. That's why it's quite a mess, and it has some inconsistencies between PyPI and GitHub. So congrats for inspecting it anyway!
I read your comment and took a look: I agree with everything you said, thank you for pointing that out!
This might be a correction:

```python
from marisa_trie import Trie
from sklearn.feature_extraction.text import TfidfVectorizer


class MarisaTfidfVectorizer(TfidfVectorizer):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def _freeze_vocabulary(self):
        if not self.fixed_vocabulary_:
            self.vocabulary_ = Trie(self.vocabulary_.keys())
            self.fixed_vocabulary_ = True
            del self.stop_words_

    def fit(self, raw_documents, y=None):
        super().fit(raw_documents, y)
        self._freeze_vocabulary()
        return self

    def fit_transform(self, raw_documents, y=None):
        self.fit(raw_documents, y)
        return self.transform(raw_documents, copy=True)

    def get_params(self, deep=True):
        """Overrides get_params to get a correct __repr__"""
        params_ = {param: value for param, value in self.__dict__.items()
                   if param.strip('_') == param}
        return {**self._tfidf.get_params(), **params_}


if __name__ == '__main__':
    import joblib
    from string import ascii_lowercase
    from numpy.random import RandomState

    _NDOCS = 5_000
    _LETTERS = [*ascii_lowercase]
    prng = RandomState(seed=2020)

    def make_random_sentence() -> str:
        return ' '.join(''.join(prng.choice(_LETTERS, prng.randint(1, 10)))
                        for _ in range(prng.randint(1, 30)))

    docs = [make_random_sentence() for _ in range(_NDOCS)]

    tfidf = TfidfVectorizer(ngram_range=(1, 3))
    print(repr(tfidf.fit_transform(docs)))
    # >> <5000x178662 sparse matrix of type '<class 'numpy.float64'>'
    # >> with 188411 stored elements in Compressed Sparse Row format>
    joblib.dump(tfidf, 'base.joblib')

    tfidf = MarisaTfidfVectorizer(ngram_range=(1, 3))
    print(repr(tfidf.fit_transform(docs)))
    # >> <5000x178662 sparse matrix of type '<class 'numpy.float64'>'
    # >> with 188411 stored elements in Compressed Sparse Row format>
    joblib.dump(tfidf, 'marisa.joblib')
```
If you run the code above, you will see that this time both vectorizers have ngram_range set to (1, 3) and output sparse matrices with the same shape and number of non-zero elements.
When they are serialized, base.joblib is 12.4 MB and marisa.joblib is 3.6 MB, which represents a ~3.4x compression (but the compression power depends on the size of the vocabulary, so this is not a fixed compression boost).
I also gave it a try on a real-world dataset: the IMDB sentiment analysis dataset. I trained a basic pipeline with `TfidfVectorizer(ngram_range=(1, 3))` and with `MarisaTfidfVectorizer(ngram_range=(1, 3))`, each followed by `MultinomialNB()`, which gave ~87% and ~85% accuracy respectively (that is, a 2% decrease in perfs).
Thank you for the response!
Now initialization looks good.
Though, if I understand correctly, the accuracy shouldn't change, because we don't change the logic of the vectorizer, only the format of its stored inner details.
If you fit it on a small example (3 letters, 5 docs, 5 words per doc) and print `self.vocabulary_.items()` in `_freeze_vocabulary` before and after the `self.vocabulary_ = Trie(self.vocabulary_.keys())` instruction, you can see something like this:

```
vocab before: [('ab', 0), ('abbbcbc', 1), ('babbbbb', 2), ('bac', 3), ('bccccb', 4), ('cabb', 5), ('cbabaaacb', 6), ('cbacbcba', 7), ('ccaca', 8), ('ccacbccaa', 9), ('cccac', 10), ('cccbaaaa', 11)]
vocab after:  [('ab', 0), ('abbbcbc', 3), ('babbbbb', 6), ('bac', 7), ('bccccb', 2), ('cabb', 1), ('cbabaaacb', 4), ('cbacbcba', 5), ('ccaca', 8), ('ccacbccaa', 9), ('cccac', 10), ('cccbaaaa', 11)]
```
I.e. some key-value pairs have changed. As a result, in `transform` some words will be used with the wrong tfidf coefficients.
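The consequence can be sketched without sklearn or marisa-trie. In this made-up example (the vocabularies, indices, and idf values are purely illustrative), the idf weights were computed under the original word-to-index mapping, so a lookup through a re-indexed vocabulary silently pairs a word with another word's weight:

```python
# idf weights computed while the original mapping was in effect
old_vocab = {'ab': 0, 'abbbcbc': 1, 'babbbbb': 2}
idf = {0: 1.10, 1: 2.50, 2: 3.30}   # idf[index], built from old_vocab

# A trie built from keys alone assigns its own ids; here the ids are
# permuted by hand just to force a mismatch (marisa-trie's actual ids
# depend on its internals).
new_vocab = {'ab': 0, 'abbbcbc': 2, 'babbbbb': 1}

word = 'abbbcbc'
print(idf[old_vocab[word]])  # 2.5 -> the weight that was computed for this word
print(idf[new_vocab[word]])  # 3.3 -> another word's weight, used silently
```

Nothing raises here, which is why the bug only shows up as a small accuracy drop rather than an error.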
In your original version of `MarisaTfidfVectorizer` (that is, the current version in the repo), in the method `fit` you run `super(MarisaTfidfVectorizer, self).fit(raw_documents, y)` again after the Trie transformation of the vocabulary, which I think is correct.
I tried what you said, and here are my observations and thoughts.
If you call `super(MarisaTfidfVectorizer, self).fit(raw_documents, y)` again, the Trie version of the vocabulary stored in `self.vocabulary_` will be ignored, because even if we set the `self.fixed_vocabulary_` attribute to True, the sklearn TfidfVectorizer constructor expects a `vocabulary` parameter if the user wants to specify a precomputed vocabulary. Without that parameter, the TfidfVectorizer will by default compute the vocabulary from the corpus again and store it as a regular Python dict.
So if you do this, you indeed get the exact same representation of the tfidf weights in `transform`, but that's because you actually get the same TfidfVectorizer object. As a consequence, the marisa part is ignored/useless, and you get a model with the exact same size as if you had just fitted a normal TfidfVectorizer (-> no compression at all).
You would have to do something like

```python
super().fit(raw_documents, y)
super().__init__(**{**self.get_params(),
                    'vocabulary': Trie(self.vocabulary_.keys())})
# or: self.vocabulary = Trie(self.vocabulary_.keys())
super().fit(raw_documents, y)
```

That is, fit a first time to get the vocabulary, use this vocabulary to build a Trie representation, and set that Trie as the precomputed `vocabulary` parameter for a second call to fit. But in that case it is equivalent to the code above in the discussion. And you end up with two calls to fit (of course the second one doesn't cost much, because the vocab is already precomputed), and also with the vocab stored in both the `vocabulary_` and `vocabulary` attributes.
Using the code above, I get a 2% decrease in accuracy on the IMDB dataset, but the vectorizer size once pickled goes from 373.6 MB to 99.6 MB, which represents almost a 4x compression! This is for a dataset of 25k train reviews and 25k test reviews. The 25k train reviews on which the vectorizer is fitted lead to a 5,165,791-token vocabulary (for ngram_range=(1, 3) and other parameters set to sklearn default values). I've done no fine-tuning nor applied any feature selection method (as it is not our concern), but maybe that could also smooth out the 2% performance decrease.
I don't know much about Trie structures, but the idea is that the vocabulary is stored as a marisa-trie rather than a Python dict, which changes not only the word<->index mapping itself but also the underlying C++ implementation used to store the data and access elements. For that reason, I am not surprised if there are some minor differences in the tfidf vectors (maybe some hashing conflicts?)
Regarding the word <-> index mapping change:

> I don't know much about Trie structures, but the idea is that the vocabulary is stored as a marisa-trie rather than a Python dict, which changes not only the word<->index mapping itself but also the underlying C++ implementation used to store the data and access elements. For that reason, I am not surprised if there are some minor differences in the tfidf vectors (maybe some hashing conflicts?)
When we initialize the trie as follows:

```python
self.vocabulary_ = Trie(self.vocabulary_.keys())
```

we pass only the keys of the vocabulary, i.e. the `Trie` initializer knows nothing about the values. Of course the key-value pairs will be different; there is no guarantee that they will be even slightly similar to the key-value pairs created before by the sklearn vectorizer.
To preserve the mapping, we need to pass info about keys and values to the `Trie` initializer, not only keys. In general, a trie changes the inner implementation of storage, but it shouldn't change the key-value pairs.
The marisa-trie documentation looks scarce, but there is something about initialization with key-value pairs:
https://github.com/pytries/marisa-trie/blob/master/docs/tutorial.rst#marisa_trierecordtrie
There is no info about how to initialize `marisa_trie.Trie` with key-value pairs. But there is another structure, `marisa_trie.RecordTrie`, that looks suitable (because it supports the required initialization). Though, we will need a wrapper, because this structure works with lists of values, while we need only single values.
Anyway, only a 2% decrease in accuracy on the IMDB dataset is great.
Hi Elliot,
First of all, thank you for the interesting project!
However, it looks like there is a bug here.
It's said in the Readme:
BTW, "Ko" looks like a typo here, because the size of the pkl file that is downloaded via pip is 29.5 MB.
I have inspected the inner structure of the loaded pkl file. In this model, the vectorizer's `vocabulary_` is just a Python dict, not a trie. This contradicts what I see in the Readme, so I investigated the situation, and in the process of investigation I found another bug.
The trained model uses `MarisaTfidfVectorizer` for vectorization. It's derived from `sklearn.TfidfVectorizer`. There is no training code in the module downloaded from pip, but I found something here in the repo.
There is the following code in the latest version of `sentimentClassifier.py` in the repo:
If we look closely we can see a bug:
The positional argument passed to the `MarisaTfidfVectorizer` initializer is an instance of sklearn `TfidfVectorizer`. But the first parameter of `MarisaTfidfVectorizer` is `input` (as it is derived from `TfidfVectorizer`). It's not a C++-like copy constructor. According to the sklearn docs, this parameter is an enumeration: it must be `'filename'`, `'file'`, or `'content'`. Here you pass an unexpected value as `input`: an object of class `TfidfVectorizer`. There is no type checking in this part of the sklearn code, so `input` is only used in `_VectorizerMixin`, where it's compared with `'filename'` and `'file'`. As it's equal to neither, it is assumed to be `'content'` (which is the default, BTW). Thus, passing an object of class `TfidfVectorizer` as `input` doesn't have any effect compared to not passing `input` at all.
But why do you observe memory usage improvements? Probably, the reason is the following:
You initialize the `TfidfVectorizer` object (that is passed as `input`) with a non-default value of `ngram_range`: (1, 3). And in the code downloaded from pip there is the following variable (not used for prediction; it looks like it is provided only for info):
But, as the `TfidfVectorizer` object is not used (except for being stored in the `input` field) because of the initialization error described above, the `ngram_range` of the resulting vectorizer is actually (1, 1), which can be checked by inspecting the fields of the `MarisaTfidfVectorizer` object in the trained model file downloaded from pip. Of course, this decreases memory usage compared to the ngrams (1, 3) version of the original (non-Marisa) `TfidfVectorizer`.
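The fall-through behavior described above can be mimicked in isolation. This is a simplified sketch of that dispatch, not sklearn's actual code; `decode_like_sklearn` and `FakeVectorizer` are made-up names:

```python
def decode_like_sklearn(input, doc):
    """Sketch of how the `input` parameter is dispatched: anything that is
    not 'filename' or 'file' falls through and is treated as 'content'."""
    if input == 'filename':
        with open(doc) as f:
            return f.read()
    if input == 'file':
        return doc.read()
    return doc   # any other value of `input` -- including a vectorizer
                 # object passed by mistake -- silently acts as 'content'


class FakeVectorizer:   # stands in for a TfidfVectorizer passed as `input`
    pass


print(decode_like_sklearn('content', 'some text'))         # 'some text'
print(decode_like_sklearn(FakeVectorizer(), 'some text'))  # 'some text' too
```

Since both calls return the document unchanged, the bogus `input` value causes no error and no behavior difference, which is exactly why the mistake goes unnoticed.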