Open q0o0p opened 4 years ago
Hi q0o0p,
Thank you for looking closely at the code. I must confess I wrote it quite some time ago, and it was not intended to be shared at first; I just realized afterwards that the project could be useful to some people. That's why it's quite a mess, and it has some inconsistencies between PyPI and GitHub. So congrats for inspecting it anyway!
I read your comment and took a look: I agree with everything you said, thank you for pointing that out!
This might be a correction:

```python
from marisa_trie import Trie
from sklearn.feature_extraction.text import TfidfVectorizer


class MarisaTfidfVectorizer(TfidfVectorizer):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def _freeze_vocabulary(self):
        if not self.fixed_vocabulary_:
            self.vocabulary_ = Trie(self.vocabulary_.keys())
            self.fixed_vocabulary_ = True
            del self.stop_words_

    def fit(self, raw_documents, y=None):
        super().fit(raw_documents, y)
        self._freeze_vocabulary()
        return self

    def fit_transform(self, raw_documents, y=None):
        self.fit(raw_documents, y)
        return self.transform(raw_documents, copy=True)

    def get_params(self, deep=True):
        """Overrides get_params to get a correct __repr__"""
        params_ = {param: value for param, value in self.__dict__.items()
                   if param.strip('_') == param}
        return {**self._tfidf.get_params(), **params_}


if __name__ == '__main__':
    import joblib
    from string import ascii_lowercase
    from numpy.random import RandomState

    _NDOCS = 5_000
    _LETTERS = [*ascii_lowercase]
    prng = RandomState(seed=2020)

    def make_random_sentence() -> str:
        return ' '.join(''.join(prng.choice(_LETTERS, prng.randint(1, 10)))
                        for _ in range(prng.randint(1, 30)))

    docs = [make_random_sentence() for _ in range(_NDOCS)]

    tfidf = TfidfVectorizer(ngram_range=(1, 3))
    print(repr(tfidf.fit_transform(docs)))
    # >> <5000x178662 sparse matrix of type '<class 'numpy.float64'>'
    # >> with 188411 stored elements in Compressed Sparse Row format>
    joblib.dump(tfidf, 'base.joblib')

    tfidf = MarisaTfidfVectorizer(ngram_range=(1, 3))
    print(repr(tfidf.fit_transform(docs)))
    # >> <5000x178662 sparse matrix of type '<class 'numpy.float64'>'
    # >> with 188411 stored elements in Compressed Sparse Row format>
    joblib.dump(tfidf, 'marisa.joblib')
```
If you run the code above, you will see that this time both vectorizers have ngram_range set to (1, 3) and output sparse matrices with the same shape and number of non-zero elements.
When they are serialized, base.joblib is 12.4 MB and marisa.joblib is 3.6 MB, which represents a ~3.4x compression (but the compression power depends on the size of the vocabulary, so this is not a fixed compression boost).
I also gave it a try on a real-world dataset: the IMDB sentiment analysis dataset. I trained a basic pipeline with `TfidfVectorizer(ngram_range=(1, 3))` and with `MarisaTfidfVectorizer(ngram_range=(1, 3))`, each followed by `MultinomialNB()`, which gave ~87% and ~85% accuracy respectively (that is, a 2% decrease in perfs).
Thank you for the response!
Now initialization looks good.
Though, if I understand correctly, the accuracy shouldn't change, because we don't change the logic of the vectorizer, only the format of its stored inner details.
If you fit it on a small example (3 letters, 5 docs, 5 words per doc) and print `self.vocabulary_.items()` in `_freeze_vocabulary` before and after the `self.vocabulary_ = Trie(self.vocabulary_.keys())` instruction, you can see something like this:

```
vocab before: [('ab', 0), ('abbbcbc', 1), ('babbbbb', 2), ('bac', 3), ('bccccb', 4), ('cabb', 5), ('cbabaaacb', 6), ('cbacbcba', 7), ('ccaca', 8), ('ccacbccaa', 9), ('cccac', 10), ('cccbaaaa', 11)]
vocab after:  [('ab', 0), ('abbbcbc', 3), ('babbbbb', 6), ('bac', 7), ('bccccb', 2), ('cabb', 1), ('cbabaaacb', 4), ('cbacbcba', 5), ('ccaca', 8), ('ccacbccaa', 9), ('cccac', 10), ('cccbaaaa', 11)]
```
I.e. some key-value pairs have changed. As a result, in `transform` some words will be used with the wrong tfidf coefficients.
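The consequence can be sketched without sklearn or marisa-trie. In this made-up example (the vocabularies, indices, and idf values are purely illustrative), the idf weights were computed under the original word-to-index mapping, so a lookup through a re-indexed vocabulary silently pairs a word with another word's weight:

```python
# idf weights computed while the original mapping was in effect
old_vocab = {'ab': 0, 'abbbcbc': 1, 'babbbbb': 2}
idf = {0: 1.10, 1: 2.50, 2: 3.30}   # idf[index], built from old_vocab

# A trie built from keys alone assigns its own ids; here the ids are
# permuted by hand just to force a mismatch (marisa-trie's actual ids
# depend on its internals).
new_vocab = {'ab': 0, 'abbbcbc': 2, 'babbbbb': 1}

word = 'abbbcbc'
print(idf[old_vocab[word]])  # 2.5 -> the weight that was computed for this word
print(idf[new_vocab[word]])  # 3.3 -> another word's weight, used silently
```

Nothing raises here, which is why the bug only shows up as a small accuracy drop rather than an error.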
In your original version of `MarisaTfidfVectorizer` (that is, the current version in the repo), in the method `fit` you run `super(MarisaTfidfVectorizer, self).fit(raw_documents, y)` again after the Trie transformation of the vocabulary, which I think is correct.
I tried what you said, and here are my observations and thoughts.
If you call `super(MarisaTfidfVectorizer, self).fit(raw_documents, y)` again, the Trie version of the vocabulary stored in `self.vocabulary_` will be ignored, because even if we set the `self.fixed_vocabulary_` attribute to True, the sklearn TfidfVectorizer constructor expects a `vocabulary` parameter if the user wants to specify a precomputed vocabulary. Without that parameter, the TfidfVectorizer will by default compute the vocabulary from the corpus again and store it as a regular Python dict.
So if you do this, you indeed get the exact same representation of the tfidf weights in `transform`, but that's because you actually get the same TfidfVectorizer object. As a consequence, the marisa part is ignored/useless, and you get a model with the exact same size as if you had just fitted a normal TfidfVectorizer (-> no compression at all).
You would have to do something like

```python
super().fit(raw_documents, y)
super().__init__(**{**self.get_params(),
                    'vocabulary': Trie(self.vocabulary_.keys())})
# or: self.vocabulary = Trie(self.vocabulary_.keys())
super().fit(raw_documents, y)
```

That is, fit a first time to get the vocabulary, use this vocabulary to build a Trie representation, and set that Trie as the precomputed `vocabulary` parameter for a second call to fit. But in that case it is equivalent to the code above in the discussion. And you end up with two calls to fit (of course the second one doesn't cost much, because the vocab is already precomputed), and also with the vocab stored in both the `vocabulary_` and `vocabulary` attributes.
Using the code above, I get a 2% decrease in accuracy on the IMDB dataset, but the vectorizer size once pickled goes from 373.6 MB to 99.6 MB, which represents almost a 4x compression! This is for a dataset of 25k train reviews and 25k test reviews. The 25k train reviews on which the vectorizer is fitted lead to a 5,165,791-token vocabulary (for ngram_range=(1, 3) and other parameters set to sklearn default values). I've done no fine-tuning nor applied any feature selection method (as it is not our concern), but maybe that could also smooth out the 2% performance decrease.
I don't know much about Trie structures, but the idea is that the vocabulary is stored as a marisa-trie rather than a Python dict, which changes not only the word<->index mapping itself but also the underlying C++ implementation used to store the data and access elements. For that reason, I am not surprised if there are some minor differences in the tfidf vectors (maybe some hashing conflicts?)
Regarding the word <-> index mapping change:

> I don't know much about Trie structures, but the idea is that the vocabulary is stored as a marisa-trie rather than a Python dict, which changes not only the word<->index mapping itself but also the underlying C++ implementation used to store the data and access elements. For that reason, I am not surprised if there are some minor differences in the tfidf vectors (maybe some hashing conflicts?)
When we initialize the trie as follows:

```python
self.vocabulary_ = Trie(self.vocabulary_.keys())
```

we pass only the keys of the vocabulary, i.e. the `Trie` initializer knows nothing about the values. Of course the key-value pairs will be different; there is no guarantee that they will be even slightly similar to the key-value pairs created before by the sklearn vectorizer.
To preserve the mapping, we need to pass info about keys and values to the `Trie` initializer, not only keys. In general, a trie changes the inner implementation of storage, but it shouldn't change the key-value pairs.
The marisa-trie documentation looks scarce, but there is something about initialization with key-value pairs:
https://github.com/pytries/marisa-trie/blob/master/docs/tutorial.rst#marisa_trierecordtrie
There is no info about how to initialize `marisa_trie.Trie` with key-value pairs. But there is another structure, `marisa_trie.RecordTrie`, that looks suitable (because it supports the required initialization). Though, we will need a wrapper, because this structure works with lists of values, while we need only single values.
Anyway, only a 2% decrease in accuracy on the IMDB dataset is great.
Hi Elliot,
First of all, thank you for the interesting project!
However, it looks like there is a bug here.
It's said in the Readme:
BTW, "Ko" looks like a typo here, because the size of the pkl file that is downloaded via pip is 29.5 MB.
I have inspected the inner structure of the loaded pkl file. In this model, the vectorizer's `vocabulary_` is just a Python dict, not a trie. This contradicts what I see in the Readme, so I investigated the situation, and in the process of investigation I found another bug.
The trained model uses `MarisaTfidfVectorizer` for vectorization. It's derived from `sklearn.TfidfVectorizer`. There is no training code in the module downloaded from pip, but I found something here in the repo.
There is the following code in the latest version of `sentimentClassifier.py` in the repo:
If we look closely we can see a bug:
The positional argument passed to the `MarisaTfidfVectorizer` initializer is an instance of sklearn `TfidfVectorizer`. But the first parameter of `MarisaTfidfVectorizer` is `input` (as it is derived from `TfidfVectorizer`). It's not a C++-like copy constructor. According to the sklearn docs, this parameter is an enumeration: it must be `'filename'`, `'file'`, or `'content'`. Here you pass an unexpected value as `input`: an object of class `TfidfVectorizer`. There is no type checking in this part of the sklearn code, so `input` is only used in `_VectorizerMixin`, where it's compared with `'filename'` and `'file'`. As it's equal to neither, it is assumed to be `'content'` (which is the default, BTW). Thus, passing an object of class `TfidfVectorizer` as `input` doesn't have any effect compared to not passing `input` at all.
But why do you observe memory usage improvements? Probably, the reason is the following:
You initialize the `TfidfVectorizer` object (that is passed as `input`) with a non-default value of `ngram_range`: (1, 3). And in the code downloaded from pip there is the following variable (not used for prediction; it looks like it is provided only for info):
But, as the `TfidfVectorizer` object is not used (except for being stored in the `input` field) because of the initialization error described above, the `ngram_range` of the resulting vectorizer is actually (1, 1), which can be checked by inspecting the fields of the `MarisaTfidfVectorizer` object in the trained model file downloaded from pip. Of course, this decreases memory usage compared to the ngrams (1, 3) version of the original (non-Marisa) `TfidfVectorizer`.
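The fall-through behavior described above can be mimicked in isolation. This is a simplified sketch of that dispatch, not sklearn's actual code; `decode_like_sklearn` and `FakeVectorizer` are made-up names:

```python
def decode_like_sklearn(input, doc):
    """Sketch of how the `input` parameter is dispatched: anything that is
    not 'filename' or 'file' falls through and is treated as 'content'."""
    if input == 'filename':
        with open(doc) as f:
            return f.read()
    if input == 'file':
        return doc.read()
    return doc   # any other value of `input` -- including a vectorizer
                 # object passed by mistake -- silently acts as 'content'


class FakeVectorizer:   # stands in for a TfidfVectorizer passed as `input`
    pass


print(decode_like_sklearn('content', 'some text'))         # 'some text'
print(decode_like_sklearn(FakeVectorizer(), 'some text'))  # 'some text' too
```

Since both calls return the document unchanged, the bogus `input` value causes no error and no behavior difference, which is exactly why the mistake goes unnoticed.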