RubixML / ML

A high-level machine learning and deep learning library for the PHP language.
https://rubixml.com
MIT License

WordCountVectorizer Memory Issue #302

Closed Boorinio closed 1 year ago

Boorinio commented 1 year ago

Hello, I noticed that while training a model for category classification with WordCountVectorizer, an English stemmer, a vocabulary of 3k words, and a small MLP model, my RAM usage jumps to 26 GB. Do we know what's causing this? To get similar RAM usage in Python I need a vocabulary of roughly 60-70k words. Am I doing something wrong, maybe?

Best Regards and thanks for the hard work!

andrewdalpino commented 1 year ago

I'll have to take another look at the scikit-learn CountVectorizer (I'm assuming this is the one you were using in Python), but at first glance there's a difference in the data structures used under the hood that has a big effect on memory usage.

https://github.com/scikit-learn/scikit-learn/blob/364c77e04/sklearn/feature_extraction/text.py#L931
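For context, scikit-learn's CountVectorizer returns a SciPy CSR (compressed sparse row) matrix rather than a dense array. A minimal sketch of what CSR storage looks like, using only Python's typed `array` module (the toy matrix and helper function here are illustrative, not scikit-learn code):

```python
from array import array

# Sketch of CSR storage: only nonzero counts are kept, in compact typed
# arrays with no per-element metadata.
#
# Dense word-count matrix (2 documents x 5 vocabulary terms):
# [[0, 2, 0, 0, 1],
#  [1, 0, 0, 3, 0]]

data    = array('b', [2, 1, 1, 3])   # nonzero counts, 1 byte each
indices = array('i', [1, 4, 0, 3])   # column index of each nonzero
indptr  = array('i', [0, 2, 4])      # row i's nonzeros live in data[indptr[i]:indptr[i+1]]

def get(row, col):
    """Look up a single count from the CSR arrays (illustrative helper)."""
    for k in range(indptr[row], indptr[row + 1]):
        if indices[k] == col:
            return data[k]
    return 0

print(get(0, 1), get(0, 0), get(1, 3))  # 2 0 3
```

Note that the zeros in the dense matrix never appear in `data` at all, and the counts fit in one byte each, which is where the savings over a dense 64-bit representation come from.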

Rubix ML's Word Count Vectorizer uses PHP arrays under the hood, whereas scikit-learn's Count Vectorizer uses sparse NumPy arrays. I'm pretty sure int/float scalars in PHP arrays occupy more than 128 bits each once you account for the 64-bit float/int plus the extra zval "metadata" such as the reference count, and then another 64 bits to store the index. In contrast, NumPy arrays do not store a separate index or any extra metadata, and scalars need not be 64-bit; they can go as low as 8 bits, I believe. Combine that with a sparse implementation (zeros are not explicitly allocated in memory) and you get a huge memory saving in the Python implementation when representing word count vectors.
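The difference is easy to make visible. A rough sketch (the vocabulary size matches the issue; the count of distinct words per document is an assumed figure) comparing a dense hash-map representation, analogous to how a PHP array would hold a full word-count vector, against a sparse map that stores only the nonzero counts:

```python
import sys

VOCAB_SIZE = 3000   # vocabulary of ~3k words, as in the issue
NONZERO = 50        # assumption: a short document hits ~50 distinct words

# Dense representation: every vocabulary slot gets an entry, zeros included
# (roughly what a per-sample PHP array of counts looks like).
dense = {i: 0 for i in range(VOCAB_SIZE)}
for i in range(NONZERO):
    dense[i] = 1

# Sparse representation: only the nonzero counts are stored.
sparse = {i: 1 for i in range(NONZERO)}

# getsizeof measures only the container itself (not the keys/values),
# but the order-of-magnitude gap is the point.
dense_bytes = sys.getsizeof(dense)
sparse_bytes = sys.getsizeof(sparse)
print(dense_bytes, sparse_bytes)
```

Multiply that per-sample gap across an entire training set and the difference you observed between the two implementations is plausible.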

Boorinio commented 1 year ago

Alright thanks for the answer!