Closed Boorinio closed 1 year ago
I'll have to take another look at the SciKit Learn Count Vectorizer (I'm assuming that's the one you were using in Python), but at first glance there's a difference in the data structures used under the hood that has a big effect on memory usage.
https://github.com/scikit-learn/scikit-learn/blob/364c77e04/sklearn/feature_extraction/text.py#L931
Rubix ML's Word Count Vectorizer uses PHP arrays under the hood, whereas SciKit Learn's Count Vectorizer uses sparse NumPy arrays. I'm pretty sure each int/float scalar in a PHP array occupies more than 128 bits once you account for the 64-bit float/int plus the extra zval "metadata" such as the reference count, and then another 64 bits to store the index. In contrast, NumPy arrays store neither a separate index nor any extra metadata, and scalars need not be 64-bit; they can go as low as 8 bits, I believe. Combine that with a sparse implementation (zeros are not explicitly allocated in memory) and you get a huge memory savings in the Python implementation when representing word count vectors.
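A rough stdlib sketch of the per-element overhead described above: a hash map with per-entry bookkeeping (used here as a stand-in for a PHP array) versus a packed buffer of 8-bit scalars (a stand-in for a NumPy int8 array). The dict/array pairing is an assumption for illustration only; Rubix ML and NumPy internals differ in detail, but the overhead gap is the same idea.

```python
import array
import sys

n = 3000  # e.g. one count vector over a 3k-term vocabulary

# Hash map: each entry carries a key slot, a value pointer, and table
# bookkeeping on top of the count itself.
as_map = {i: 1 for i in range(n)}
map_bytes = sys.getsizeof(as_map)  # table only; the boxed int values add more

# Packed buffer: one byte per count, no per-element index or metadata.
as_packed = array.array('b', [1] * n)
packed_bytes = sys.getsizeof(as_packed)

print(f"hash map:    {map_bytes} bytes ({map_bytes / n:.1f} per element)")
print(f"packed int8: {packed_bytes} bytes ({packed_bytes / n:.1f} per element)")
```

On CPython the hash map comes out many times larger per element, and that gap widens further once a sparse layout skips the zero entries entirely.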
Alright, thanks for the answer!
Hello, I noticed that while training a model with WordCountVectorizer for category classification (with an English stemmer, a vocabulary of 3k words, and a small MLP model), my RAM usage jumps to 26 GB. Do we know what's causing this? To get similar RAM usage in Python I need a vocabulary of roughly 60-70k words. Am I doing something wrong, maybe?
Best regards, and thanks for the hard work!