ganesh-morsu opened 11 months ago
That is most likely the result of a large vocabulary. Setting min_df to a value higher than 1 will reduce the necessary RAM. You can do that by using a custom TF-IDF model.
I have created a custom TF-IDF model and tried increasing the min_df value, but I am still facing the same issue. Below is the code for my custom model:
```python
from polyfuzz.models import TFIDF
from sklearn.feature_extraction.text import TfidfVectorizer


class CustomTFIDF(TFIDF):
    def __init__(self,
                 n_gram_range=(3, 3),
                 clean_string=True,
                 min_similarity=0.75,
                 top_n=1,
                 cosine_method="sparse",
                 model_id=None,
                 min_df_custom=2):  # Add a custom parameter for min_df
        super().__init__(n_gram_range, clean_string, min_similarity, top_n, cosine_method, model_id)
        self.min_df_custom = min_df_custom  # Set the custom min_df value

    def _extract_tf_idf(self,
                        from_list,
                        to_list=None,
                        re_train=True):
        if to_list:
            if re_train:
                # Customize the TfidfVectorizer with min_df
                self.vectorizer = TfidfVectorizer(min_df=self.min_df_custom,
                                                  analyzer=self._create_ngrams).fit(to_list + from_list)
                self.tf_idf_to = self.vectorizer.transform(to_list)
            tf_idf_from = self.vectorizer.transform(from_list)
        else:
            if re_train:
                # Customize the TfidfVectorizer with min_df
                self.vectorizer = TfidfVectorizer(min_df=self.min_df_custom,
                                                  analyzer=self._create_ngrams).fit(from_list)
                self.tf_idf_to = self.vectorizer.transform(from_list)
            tf_idf_from = self.tf_idf_to

        return tf_idf_from, self.tf_idf_to
```
You can try setting the min_df value much higher than 2. Setting it to at least 10 is most likely to help.
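To see why raising min_df helps, here is a small sklearn-only sketch (illustrative, not PolyFuzz's internals): character n-grams that appear in fewer than min_df documents are dropped from the vocabulary, which directly shrinks the width of the TF-IDF matrix.

```python
# Illustrative sketch: how min_df shrinks the TF-IDF vocabulary,
# and with it the memory footprint of the matrix. The toy corpus
# below is made up for demonstration.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "apple records ltd", "apple record ltd", "appel records limited",
    "orange records ltd", "orange recordings ltd", "banana holdings",
]

# Character 3-grams, mirroring PolyFuzz's default n_gram_range=(3, 3).
low = TfidfVectorizer(analyzer="char", ngram_range=(3, 3), min_df=1).fit(corpus)
high = TfidfVectorizer(analyzer="char", ngram_range=(3, 3), min_df=2).fit(corpus)

# Raising min_df drops n-grams that occur in only one document,
# so the vocabulary (matrix width) shrinks.
print(len(low.vocabulary_), len(high.vocabulary_))
```

On a real dataset with hundreds of thousands of strings, most character n-grams are rare, so a higher min_df can cut the vocabulary dramatically.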
I am still facing the same issue, even after changing to a higher value. I have tried min_df = 10, min_df = 15, and min_df = 20.

The error I am getting:

```
MemoryError: Unable to allocate 207. GiB for an array with shape (27815314339,) and data type int64
```
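For scale, a quick back-of-the-envelope check (variable names are mine; the shape and dtype come from the error message) shows the allocation matches the reported 207 GiB, and that the array length is close to the square of the record count, which suggests a dense pairwise structure is being materialized somewhere:

```python
# Sanity-check the failed allocation from the MemoryError message.
import math

n_elements = 27815314339          # array shape reported by the MemoryError
bytes_per_elem = 8                # int64
gib = n_elements * bytes_per_elem / 2**30
print(round(gib))                 # -> 207, matching the error

# The length is roughly the square of the record count (~166,793),
# hinting at an all-pairs comparison being built in memory.
print(round(math.sqrt(n_elements)))   # -> 166779
```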
Have you tried using `pip install polyfuzz[fast]`? I believe it should reduce the memory allocation here. Also, you can use cosine_method="knn" instead of "sparse" to reduce memory. I would advise trying out these two options.
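The idea behind the "knn" option is to retrieve only the best match per query instead of materializing a full pairwise similarity matrix. Conceptually (this is a sketch using sklearn's NearestNeighbors, not PolyFuzz's actual implementation, and the toy lists are made up):

```python
# Conceptual sketch of kNN-based matching: find the single best match
# per query string without building the full similarity matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

to_list = ["apple inc", "google llc", "microsoft corp"]
from_list = ["aple inc", "goggle llc"]

vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3)).fit(to_list + from_list)
tf_idf_to = vectorizer.transform(to_list)
tf_idf_from = vectorizer.transform(from_list)

# Cosine distance = 1 - cosine similarity; asking for one neighbour per
# query keeps peak memory proportional to the number of queries rather
# than queries x targets.
nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(tf_idf_to)
distances, indices = nn.kneighbors(tf_idf_from)
matches = [to_list[i[0]] for i in indices]
print(matches)  # -> ['apple inc', 'google llc']
```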
My data contains around 166,793 records that I want to fit with the TF-IDF model.

I am facing the issue while fitting the model: the server gets killed (I have tried with 20 GB of RAM). Is there any solution?