AmenRa / retriv

A Python Search Engine for Humans 🥸

Getting Out of Memory Error #36

Closed ashutosh486 closed 6 months ago

ashutosh486 commented 6 months ago

Hi,

I have a dataset with around 2 million rows, and each text is no more than 20 tokens. I tried building an index using SparseRetriever:

from retriv import SparseRetriever

sr = SparseRetriever(
  index_name="bm25",
  model="bm25",
  min_df=1,
  tokenizer="whitespace",
  stemmer="english",
  stopwords="english",
  do_lowercasing=True,
  do_ampersand_normalization=True,
  do_special_chars_normalization=True,
  do_acronyms_normalization=True,
  do_punctuation_removal=False,
)
# ids and descs are the ~2 million document IDs and texts
collections = [{"id": doc_id, "text": text} for doc_id, text in zip(ids, descs)]
sr.index(collections)

My disk space is around 14 GB and RAM is around 96 GB, with 24 processors. Is there any option to chunk the data and index it one chunk at a time?

ashutosh486 commented 6 months ago

Resolved the issue by limiting NumPy's CPU thread usage globally. Solution:

import os

# These variables are read when NumPy and its BLAS/OpenMP backends initialize,
# so set them before importing numpy or retriv.
os.environ["OMP_NUM_THREADS"] = "4"          # OpenMP
os.environ["OPENBLAS_NUM_THREADS"] = "4"     # OpenBLAS
os.environ["MKL_NUM_THREADS"] = "6"          # Intel MKL
os.environ["VECLIB_MAXIMUM_THREADS"] = "4"   # Apple Accelerate / vecLib
os.environ["NUMEXPR_NUM_THREADS"] = "6"      # NumExpr