boudinfl / pke

Python Keyphrase Extraction module

Memory leak in TopicRank #227

Closed NathancWatts closed 1 year ago

NathancWatts commented 1 year ago

I am running a distributed processing pipeline using Dask that is ingesting millions of documents, and I have found that the workers are quickly running out of memory due to a memory leak in TopicRank over the course of ~500 batches of 1000 documents each (each worker has 14 GB, which should be plenty).

I have the spacy model pre-loaded and stored in the worker, so the model is only being loaded once for each worker thread, but I am re-initializing the TopicRank keyphrase extractor for each document. The code is something along the lines of:

```python
extractor.load_document(input=text, language=lang, spacy_model=pke_model)
extractor.candidate_selection()
extractor.candidate_weighting()
keyphrases = extractor.get_n_best(n=n)
```

I am confident that the leak is occurring in these four lines: I've moved a `return []` through the function and found that the memory leak appears if I return after these lines but not before them. I've tried to pinpoint the exact location of the leak with tracemalloc while running single-threaded, but unfortunately that runs much too slowly to make any progress on diagnosing it.
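
For reference, the kind of tracemalloc comparison I attempted looks roughly like this (a sketch, not my exact harness; `batch` stands in for one batch of documents):

```python
import tracemalloc

tracemalloc.start(25)  # keep 25 frames per allocation trace
before = tracemalloc.take_snapshot()

for text in batch:  # placeholder for one batch of raw document strings
    extract_keyphrases(text=text)

after = tracemalloc.take_snapshot()
# Show the lines whose allocations grew the most between the two snapshots
for stat in after.compare_to(before, 'lineno')[:10]:
    print(stat)
```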

The leak isn't resolved by explicitly deleting the extractor, calling gc.collect(), or calling libc.malloc_trim(), so it's not just unreclaimed memory: something isn't being tracked and isn't getting freed.
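
Concretely, the cleanup attempts were along these lines (a sketch; malloc_trim assumes Linux glibc):

```python
import ctypes
import gc

del extractor                     # drop the last reference to the extractor
gc.collect()                      # force a full collection, including reference cycles
libc = ctypes.CDLL("libc.so.6")   # glibc-specific, not portable
libc.malloc_trim(0)               # ask the allocator to return freed arenas to the OS
```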

ygorg commented 1 year ago

EDIT: My bad, we fixed this behaviour a while ago. Please share more code; in Python there should not be memory leaks, but there might be something that accumulates data?

I think it might be because you use only one TopicRank object. For example, if I process many files I'll do something like:

```python
# extractor = TopicRank()  # Not that !!!
for d in docs:
    extractor = TopicRank()  # That
    extractor.load_document(d)
    extractor.candidate_selection()
    extractor.candidate_weighting()
    keyphrases = extractor.get_n_best(n=n)
```

It is very important that the extractor is recreated for each different document: `extractor.load_document` does not reset `extractor.candidates`, so candidates accumulate with each call to `extractor.candidate_selection`.
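
Before the fix mentioned in the edit above, the accumulation was directly visible if you reused a single extractor; a minimal sketch:

```python
# Sketch: with one reused extractor (in versions before the fix),
# the candidate count grows across documents instead of resetting.
extractor = TopicRank()
for d in docs:
    extractor.load_document(d)
    extractor.candidate_selection()
    print(len(extractor.candidates))  # keeps increasing document after document
```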

Also, please note that pke is a research tool and not suited for production use. I would advise extracting only the code useful for your use case and optimizing that instead of using pke as is! But I'm very glad to know that you use it at large scale!!

NathancWatts commented 1 year ago

Sure thing; I am saving the spacy model within the "worker" object here, but I'm re-initializing the pke extractor for each document. This is all (effectively) happening within the loop:

```python
import pke
from dask.distributed import get_worker

def extract_keyphrases(text=None, lang='en', n=20, stoplist=[]):
    extractor = pke.unsupervised.TopicRank()
    worker = get_worker()
    # Check if the spacy model is already loaded; if it isn't, load it now
    # and cache it on the worker so each worker loads it only once.
    try:
        pke_model = worker.pke_model
    except AttributeError:
        import spacy
        pke_model = spacy.load("en_core_web_sm")
        worker.pke_model = pke_model

    extractor.load_document(input=text, language=lang, stoplist=stoplist, spacy_model=pke_model)
    extractor.candidate_selection()
    try:
        extractor.candidate_weighting()
    except Exception:
        return list()

    keyphrases = extractor.get_n_best(n=n)
    return keyphrases
```
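
For completeness, the driver side just maps this function over batches with the distributed client, roughly like this (a sketch; the scheduler address and batching names are placeholders):

```python
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

for batch in batches:  # each batch is roughly 1000 raw document strings
    futures = client.map(extract_keyphrases, batch)
    keyphrases = client.gather(futures)
```
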
ygorg commented 1 year ago

The code looks fine. If you try with the "FirstPhrases" extractor (which is simpler), do you still have this issue? And how many candidates are extracted from the documents (if you map `len(extractor.candidates)` over each document)? Maybe this can give some insight. You can also try preprocessing the documents with spacy and passing those to `extract_keyphrases`, like this:

```python
def preprocess(text, lang='en'):  # lang given a default so map() can pass text alone
    worker = get_worker()  # added: `worker` was undefined in the original snippet
    try:
        pke_model = worker.pke_model
    except AttributeError:
        import spacy
        pke_model = spacy.load("en_core_web_sm")
        worker.pke_model = pke_model
    return pke_model(text)

def extract_keyphrases(doc=None, lang='en', n=20, stoplist=[]):
    extractor = pke.unsupervised.TopicRank()
    extractor.load_document(input=doc, language=lang, stoplist=stoplist)
    extractor.candidate_selection()
    try:
        extractor.candidate_weighting()
    except Exception:
        return list()
    keyphrases = extractor.get_n_best(n=n)
    return keyphrases

docs = map(preprocess, docs)
kps = map(extract_keyphrases, docs)
```

Apart from this, I don't know if I can be of any more help :(

NathancWatts commented 1 year ago

Thanks for the suggestions! Using "FirstPhrases", the memory leak still appears to occur, but it seems to build up a bit more slowly. However, quite excitingly, when I moved the preprocessing step to a separate function, the memory leak appears to be resolved (or at least very significantly reduced)! Could it be that the extractor is hanging on to a copy of the spacy model or something when it is passed into load_document()?

EDIT: From digging into what the difference could be, it's possible that the memory leak is actually somewhere in RawTextReader. I'm going to let it keep running just to make sure the issue is resolved and keep at it, but after 300 iterations I should have seen it by now. Very exciting! Thank you very much.
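
In case it helps anyone, the kind of probe I'd use to test the "extractor keeps a reference to the spacy model" theory is roughly this (a sketch; sys.getrefcount is noisy, but the trend is informative):

```python
import sys

print(sys.getrefcount(pke_model))   # baseline reference count

extractor = pke.unsupervised.TopicRank()
extractor.load_document(input=text, language='en', spacy_model=pke_model)
print(sys.getrefcount(pke_model))   # higher if load_document kept a reference

del extractor
print(sys.getrefcount(pke_model))   # should drop back if nothing is leaked
```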

ygorg commented 1 year ago

Thanks for your experiments! If loading documents beforehand reduces memory usage, then I'm closing this issue for now.