cvangysel / pytrec_eval

pytrec_eval is an Information Retrieval evaluation tool for Python, based on the popular trec_eval.
http://ilps.science.uva.nl/
MIT License
282 stars 32 forks source link

Memory leak problem #15

Open navid-rekabsaz opened 5 years ago

navid-rekabsaz commented 5 years ago

Hi,

Thanks for sharing the tool! It is indeed a very useful one!

I noticed a memory leak problem when running evaluator.evaluate. It allocates memory slightly higher than the size of the run file, but never releases the memory. To reproduce it, I attached a simple program, with sample qrel and run files.

pytrec_eval_test.zip

Here is the results of memory profiling. As you can see del res does not release the memory, leading to allocation of large memory after several runs:

Line # Mem usage Increment Line Contents

79   55.574 MiB   55.574 MiB   @profile
80                             def runme():
81                                 
82   57.930 MiB    2.355 MiB       qrel = load_reference('qrels.txt')
83  106.215 MiB   48.285 MiB       run = load_candidate('run.txt')
84                                 
85                                 
86  107.570 MiB    1.355 MiB       evaluator = pytrec_eval.RelevanceEvaluator(qrel, {'map', 'ndcg'})
87                             
88  107.570 MiB    0.000 MiB       N = 100
89 1320.098 MiB    0.000 MiB       for i in range(1,N):
90 1320.098 MiB   18.855 MiB           res = evaluator.evaluate(run)
91 1320.098 MiB    0.000 MiB           del res

The problem made me go back to old style ac-hoc running of trec_eval. It would be great to get through it, and I am happy to help.

Best, Navid

seanmacavaney commented 4 years ago

This fix should take care of the problem! Copies of the qid and docno were not being freed properly.

Line #    Mem usage    Increment   Line Contents
================================================
    79   52.750 MiB   52.750 MiB   @profile
    80                             def runme():
    81                                 
    82   55.148 MiB    2.398 MiB       qrel = load_reference('qrels.txt')
    83  103.582 MiB   48.434 MiB       run = load_candidate('run.txt')
    84                             
    85  104.793 MiB    1.211 MiB       evaluator = pytrec_eval.RelevanceEvaluator(qrel, {'map', 'ndcg'})
    86                             
    87  104.793 MiB    0.000 MiB       N = 100
    88  123.664 MiB    0.000 MiB       for i in range(1,N):
    89  123.664 MiB   18.832 MiB           res = evaluator.evaluate(run)
    90  123.664 MiB    0.000 MiB           del res
Ricocotam commented 4 years ago

Great, I quit this tool for this reason, thanks for fixing it. Just a thing on your PR, it's big, author might not accept it directly. May be consider having several smaller PR for each issue ? Since they are not related

Thanks a lot for this work

seanmacavaney commented 4 years ago

@Ricocotam I agree in spirit (and it was originally the plan to have multiple PRs), but there's a lot of interdependence between the changes made (e.g., 3 of the 4 rely on the addition of a python wrapper around the extension). @cvangysel will this hold up the PR, and would it help speed up the process if I split them up? I could, but it would be a pain.