AmenRa / ranx

⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
https://amenra.github.io/ranx
MIT License

[Feature Request] memory issue / make Run more efficient #48

Closed. PaulLerner closed this issue 1 year ago.

PaulLerner commented 1 year ago

Hi Elias,

Is your feature request related to a problem? Please describe. I've noticed that Run (and I guess also Qrels) consumes a lot of memory (RAM) compared to a standard Python dict, e.g. a few GB instead of a few hundred MB. This becomes problematic for somewhat large datasets (e.g. 1M queries).

Describe the solution you'd like I guess it's related to the Numba representation? I have no clue how to make it more efficient, sorry :)

Reproduce Just open your system monitor and watch the memory grow while running the snippet below.

```python
import ranx

# this weighs only a few hundred MB
run_d = {str(i): {str(j): 0.0 for j in range(100)} for i in range(100000)}

# this grows to a few GB
run_r = ranx.Run(run_d)
```
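For a more concrete measurement than the system monitor, something like the following could be used (psutil is an extra dependency used only for illustration here, not something ranx needs):

```python
# Rough RSS measurement of the current process; assumes `pip install psutil`.
import os

import psutil
import ranx


def rss_mb() -> float:
    """Resident set size of this process, in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024**2


print(f"after imports:     {rss_mb():.0f} MB")

run_d = {str(i): {str(j): 0.0 for j in range(100)} for i in range(100000)}
print(f"after Python dict: {rss_mb():.0f} MB")

run_r = ranx.Run(run_d)
print(f"after ranx.Run:    {rss_mb():.0f} MB")
```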

Best,

Paul

AmenRa commented 1 year ago

Dear Paul,

The issue is probably due to Numba itself (which I cannot do anything about) and to a forced conversion of the strings used as IDs to numba.types.unicode_type, which I introduced to avoid errors when I implemented the fusion algorithms.

I have tested a snippet similar to yours (without keeping the Python dict in memory) with and without the conversion. Memory usage went down from 2.42 GB to 1.59 GB (including the ranx import). The Python dict alone is around 1.15 GB.
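For illustration, this is roughly the kind of Numba-typed container involved; the layout below is a simplified sketch, not the exact internal representation:

```python
# Simplified sketch of a Numba-typed run (not ranx's exact internals).
# Every Python str key is converted to numba.types.unicode_type on insert,
# which adds per-string overhead compared to a plain Python dict.
from numba import types
from numba.typed import Dict

inner_type = types.DictType(types.unicode_type, types.float64)
run = Dict.empty(key_type=types.unicode_type, value_type=inner_type)

for q in range(1000):  # kept small: filling typed dicts from Python is slow
    docs = Dict.empty(key_type=types.unicode_type, value_type=types.float64)
    for d in range(100):
        docs[str(d)] = 0.0
    run[str(q)] = docs
```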

I'll try to remove the forced conversion without breaking the fusion algorithms and get back to you.

Thanks for pointing it out!

AmenRa commented 1 year ago

Fixed in v0.3.15.
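Upgrading should be enough to pick it up; for example, to check the installed version:

```python
# pip install -U ranx
from importlib.metadata import version

print(version("ranx"))  # should report 0.3.15 or later
```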