alvations / pywsd

Python Implementations of Word Sense Disambiguation (WSD) Technologies.
MIT License

Cached signatures could be replaced by json to improve performance #72

Open mihal277 opened 2 years ago

mihal277 commented 2 years ago

Hi everyone. I'm working with pywsd, which I find to be a very helpful library.

The issue I'm having is with loading time and memory usage. I mostly use adapted_lesk. On my computer it takes around 2.5 seconds to warm up the library, and afterwards it uses more than a gigabyte of RAM.

I ran some experiments and found that the slowest step when loading the library is pd.read_pickle(signatures_picklefile).

I checked that the pickle file uses protocol 2. When I tried protocol 5, the loading time of this file dropped from 1.58 seconds to 0.98 seconds, which is already a significant improvement.
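For anyone who wants to check this on their own file, here is a minimal sketch using only the standard library pickle module. The helper names (pickle_protocol, upgrade_pickle) and the sample signatures dict are made up for illustration and are not part of pywsd. Protocol 5 needs Python 3.8 or newer.

```python
import pickle


def pickle_protocol(raw: bytes) -> int:
    """Return the protocol number stored in a pickle byte string.

    Pickles written with protocol 2 or newer start with the PROTO
    opcode (0x80) followed by one byte holding the protocol number.
    """
    if len(raw) >= 2 and raw[0] == 0x80:
        return raw[1]
    return 0  # protocols 0 and 1 have no header; reported here as 0


def upgrade_pickle(raw: bytes, protocol: int = pickle.HIGHEST_PROTOCOL) -> bytes:
    """Load a pickle and re-serialize it with a newer protocol."""
    return pickle.dumps(pickle.loads(raw), protocol=protocol)


# Hypothetical stand-in for the cached signatures (not the real pywsd data).
signatures = {"dog.n.01": ["canine", "domestic", "animal"]}

old = pickle.dumps(signatures, protocol=2)
new = upgrade_pickle(old, protocol=5)

print(pickle_protocol(old), pickle_protocol(new))
```

The same check works on a file by passing in the bytes read from it in binary mode.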

Then I tested json, and the improvement is even bigger: the file takes only 0.8 seconds to load on my computer, almost a 50% drop from the original 1.58 seconds.

Memory usage also drops significantly, by around 50% as well.

So I think replacing pickle/pandas with json/dict would be an improvement.
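A rough way to reproduce this kind of comparison is sketched below. It assumes the signatures can be represented as a plain dict mapping a sense name to a list of words, which is a guess about the data layout and not the actual pywsd format. The data is synthetic and time_load is a helper made up for this example. Timings will vary by machine, so none are quoted here.

```python
import json
import pickle
import tempfile
import time
from pathlib import Path


def time_load(path, loader):
    """Load a file with the given loader; return (data, elapsed seconds)."""
    start = time.perf_counter()
    with open(path, "rb") as fh:
        data = loader(fh)
    return data, time.perf_counter() - start


# Synthetic stand-in for the signatures: sense name -> list of words.
signatures = {
    f"sense.n.{i:05d}": [f"word{j}" for j in range(20)]
    for i in range(20_000)
}

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "signatures.pkl"
    json_path = Path(tmp) / "signatures.json"

    # Write the same data in both formats.
    pkl_path.write_bytes(pickle.dumps(signatures, protocol=2))
    json_path.write_text(json.dumps(signatures), encoding="utf-8")

    from_pickle, t_pickle = time_load(pkl_path, pickle.load)
    from_json, t_json = time_load(json_path, json.load)

print(f"pickle (protocol 2): {t_pickle:.3f}s")
print(f"json:                {t_json:.3f}s")
```

One caveat with this approach: json only stores basic types, so anything like a set would have to be converted to a list on save and back again on load.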

Another thing I'd like to ask: is it necessary to import all the modules in the __init__.py file? If I only use adapted_lesk, do the other modules have to be loaded? In particular, pywsd.similarity loads some extra resources on import, which also takes time and uses memory.
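One possible way to defer those imports is a module-level __getattr__ (PEP 562, Python 3.7+), which imports a submodule only on first access. The sketch below shows the mechanism on a throwaway module object. make_lazy_package and demo_pkg are hypothetical names, and standard library modules stand in for the real submodules, so this is not how pywsd is currently structured.

```python
import importlib
import types


def make_lazy_package(name, submodules):
    """Build a module whose listed attributes are imported on first access.

    `submodules` maps an attribute name to the module path to import.
    """
    pkg = types.ModuleType(name)

    def __getattr__(attr):
        # Called only when normal attribute lookup on the module fails.
        if attr in submodules:
            module = importlib.import_module(submodules[attr])
            setattr(pkg, attr, module)  # cache so later lookups are direct
            return module
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    pkg.__getattr__ = __getattr__
    return pkg


# Stand-ins: 'similarity' and 'lesk' map to stdlib modules for the demo.
demo = make_lazy_package("demo_pkg", {"similarity": "colorsys", "lesk": "json"})

print("similarity" in vars(demo))  # nothing has been imported yet
demo.similarity                    # first access triggers the import
print("similarity" in vars(demo))  # now cached on the module
```

In a real package the same function would be defined at the top level of __init__.py, using importlib.import_module with a relative name such as "." + attr and the package's __name__.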

What do you think of this simple change (i.e. replacing pickle with json)? I'd be happy to work on it.