amittai / cynical

Cynical data selection
MIT License

Vocab ratios calculation in python script very slow #5

Closed jsedoc closed 3 years ago

jsedoc commented 4 years ago

The calculation of the vocab ratios (https://github.com/amittai/cynical/blob/master/python_cynical_wrapper.py#L206) is very slow in the Python wrapper script, and it does not appear to be necessary.

Potential fixes:

  1. remove the vocab ratios calculation
  2. call the Perl script (which seems to be significantly faster)
  3. refactor the Python code (and possibly add Cython)

amittai commented 4 years ago

pinging @Ssloto

Ssloto commented 4 years ago

You're totally right about this one, @jsedoc. I'm pretty sure that this needs to get axed. I'd be down to do option 3 or option 1 from your list. When would you need this by? I can probably get to it around the weekend.

Ssloto commented 4 years ago

I did a simple version of option 3 (refactoring the Python code) that includes a fair amount of major cleanup. There's a lot of stuff that really was... messed up in the version of the Python wrapper that was up before.

I got some random files off OPUS and ran them with the old & new versions w/ Python 3.8. As far as I can tell, new version output is sensible and it doesn't hang indefinitely on vocab ratios.
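For anyone curious what a fast vocab-ratio pass can look like, here is a minimal sketch. It is not the code from the wrapper or the Perl script. It assumes that a "vocab ratio" is the ratio of a word's smoothed relative frequency in the representative corpus to its relative frequency in the available corpus, and that tokens are whitespace-separated. The names `count_tokens`, `vocab_ratios`, and `smoothing` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def count_tokens(lines: Iterable[str]) -> Counter:
    """Count whitespace-separated tokens in a single pass over the lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def vocab_ratios(
    repr_lines: Iterable[str],
    avail_lines: Iterable[str],
    smoothing: float = 1.0,  # assumption: add-one smoothing avoids division by zero
) -> Dict[str, float]:
    """Return P_repr(word) / P_avail(word) for every word seen in either corpus."""
    repr_counts = count_tokens(repr_lines)
    avail_counts = count_tokens(avail_lines)
    vocab = repr_counts.keys() | avail_counts.keys()

    # Totals are computed once, outside the loop, so the whole pass stays linear.
    repr_total = sum(repr_counts.values()) + smoothing * len(vocab)
    avail_total = sum(avail_counts.values()) + smoothing * len(vocab)

    ratios: Dict[str, float] = {}
    for word in vocab:
        p_repr = (repr_counts[word] + smoothing) / repr_total
        p_avail = (avail_counts[word] + smoothing) / avail_total
        ratios[word] = p_repr / p_avail
    return ratios


example = vocab_ratios(["a b a"], ["a b b c"])
```

The point of the sketch is that each corpus is read once and each lookup is a dictionary access, so runtime grows linearly with corpus size instead of rescanning the text for every word.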

file sizes: repr = 57,441, avail = 73,505, seed = 79

timings: old = 1m24.733s, new = 0m2.120s

that's a hecking speed improvement.
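To put a number on that, here is a small sketch that converts the two timings quoted above into seconds and divides them. It assumes the timings are in the `XmY.YYYs` format printed by the shell `time` builtin; `parse_time` is a hypothetical helper, not part of the wrapper.

```python
import re


def parse_time(text: str) -> float:
    """Convert a string such as '1m24.733s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


old_seconds = parse_time("1m24.733s")  # old wrapper, from the timings above
new_seconds = parse_time("0m2.120s")   # refactored wrapper
speedup = old_seconds / new_seconds

print(f"{old_seconds:.3f}s -> {new_seconds:.3f}s, about {speedup:.1f}x faster")
```

On these inputs that works out to roughly a 40x speedup, for this one set of files on one machine.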

there are probably some other things I can do to de-horrify the wrapper, but I'd prefer just porting the Perl to Python in an efficient manner. Should be do-able.

Ssloto commented 4 years ago

Let me know if this resolves the issue for you, or if you find any other bugs!