fractaledmind opened this issue 9 years ago
Hey Stephen,
Thanks for your thoughts. There's still a lot left to do on this implementation of Whitaker's original work, specifically with building his dictionary the way that he does here. As is, open_words still has some issues with accuracy that need to be sorted out.
SQLite was also my inclination for increasing the performance of the module once we get there--though we could try other methods if you wanted to implement them. It seems like making the original dictionaries, stems, inflects, and uniques he used available in some format might also be helpful to others in the future. Would love your PR in the meantime.
Cheers and thanks again,
Luke
I would agree that SQLite is probably the best place to go for this type of data (frequently read/queried, and thoroughly static). It is insanely fast (especially compared to the filtering Python functions currently in the code) and super stable.
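To make that concrete, here's a minimal sketch using Python's built-in `sqlite3` module — the `words.db` file, the `inflects` table, and its columns are all hypothetical here, not the project's actual layout:

```python
import sqlite3

conn = sqlite3.connect("words.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS inflects (
           ending TEXT NOT NULL,
           pos    TEXT NOT NULL,
           form   TEXT NOT NULL
       )"""
)
# The index on `ending` is what turns each lookup into a fast
# B-tree search instead of a full-table scan.
conn.execute("CREATE INDEX IF NOT EXISTS idx_inflects_ending ON inflects (ending)")

def lookup(ending):
    cur = conn.execute(
        "SELECT pos, form FROM inflects WHERE ending = ?", (ending,)
    )
    return cur.fetchall()
```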
Unfortunately, I don't really know Ada, so it can be hard to follow Whitaker's source (I've read some bits a few times), but I'm willing to help this project along in any way I can (having this project on a modern codebase is reason enough). If you have any specific tasks that need doing that don't require thorough knowledge of the original codebase, send them my way and I'll get to them as and when I can.
@fractaledmind @lukehollis if either of you are still interested in doing this, you can check out my fork where I've done some things similar to what you described here: https://github.com/blagae/whitakers_words
Due to my changes, performance is up by at least an order of magnitude.
Hey guys,
This looks like it's shaping up into a great utility. I wanted to offer one quick piece of advice to improve memory usage and speed.
Since you are essentially doing look-ups against a static dataset for the parsing, and since that dataset is quite large, you really need to move to more performant data formats as well as data retrieval. For example, the `uniques.py` file is a list of dicts, so you iterate over the list and check against every item. If you are going to iterate over a sequence, especially one so large, you should use a stream (an iterable or iterator) to keep from loading the whole dataset into memory on each run. Then you can use a generator to do the lookups. This will keep memory usage low and, in turn, improve speed (it takes time to load the whole list from `uniques.py` into memory).
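As a rough sketch of that streaming idea — assuming, purely for illustration, that the uniques data were serialized one JSON record per line; the `uniques.jsonl` file name and the `"orth"` key are made up, not anything currently in the repo:

```python
import json

def iter_uniques(path="uniques.jsonl"):
    # Yield one record at a time instead of materializing the whole list.
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def find_unique(word):
    # Generator expression: touches one record at a time and
    # stops as soon as it finds a match.
    return next((entry for entry in iter_uniques() if entry["orth"] == word), None)
```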
Alternatively, you could store the data in a hashable format and avoid iteration altogether. You could have a dict where the keys are possible inflection endings and the values are lists of forms with that ending. Then you simply do a direct look-up on the ending when you get the word to parse as input. Or you could store everything in SQLite tables and use the C-speed engine and its strong query abilities to look up data in a much more performant manner.
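A minimal sketch of that dict-keyed approach — the sample entries and the `"ending"`/`"form"` keys are invented here just to show the shape:

```python
from collections import defaultdict

# Stand-in for the flat list currently iterated over; the real data
# would come from the module's inflection tables.
INFLECTS = [
    {"ending": "am", "form": "accusative singular"},
    {"ending": "as", "form": "accusative plural"},
    {"ending": "a",  "form": "nominative singular"},
]

# Build the index once, up front: ending -> all forms with that ending.
BY_ENDING = defaultdict(list)
for entry in INFLECTS:
    BY_ENDING[entry["ending"]].append(entry["form"])

def parse(word):
    # Probe the longest ending first; each probe is an O(1) dict lookup
    # instead of a scan over the whole dataset.
    for i in range(len(word)):
        ending = word[i:]
        if ending in BY_ENDING:
            return word[:i], BY_ENDING[ending]
    return word, []
```

For a word like `puellam`, this peels endings off until it hits `am`, returning the stem `puell` and the matching forms in constant time per probe.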
Unfortunately, I don't have the time right now to fork and implement this myself, but I am willing to help from the backend for now. I would love to see this working, and working FAST, and I'd love to help with the time I do have. Hopefully this makes sense; if not, feel free to ask about anything that doesn't.
Thanks for the great open-source project, stephen