WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License

Pickle rewrite - help needed #70

Closed: WojciechMula closed this issue 5 years ago

WojciechMula commented 6 years ago

Bugs related to pickling keep recurring and annoy everybody; sometimes a bug crashes the interpreter, which is completely unacceptable. I tried my best to track the problem(s) down, but I failed. Moreover, last year was tough for me (I was ill, then I bought and renovated a flat, and finally recent changes at my former company forced me to look for a new job), and as a result I couldn't spend much time on side projects.

This project is pretty popular, and it would be great if somebody helped with a pickling algorithm. IMO the best option is to trash the current one and start over.

pombredanne commented 6 years ago

@WojciechMula I do not have any pickling bugs on the Python 2.7 build of 1.1.4, and this is with thousands of users of scancode-toolkit on Linux, Windows and macOS.

> Moreover, last year was tough for me (I was ill, then I bought and renovated a flat, and finally recent changes at my former company forced me to look for a new job), and as a result I couldn't spend much time on side projects.

You owe no one anything, my friend! I hope your new job rocks!

That said, pickling is not a great protocol. I would be quite happy with a custom binary format and protocol that eschews pickling entirely and has a similar purpose and effect. So do not get stuck on pickling.
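To make the "custom binary format" idea concrete, here is a minimal sketch using only the standard `struct` module. It is not pyahocorasick code: the `ACTR` signature, the version field, and the `dump_words`/`load_words` names are all hypothetical, and it assumes every value is a 64-bit integer.

```python
import io
import struct

MAGIC = b"ACTR"                   # hypothetical 4-byte file signature
VERSION = 1                       # bump when the layout changes
HEADER = struct.Struct("<4sHI")   # magic, version, record count
RECORD = struct.Struct("<Iq")     # key length in bytes, signed 64-bit value


def dump_words(words, fp):
    """Write a {str: int} mapping to a binary file object."""
    fp.write(HEADER.pack(MAGIC, VERSION, len(words)))
    for key, value in words.items():
        raw = key.encode("utf-8")
        fp.write(RECORD.pack(len(raw), value))
        fp.write(raw)


def load_words(fp):
    """Read back a mapping written by dump_words, checking the header first."""
    magic, version, count = HEADER.unpack(fp.read(HEADER.size))
    if magic != MAGIC or version != VERSION:
        raise ValueError("not a recognised dump")
    words = {}
    for _ in range(count):
        length, value = RECORD.unpack(fp.read(RECORD.size))
        words[fp.read(length).decode("utf-8")] = value
    return words


# Round trip through an in-memory buffer.
original = {"he": 1, "she": 2, "hers": -3}
buf = io.BytesIO()
dump_words(original, buf)
buf.seek(0)
restored = load_words(buf)
```

The explicit signature and version let a loader reject a foreign or outdated file up front instead of misreading it.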

pombredanne commented 6 years ago

As an example, see https://github.com/RoaringBitmap/RoaringFormatSpec/, which is a format used to store compressed bitmaps in C, Java and Go.

pombredanne commented 6 years ago

Another example of the eventual complexity of pickling, here in pure Python for a trie structure: https://github.com/google/pygtrie/blob/master/pygtrie.py#L187 and https://github.com/google/pygtrie/blob/master/pygtrie.py#L261

pombredanne commented 6 years ago

So please, by all means, let go of pickle if this can make your life simpler!

WojciechMula commented 6 years ago

@pombredanne It's good to hear that you don't have any problems with pickling, but unfortunately there are some. I feel really uncomfortable that the module, which people like and use, doesn't work well and users waste precious time. Unless I resign, I am responsible for the module.

Speaking of the pickle format, I think we must stick to Python's machinery, as the module uses both internal C structures and Python objects (i.e. the values stored in the trie).
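A possible middle ground, sketched below under stated assumptions: keep an explicit layout for the keys and use pickle only for the stored values, one small blob per value. This is not how pyahocorasick serialises its automaton; `dump_items` and `load_items` are hypothetical names, and the record layout is made up for illustration.

```python
import io
import pickle
import struct

COUNT = struct.Struct("<I")    # number of records
LENGTHS = struct.Struct("<II")  # key length, pickled-value length (both in bytes)


def dump_items(items, fp):
    """Write (key, value) pairs as raw key bytes plus a pickled value blob."""
    items = list(items)
    fp.write(COUNT.pack(len(items)))
    for key, value in items:
        raw_key = key.encode("utf-8")
        blob = pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL)
        fp.write(LENGTHS.pack(len(raw_key), len(blob)))
        fp.write(raw_key)
        fp.write(blob)


def load_items(fp):
    """Yield (key, value) pairs written by dump_items."""
    (count,) = COUNT.unpack(fp.read(COUNT.size))
    for _ in range(count):
        key_len, blob_len = LENGTHS.unpack(fp.read(LENGTHS.size))
        key = fp.read(key_len).decode("utf-8")
        # pickle.loads must only ever be fed data from a trusted source.
        yield key, pickle.loads(fp.read(blob_len))


# Values can be any picklable object, not just integers.
items = [("he", ("pronoun", 1)), ("she", {"tags": ["pronoun"]}), ("hers", None)]
buf = io.BytesIO()
dump_items(items, buf)
buf.seek(0)
restored = list(load_items(buf))
```

With this split, the structural part of the file never passes through pickle, so a failure in pickling can only affect an individual value blob.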

pombredanne commented 6 years ago

@WojciechMula It's your call whether to keep using pickle... but that is not a feature, IMHO. The feature is reasonably fast writing and reading of an automaton to and from disk (and, thinking of it, using your own format would mean it could be memory-mapped in the future... yummy).
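A rough sketch of why a fixed-layout format lends itself to memory mapping, using the standard `mmap` and `struct` modules. The three-field node record is purely hypothetical and far simpler than the module's real C structures; the point is only that a node can be decoded in place without reading the whole file.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-size node: character code, fail-link index, value.
NODE = struct.Struct("<IiI")


def write_nodes(path, nodes):
    """Write every node as one fixed-size record."""
    with open(path, "wb") as fp:
        for node in nodes:
            fp.write(NODE.pack(*node))


def read_node(mm, index):
    """Decode node `index` directly from the mapped bytes."""
    return NODE.unpack_from(mm, index * NODE.size)


nodes = [(ord("h"), -1, 0), (ord("e"), 0, 7), (ord("r"), 0, 9)]
path = os.path.join(tempfile.mkdtemp(), "automaton.bin")
write_nodes(path, nodes)

with open(path, "rb") as fp:
    with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        second = read_node(mm, 1)      # only the pages touched are loaded
        total = len(mm) // NODE.size   # record count follows from the file size
```

Because every record has the same size, finding node `i` is a single multiplication, which is what makes random access into the mapping cheap.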