WojciechMula / pyahocorasick

Python module (C extension and plain Python) implementing the Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License

Use PEP393 unicode API, switching to UCS4 internally for py>=3.3 & extend range of trie letters #90

Closed frankier closed 5 years ago

frankier commented 5 years ago

My main aim in this PR was to extend the range of integers that can be stored in the trie to 32 bits. My understanding of PEP 393 is that a Python Unicode object can now use different internal representations (1, 2, or 4 bytes per code point) depending on its contents. I have used this as an opportunity to change the internal representation to UCS-4. Depending on whether a given Python Unicode string is already stored as UCS-4, we may now have to create a copy, and most of this patch is bookkeeping for the different memory and reference management cases that result. Finally, the check is changed to depend on the trie letter size. This means that as long as we're using Python >= 3.3, we can guarantee that the automaton can store 32-bit integers, which is enough, for example, to store the word vocabulary of a natural language. This is related to the use case given in https://github.com/WojciechMula/pyahocorasick/pull/89
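For readers unfamiliar with PEP 393, here is a rough Python-level illustration of the idea. It is only a sketch: the actual patch is written in C, and the helper names `char_width` and `to_ucs4` are hypothetical and not part of pyahocorasick. It shows how the storage width of a string follows from its largest code point, and how a string can be copied into fixed 32-bit letters.

```python
import struct


def char_width(s: str) -> int:
    """Bytes per code point that PEP 393 implies for `s` (1, 2, or 4).

    Hypothetical helper: it derives the width from the largest code
    point instead of inspecting CPython's internal object layout.
    """
    top = max(map(ord, s), default=0)
    if top < 0x100:
        return 1  # Latin-1 range, one byte per code point
    if top < 0x10000:
        return 2  # Basic Multilingual Plane, two bytes (UCS-2)
    return 4      # anything above needs four bytes (UCS-4)


def to_ucs4(s: str) -> tuple:
    """Copy `s` into a sequence of 32-bit letters.

    Hypothetical helper: a 1- or 2-byte string has to be widened
    (copied) before it can be handled as UCS-4.
    """
    raw = s.encode("utf-32-le")  # 4 bytes per code point, no BOM
    return struct.unpack(f"<{len(s)}I", raw)


if __name__ == "__main__":
    for word in ("abc", "żółw", "a😀"):
        print(word, char_width(word), to_ucs4(word))
```

Only the last of the three sample strings is already 4 bytes wide. The other two would need a widened copy, which is the case the bookkeeping in the patch has to handle.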

frankier commented 5 years ago

Okay, this gets rid of a lot of the macros. I agree simpler is better.

WojciechMula commented 5 years ago

@frankier Thanks for the changes and documentation update. I have just one more comment and we're ready to merge. :)

frankier commented 5 years ago

Hi. Did you forget to post the comment? :smiley_cat:

WojciechMula commented 5 years ago

@frankier I seldom use GitHub review tools. I was sure that my comment was visible... sorry for that.

WojciechMula commented 5 years ago

@frankier now I ping you :)

frankier commented 5 years ago

Yep, looks like you're right. Fixed.

WojciechMula commented 5 years ago

@frankier Thank you very much, merged! Now we can move forward with your other changes.