frankier closed this 5 years ago
Okay, this gets rid of a lot of the macros. I agree simpler is better.
@frankier Thanks for the changes and documentation update. I have just one more comment and we're ready to merge. :)
Hi. Did you forget to post the comment? :smiley_cat:
@frankier I seldom use GitHub review tools. I was sure that my comment was visible... sorry for that.
@frankier now I ping you :)
Yep, looks like you're right. Fixed.
@frankier Thank you very much, merged! Now we can move forward with your other changes.
In making this PR, my main aim was to extend the range of ints which can be stored in the trie to 32 bits. My understanding of PEP 393 is that a Python Unicode object may now use any of several internal representations (1, 2 or 4 bytes per character), chosen per string. I have used this as an opportunity to change the trie's internal representation to UCS-4. Depending on whether the Python Unicode string we get our hands on is already UCS-4 or not, we may now have to create a copy, and most of this patch is bookkeeping to deal with the different memory and reference management cases this results in. Finally, the check is changed to depend on the trie letter size. This means that as long as we're using Python >= 3.3 we can guarantee we're able to store 32-bit integers in our automaton, enough e.g. to store the word vocabulary of a natural language. This is related to the use case given in https://github.com/WojciechMula/pyahocorasick/pull/89
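To make the copy/no-copy distinction concrete, here is a rough pure-Python model of what PEP 393 does. It is only a sketch, not the code in this patch: the helper names `storage_kind`, `needs_copy` and `as_ucs4` are made up for illustration, and the real work happens in C (where `PyUnicode_KIND` reports the width and `PyUnicode_AsUCS4Copy` produces a UCS-4 copy).

```python
import struct


def storage_kind(s: str) -> int:
    """Bytes per character that CPython >= 3.3 picks for s under PEP 393.

    Hypothetical helper: it mirrors the rule that the width is chosen
    from the highest code point in the string.
    """
    highest = max(map(ord, s), default=0)
    if highest < 0x100:
        return 1  # Latin-1 range
    if highest < 0x10000:
        return 2  # UCS-2 range
    return 4      # full UCS-4 range


def needs_copy(s: str) -> bool:
    """True if s is not already stored 4 bytes wide (assumed UCS-4 trie)."""
    return storage_kind(s) != 4


def as_ucs4(s: str) -> bytes:
    """Widen every code point to a little-endian 32-bit unit."""
    return struct.pack(f"<{len(s)}I", *map(ord, s))


if __name__ == "__main__":
    for word in ("abc", "\u0141\u00f3d\u017a", "\U0001F600"):
        print(repr(word), storage_kind(word), needs_copy(word), len(as_ucs4(word)))
```

Only strings that already contain a code point above U+FFFF are stored 4 bytes wide, so in this model most ordinary keys would need the widening copy, which is where the extra memory and reference bookkeeping comes from.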