WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License

Support typed array as an input when storing sequences #113

Open pombredanne opened 5 years ago

pombredanne commented 5 years ago

The STORE_SEQUENCE feature introduced with #27 is great, but it only accepts tuples rather than more general sequences of integers. In particular, support for array.array types would be great, since arrays store integers much more compactly than tuples:

>>> from pympler.asizeof import asizeof as s
>>> from array import array
>>> t=tuple(xrange(15000))
>>> a=array('h', xrange(15000))
>>> s(a)
31024
>>> s(t)
480056

This is because an array is limited to a single element type, so you can store shorts, longs, or floats in the most appropriate fixed-size type. This request is therefore for an enhancement to allow two things:

  1. use an array as a sequence type
  2. honor the type of the array, e.g. store shorts/longs/doubles exactly and not as something else (such as 32 bits on Py 3 or 16 bits on Py 2, as it is now).
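
To illustrate what "the type of the array" means in point 2, here is a minimal sketch using only the standard library. It does not touch pyahocorasick; it only shows that an array carries a typecode and a fixed per-item size that a storage layer could read and preserve. Item sizes are platform-dependent, so the sketch computes them instead of assuming values.

```python
from array import array

# Each typecode fixes the C type, and therefore the per-item size, of the array.
shorts = array('h', range(15000))   # C signed short
doubles = array('d', [0.5, 1.5])    # C double

# The typecode and itemsize are available at runtime, so code receiving
# an array can discover the exact element type it should honor.
shorts_info = (shorts.typecode, shorts.itemsize)
doubles_info = (doubles.typecode, doubles.itemsize)

# Size of the raw payload in bytes (excluding the small object header).
payload_bytes = shorts.itemsize * len(shorts)
```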

Note that the two could be implemented separately. For instance, you could specify the integer sequence type at construction time, use it for every stored sequence, and accept various sequence types when adding "words". Alternatively, we could just add support for typed arrays and take the integer type from the array itself.
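
As a rough sketch of how the two options might fit together, the helper below normalizes any input sequence to an array. The name `as_typed_array` and its `typecode` parameter are hypothetical and purely illustrative; nothing like this exists in pyahocorasick today. A typecode chosen up front is applied to every sequence, and when none is given the typecode is taken from the array that was passed in.

```python
from array import array

def as_typed_array(values, typecode=None):
    """Hypothetical helper, not part of pyahocorasick.

    Returns `values` as an array.array. If `typecode` is given, the result
    uses it; otherwise the typecode of an input array is kept as-is.
    """
    if isinstance(values, array):
        if typecode is None or values.typecode == typecode:
            return values                      # honor the array's own type
        return array(typecode, values.tolist())  # convert to the requested type
    if typecode is None:
        # A plain tuple or list carries no element type of its own.
        raise TypeError("a typecode is required for non-array sequences")
    return array(typecode, values)

from_tuple = as_typed_array((1, 2, 3), 'h')
kept = as_typed_array(array('d', [1.0, 2.0]))
converted = as_typed_array(array('h', [1, 2]), 'i')
```

Requiring an explicit typecode for tuples and lists is one possible design choice here; a default typecode set at construction time would work equally well.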