WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License
952 stars 125 forks source link

Plans for future 3.0.0 branch #100

Open WojciechMula opened 5 years ago

WojciechMula commented 5 years ago

Few things I consider for the next, incompatible version (in random order):

  1. Remove value_type from Automaton constructor (i.e. STORE_ANY, STORE_INTS, STORE_LENGTH). Using STORE_{INT,LENGTH} makes sense at C-language level, but in Python isn't good. I mean, even if we save bare integers underneath, they have to be converted back to Python object. Removing this might also improve memory usage. To be honest, STORE_LENGTH is stupid, I suspect it was added because "I could do it".
  2. Rework wildcard matching. Use syntax from glob module, i.e. always the '?' marks "any character". It will require a bit of parsing, but the current solution is silly and unintuitive.
  3. Support any iterator-like object in search. We have just gained support for tuples, which is great, but more generic code -- even if not specialised and fast -- seems be useful. (There was a question about mmaped files, for instance).
  4. Change also a little memory layout: when node has one child store pointer directly, rather allocating subarray of size 1.
  5. Remove kind "EMPTY" --- it's a technical detail; an automaton should have got always the "root" node, that makes adding and removing words a bit easier.
  6. Rewrote pickling mechanism. Now plain C structures by the module are mapped almost directly, which is not a good idea; the future format should be tailored to pickling needs, and also more validation has to be done put there.
WojciechMula commented 5 years ago

In the end I want to clean up my two pet peeves: 1) replace tabs with spaces and 2) move C files to subdirectory src.

WojciechMula commented 5 years ago

@pombredanne Philippe, do you have any comments, ideas? Is there anything which is now annoying and can be fixed.

frankier commented 5 years ago

So I never rebased it because I was not really happy with it as is, but this patch could be made nicer if based against a pyahocorasick with structure outlined below

https://github.com/frankier/pyahocorasick/commit/433ca1be93e94b07c10dc8b5b05442d6c13a36fc

So my request would be that the C module is moved to _ahocorasick and then everything re-exported from ahocorasick so that "porcelain" like that patch can be implemented in Python.

pombredanne commented 5 years ago

@WojciechMula sorry for the late reply!... here are a few ideas, in no specific order:

pombredanne commented 5 years ago

@frankier re

So my request would be that the C module is moved to _ahocorasick and then everything re-exported from ahocorasick so that "porcelain" like that patch can be implemented in Python.

Your patch is in Python... so I am not sure I get the requirement. Can you elaborate?

frankier commented 5 years ago

So the idea is to make it so Python and non Python stuff can be imported from ahocorasick together. This is typically done by naming the C module _mymodule and re-exporting everything from a Python module mymodule. You can see use of this pattern in the standard library here: https://github.com/python/cpython/blob/master/Lib/pickle.py#L39

WojciechMula commented 5 years ago

TBH I would be happy to remove plain-python module. It hasn't been upgraded for ages and seems alternatives are more mature.

pombredanne commented 5 years ago

@WojciechMula re:

TBH I would be happy to remove plain-python module. It hasn't been upgraded for ages and seems alternatives are more mature.

I am fine too. What you could do it move it to tests or a gist and explain this was used as an early prototype or not do anything at all and just remove it. It is in the history in anycase.

pombredanne commented 2 years ago

Some notes:

  1. Remove value_type from Automaton constructor (i.e. STORE_ANY, STORE_INTS, STORELENGTH). Using STORE{INT,LENGTH} makes sense at C-language level, but in Python isn't good. I mean, even if we save bare integers underneath, they have to be converted back to Python object. Removing this might also improve memory usage. To be honest, STORE_LENGTH is stupid, I suspect it was added because "I could do it".

I guess there are two ways I think of this:

  1. only store integers. Actual mapping to a Python object is done elsewhere... It could be done in Python.
  2. only store arbitrary pickable object.

As an example of some of the things I store see https://github.com/nexB/scancode-toolkit/blob/d1e725d3603a8f96c25f7e3f7595c68999b92a67/src/licensedcode/match_aho.py#L52 Here I store some int id for some object that represents what was matched plus some extra start and end positions. This could be a simple integers alright and we could have a separate plain Python mapping that maps the integer to an actual object to return. This could simplify the code?

  1. Rework wildcard matching. Use syntax from glob module, i.e. always the '?' marks "any character". It will require a bit of parsing, but the current solution is silly and unintuitive.

Frankly I would drop entirely wildcard matching as this is both a rare feature of such an automaton and I am not sure anyone is using this. I could see the value to have a regex engine leveraging the automaton to speed up matching (say more or less like esmre or may be parts of hyperscan) but I cannot how I would use the wildcard matching

melsabagh commented 2 years ago
  1. Rework wildcard matching. Use syntax from glob module, i.e. always the '?' marks "any character". It will require a bit of parsing, but the current solution is silly and unintuitive.

I can give this a shot. Glob-style pattern matching should be possible if you slightly adjust the stack-based iterator in AutomatonItemsIter.c. I will work on a PR that supports the ? (match any single char) and * (match anything) wildcard patterns (perhaps also supports escaping them). I have extensively used similar techniques in a couple of projects to query large Tries with wildcard patterns and the performance often was significantly better than using regexes or Esmre. It is also more use friendly since glob patterns are much simpler and more intuitive than regex, especially when the queries tend to contain many special characters. Glob patterns also force the caller to "keep it simple".

pombredanne commented 2 years ago

@melsabagh-kw this looks like a great idea! I love it. Note that AFAIK, a glob can be converted to a regex (at least Python can translate https://docs.python.org/3/library/fnmatch.html#fnmatch.translate ) but this is a much simpler syntax than full regex and therefore better IMHO as you pointed out!

melsabagh commented 2 years ago

I submitted a PR at https://github.com/WojciechMula/pyahocorasick/pull/171

pombredanne commented 1 year ago

I renamed this to target a future v3.0. I am working on a v2.0 today and this is not as ambitious as what we have here.