Open WojciechMula opened 5 years ago
In the end I want to clean up my two pet peeves: 1) replace tabs with spaces and 2) move C files to subdirectory src.
@pombredanne Philippe, do you have any comments, ideas? Is there anything which is now annoying and can be fixed.
So I never rebased it because I was not really happy with it as is, but this patch could be made nicer if based against a pyahocorasick with structure outlined below
https://github.com/frankier/pyahocorasick/commit/433ca1be93e94b07c10dc8b5b05442d6c13a36fc
So my request would be that the C module is moved to _ahocorasick and then everything re-exported from ahocorasick so that "porcelain" like that patch can be implemented in Python.
@WojciechMula sorry for the late reply!... here are a few ideas, in no specific order:
@frankier re
So my request would be that the C module is moved to _ahocorasick and then everything re-exported from ahocorasick so that "porcelain" like that patch can be implemented in Python.
Your patch is in Python... so I am not sure I get the requirement. Can you elaborate?
So the idea is to make it so Python and non Python stuff can be imported from ahocorasick together. This is typically done by naming the C module _mymodule
and re-exporting everything from a Python module mymodule
. You can see use of this pattern in the standard library here: https://github.com/python/cpython/blob/master/Lib/pickle.py#L39
TBH I would be happy to remove plain-python module. It hasn't been upgraded for ages and seems alternatives are more mature.
@WojciechMula re:
TBH I would be happy to remove plain-python module. It hasn't been upgraded for ages and seems alternatives are more mature.
I am fine too. What you could do it move it to tests or a gist and explain this was used as an early prototype or not do anything at all and just remove it. It is in the history in anycase.
Some notes:
- Remove value_type from Automaton constructor (i.e. STORE_ANY, STORE_INTS, STORELENGTH). Using STORE{INT,LENGTH} makes sense at C-language level, but in Python isn't good. I mean, even if we save bare integers underneath, they have to be converted back to Python object. Removing this might also improve memory usage. To be honest, STORE_LENGTH is stupid, I suspect it was added because "I could do it".
I guess there are two ways I think of this:
As an example of some of the things I store see https://github.com/nexB/scancode-toolkit/blob/d1e725d3603a8f96c25f7e3f7595c68999b92a67/src/licensedcode/match_aho.py#L52 Here I store some int id for some object that represents what was matched plus some extra start and end positions. This could be a simple integers alright and we could have a separate plain Python mapping that maps the integer to an actual object to return. This could simplify the code?
- Rework wildcard matching. Use syntax from glob module, i.e. always the '?' marks "any character". It will require a bit of parsing, but the current solution is silly and unintuitive.
Frankly I would drop entirely wildcard matching as this is both a rare feature of such an automaton and I am not sure anyone is using this. I could see the value to have a regex engine leveraging the automaton to speed up matching (say more or less like esmre or may be parts of hyperscan) but I cannot how I would use the wildcard matching
- Rework wildcard matching. Use syntax from glob module, i.e. always the '?' marks "any character". It will require a bit of parsing, but the current solution is silly and unintuitive.
I can give this a shot. Glob-style pattern matching should be possible if you slightly adjust the stack-based iterator in AutomatonItemsIter.c
. I will work on a PR that supports the ?
(match any single char) and *
(match anything) wildcard patterns (perhaps also supports escaping them). I have extensively used similar techniques in a couple of projects to query large Tries with wildcard patterns and the performance often was significantly better than using regexes or Esmre. It is also more use friendly since glob patterns are much simpler and more intuitive than regex, especially when the queries tend to contain many special characters. Glob patterns also force the caller to "keep it simple".
@melsabagh-kw this looks like a great idea! I love it. Note that AFAIK, a glob can be converted to a regex (at least Python can translate https://docs.python.org/3/library/fnmatch.html#fnmatch.translate ) but this is a much simpler syntax than full regex and therefore better IMHO as you pointed out!
I submitted a PR at https://github.com/WojciechMula/pyahocorasick/pull/171
I renamed this to target a future v3.0. I am working on a v2.0 today and this is not as ambitious as what we have here.
Few things I consider for the next, incompatible version (in random order):
STORE_ANY
,STORE_INTS
,STORE_LENGTH
). UsingSTORE_{INT,LENGTH}
makes sense at C-language level, but in Python isn't good. I mean, even if we save bare integers underneath, they have to be converted back to Python object. Removing this might also improve memory usage. To be honest,STORE_LENGTH
is stupid, I suspect it was added because "I could do it".glob
module, i.e. always the '?' marks "any character". It will require a bit of parsing, but the current solution is silly and unintuitive.