add option to omit removing n-grams with space

Laubeee commented 4 months ago

I'd much prefer an option to keep spaces than having to replace my spaces with a special character that is otherwise not used.

If you do not want to add this change, please at least consider properly documenting this behaviour of removing n-grams with spaces in them, I just spent way too much time figuring out what was going on.

I also removed the use_knn=False in the example, since this option does not exist (anymore?)

MaartenGr commented 4 months ago

If you do not want to add this change, please at least consider properly documenting this behaviour of removing n-grams with spaces in them, I just spent way too much time figuring out what was going on.

I understand the frustration but let's take a step back first. Could you show what issue it is exactly that you are facing? Why is it important for your use case to have this change? That way, it is easier for me to understand why this change might be necessary or if there are other ways to support it.

Generally, these kinds of changes are first discussed in an open issue but it's alright for me to discuss it further here.

Laubeee commented 4 months ago

I think the basic use-case is there:

I'd much prefer an option to keep spaces than having to replace my spaces with a special character that is otherwise not used.

but i can certainly provide some more details. I am matching some input from a scanned document against a list of materials, that can be rather cryptic. I had somewhat unexpected behaviours but I finally figured something was definitely off when "PE" matched with "PE 1" with 100% as "first match" when there was actually also "PE" in the list, which made me go to look at the code and find out. This could be addressed by adding 1-grams, but I don't think it's a good solution.

The main issue is certainly that this behaviour is not documented. Once I know what is going on it's not a big thing to simply pre-process the input accordingly, i.e. replace spaces (or any whitespace, really) with some special character ("_" is a good candidate, but is already used so I might use "#" or whatever). But I don't think this is a good solution either, it feels much more like a workaround than a configuration. If word order matters, the n-grams with space simply should not be ignored... Also, given that there is already an option present that does some pre-processing (clean_string), I think adding a flag making this behaviour optional would be more consistent with the API while also automatically documenting the default behaviour.

Side note: not having this option would make it complicated to work-around to "enable" spaces when clean_string=True ... you'd have to fall back to doing the clean_string operation yourself.

Lastly: I think the _create_ngrams function is the most elegant place to have any kind of text preprocessing. You configure it once and don't have to care everywhere you use the thing to apply your own preprocessing first. I am currently also thinking about adding ^ before and $ after every text, inspired by the <start> and <end> tokens some systems use. With this, there would not be need to use 2-grams (which i currently have to since some entries are just 2 characters long), and the whole thing becomes more "robust" against falsely matching longer strings that contain such a short sequence. That said, it might be worth having a preprocessing-callback that defaults to the clean_string operation but can be changed to anything that is desired, instead of adding ever more preprocessing options. BUT: the spacing problem would not be solved with that hook. In a similar fashion one could also pass a string of ignore-characters, that default to space, and every n-gram that contains any of the passed characters is ignored. To disable, one can then just pass an empty string. This would provide the best API IMO, but obv. the downside is that it requires breaking changes.

MaartenGr commented 4 months ago

Thank you for sharing the extensive description. I can imagine wanting to keep the spaces in certain use cases as they might relate to the words/characters that you want to compare. Seeing as this is a relatively low-impact change, I'll go ahead and merge this. For any other suggestions, it would be best to open up a new issue.

MaartenGr / PolyFuzz

add option to omit removing n-grams with space #76