fergiemcdowall / search-index

A persistent, network resilient, full text search library for the browser and Node.js
MIT License

string values consisting only of special chars break tokenization chain #577

Closed jakobsa closed 2 years ago

jakobsa commented 2 years ago

Sorry to bother you again. I have come across an issue that I can pinpoint quite precisely.

The problem can be reproduced with this edited example:

https://replit.com/@jakobsa1/search-index-QUERY-2#index.js

When a string value that consists only of special characters is tokenized, the regex within the SPLIT stage returns null and causes the chain to throw an error. I have mitigated this with a custom stage that runs after SPLIT and simply recreates the tokens array whenever it is null:

.then(([tokens, field, ops]) => [
  tokens ?? [],
  field,
  ops
])
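
To illustrate the failure mode, here is a minimal sketch in plain JavaScript. It does not use search-index itself: `splitStage` and `guardStage` are hypothetical stand-ins for the SPLIT stage and the custom stage above, written synchronously for brevity, and the pattern is the default quoted later in this thread. It only shows that `String.prototype.match` with a global regex returns null rather than an empty array when nothing matches, and that the `??` guard restores an array.

```javascript
// Default token pattern as quoted in this thread.
const tokenPattern = /[\p{L}\d]+/gu

// Hypothetical stand-in for the SPLIT stage (an assumption, not the library's code):
// match() returns null, not [], when the value contains no letters or digits.
const splitStage = ([value, field, ops]) => [value.match(tokenPattern), field, ops]

// Hypothetical stand-in for the custom stage: replace a null token list with [].
const guardStage = ([tokens, field, ops]) => [tokens ?? [], field, ops]

const unguarded = splitStage(['***', 'title', {}])
const guarded = guardStage(splitStage(['***', 'title', {}]))
const ordinary = guardStage(splitStage(['hello world', 'title', {}]))

console.log(unguarded[0]) // null
console.log(guarded[0]) // []
console.log(ordinary[0]) // [ 'hello', 'world' ]
```

Any later stage that calls an array method on the tokens would throw on the null value, which is why the guard has to run directly after SPLIT.
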
fergiemcdowall commented 2 years ago

Thanks for the bug report @jakobsa - I am still on paternity leave, so not much time at the computer, but will try to take a look at this ASAP

fergiemcdowall commented 2 years ago

@jakobsa I can fix the crash by updating to search-index@3.0.3 here

However, I guess that you want to be able to search for '*' and there seems to be a bug there. I will try to roll out a fix.

fergiemcdowall commented 2 years ago

@jakobsa you can now index and search for asterisks ('*') in search-index@3.1.0 using tokenSplitRegex. The default is /[\p{L}\d]+/gu; to preserve asterisks, you would change it to /[\p{L}\d*]+/gu.
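
The difference between the two patterns can be checked in isolation with `String.prototype.match`. This is a sketch of the regex behaviour only: the sample text and variable names are made up for illustration, and how tokenSplitRegex is passed to search-index is not shown here.

```javascript
// The two patterns quoted above.
const defaultRegex = /[\p{L}\d]+/gu
const asteriskRegex = /[\p{L}\d*]+/gu

// Made-up sample input for illustration.
const sample = 'rated ***** by 2 users'

const defaultTokens = sample.match(defaultRegex)
const asteriskTokens = sample.match(asteriskRegex)

console.log(defaultTokens) // [ 'rated', 'by', '2', 'users' ]
console.log(asteriskTokens) // [ 'rated', '*****', 'by', '2', 'users' ]

// A value made up only of asterisks yields no match at all with the default.
console.log('*'.match(defaultRegex)) // null
console.log('*'.match(asteriskRegex)) // [ '*' ]
```

Inside a character class the asterisk is a literal character, so it needs no escaping. Other special characters can be added to the class in the same way.
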

jakobsa commented 2 years ago

That is a great new feature. In my case I did not need to find the * field itself, but the need might arise for other fields that contain only special characters. Thanks a lot!