Allow to override tokenizer separator

xoofx commented 4 years ago

Followup of the issue #3

Even If I'm able to workaround this by stripping punctuation before adding to document, it makes the fields barely usable after they have been squashed.

For example, in my case, I'm serializing 3 fields: url, title, content. For the url, I would prefer to be able to keep the real url (e.g /doc/api/general) but instead, I have do remove punctuation so that it becomes doc api general and the tokenizer is happy (it won't tokenize /)

In lunr, there is a way to override the tokenizer separator, though statically which is not great, but I believe this could be done per builder instance in LunrCore.

In my case, I could use a more appropriate regex separator like [^\w]+

bleroy commented 4 years ago

Your link points to the separator regular expression, which matches on whitespace and hyphens. Not sure what this has to do with slashes. As far as I can tell, the slashes are hard-coded into builder.js.

bleroy commented 4 years ago

It's always possible to make that pluggable of course, but if you do, you'd break compatibility with lunr.js. That may be fine in your scenario, let me know.

xoofx commented 4 years ago

Your link points to the separator regular expression, which matches on whitespace and hyphens. Not sure what this has to do with slashes. As far as I can tell, the slashes are hard-coded into builder.js.

Hm, ok, maybe I misunderstood, but the tokenizer is supposed to build tokens but if / is not a separator, it will start to consider that /doc/api/general is a single token/word, hence why I was seeing /doc/api/general in the inverted index which is unusable.

xoofx commented 4 years ago

It's always possible to make that pluggable of course, but if you do, you'd break compatibility with lunr.js. That may be fine in your scenario, let me know.

Precisely, it is pluggable in lunr, so I don't think it would break anything here

bleroy commented 4 years ago

I don't believe it is. We can make the whitespace and hyphens pluggable like in lunr (although in their case it's more that JavaScript lets you do anything than an extensibility mechanism), but I don't think that will help with slashes.

bleroy commented 4 years ago

Oooh, you want to add slashes to the list of separators? Then each token in the path is a separate token? I see. Yeah, we can do that.

xoofx commented 4 years ago

Ok, let's take another example without slashes. If I have a string with this(is,a,text), it will assume that this(is,a,text) is an entire word. There is no slash, but you have ( and ,

xoofx commented 4 years ago

I don't believe it is. We can make the whitespace and hyphens pluggable like in lunr (although in their case it's more that JavaScript lets you do anything than an extensibility mechanism), but I don't think that will help with slashes.

The documentation of the separator states that it is overrideable.

/* The separator used to split a string into tokens. Override this property to change the behaviour of
 * `lunr.tokenizer` behaviour when tokenizing strings. By default this splits on whitespace and hyphens.
*/

Oooh, you want to add slashes to the list of separators? Then each token in the path is a separate token? I see. Yeah, we can do that.

Yeah, this is why I'm using the regex [^\w]+ to match anything that is a non-word. This is what I use before passing my strings to LunrCore currently to avoid the issues with non-word characters being indexed.

bleroy commented 4 years ago

The documentation of the separator states that it is overrideable.

Oh I know. My remark is more agreeing with you that exposing a static property is not a proper extensibility mechanism.

xoofx commented 4 years ago

Oh I know. My remark is more agreeing with you that exposing a static property is not a proper extensibility mechanism.

Yeah, that's why I suggested to add it through the Builder instance instead 😉

xoofx commented 4 years ago

The reason why it is important to be able to override separators is that then after, If I want to extract an excerpt of the original content where the match occurs, I would loose all the punctuations while I would like to preserve them. So my workaround of stripping everything before indexing is "okish" to go through LunrCore, but definitely not great when you want to have correct positions back to your original text.

xoofx commented 4 years ago

I can do a PR if you want (my evening is coming, I have some sparetime to kill 😉 )

xoofx commented 4 years ago

Ok, I just made a small PR #10 so that you can get a quick glimpse of the kind of changes

bleroy / lunr-core

Allow to override tokenizer separator #9