Closed xoofx closed 4 years ago
Your link points to the separator regular expression, which matches on whitespace and hyphens. Not sure what this has to do with slashes. As far as I can tell, the slashes are hard-coded into builder.js.
It's always possible to make that pluggable of course, but if you do, you'd break compatibility with lunr.js. That may be fine in your scenario, let me know.
> Your link points to the separator regular expression, which matches on whitespace and hyphens. Not sure what this has to do with slashes. As far as I can tell, the slashes are hard-coded into builder.js.
Hm, ok, maybe I misunderstood. The tokenizer is supposed to build tokens, but if `/` is not a separator, it treats `/doc/api/general` as a single token/word. That is why I was seeing `/doc/api/general` in the inverted index, where it is unusable.
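To make the symptom concrete, here is a minimal sketch in plain JavaScript. It does not call lunr.js or LunrCore; `splitTokens` is a hypothetical stand-in for the tokenizer, and the regex is assumed to be equivalent to the default "whitespace and hyphens" separator discussed in this thread.

```javascript
// Assumed equivalent of the default separator: runs of whitespace or hyphens.
const defaultSeparator = /[\s\-]+/;

// Hypothetical stand-in for a tokenizer: lowercase, split, drop empty strings.
function splitTokens(text, separator) {
  return text.toLowerCase().split(separator).filter((t) => t.length > 0);
}

const tokens = splitTokens(
  "See /doc/api/general for the full-text API",
  defaultSeparator
);

// The path survives as one token because "/" is not a separator.
console.log(tokens);
// [ 'see', '/doc/api/general', 'for', 'the', 'full', 'text', 'api' ]
```

A query for `api` or `general` would only match the last token here, not the path.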
> It's always possible to make that pluggable of course, but if you do, you'd break compatibility with lunr.js. That may be fine in your scenario, let me know.
Precisely, it is pluggable in lunr, so I don't think it would break anything here
I don't believe it is. We can make the whitespace and hyphens pluggable like in lunr (although in their case it's more that JavaScript lets you do anything than an extensibility mechanism), but I don't think that will help with slashes.
Oooh, you want to add slashes to the list of separators? Then each token in the path is a separate token? I see. Yeah, we can do that.
Ok, let's take another example without slashes. If I have the string `this(is,a,text)`, it will assume that `this(is,a,text)` is an entire word. There is no slash, but you have `(` and `,`.
> I don't believe it is. We can make the whitespace and hyphens pluggable like in lunr (although in their case it's more that JavaScript lets you do anything than an extensibility mechanism), but I don't think that will help with slashes.
The documentation of the separator states that it is overrideable.
/* The separator used to split a string into tokens. Override this property to change the behaviour of
* `lunr.tokenizer` behaviour when tokenizing strings. By default this splits on whitespace and hyphens.
*/
> Oooh, you want to add slashes to the list of separators? Then each token in the path is a separate token? I see. Yeah, we can do that.
Yeah, this is why I'm using the regex `[^\w]+` to match any run of non-word characters. This is what I currently apply to my strings before passing them to LunrCore, to avoid non-word characters being indexed.
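For illustration, a small sketch of what splitting on `[^\w]+` does to the two strings from this thread. It is plain JavaScript, and `splitOnNonWord` is a hypothetical helper, not part of lunr.js or LunrCore.

```javascript
// Separator proposed in this thread: any run of non-word characters.
const nonWordSeparator = /[^\w]+/;

// Hypothetical helper: lowercase, split, drop the empty strings that
// appear when the text starts or ends with a separator.
function splitOnNonWord(text) {
  return text
    .toLowerCase()
    .split(nonWordSeparator)
    .filter((t) => t.length > 0);
}

const pathTokens = splitOnNonWord("/doc/api/general");
const callTokens = splitOnNonWord("this(is,a,text)");

console.log(pathTokens); // [ 'doc', 'api', 'general' ]
console.log(callTokens); // [ 'this', 'is', 'a', 'text' ]
```

One caveat: in JavaScript `\w` only covers `[A-Za-z0-9_]`, so accented letters would also act as separators in this sketch. .NET's `\w` is Unicode-aware by default, so the same pattern may behave differently there.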
> The documentation of the separator states that it is overrideable.
Oh I know. My remark is more agreeing with you that exposing a static property is not a proper extensibility mechanism.
> Oh I know. My remark is more agreeing with you that exposing a static property is not a proper extensibility mechanism.
Yeah, that's why I suggested to add it through the Builder instance instead 😉
The reason it is important to be able to override the separators is that afterwards, if I want to extract an excerpt of the original content where the match occurs, I would lose all the punctuation, while I would like to preserve it. So my workaround of stripping everything before indexing is "okish" for getting through LunrCore, but definitely not great when you want correct positions back into your original text.
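A rough sketch of why this matters: if the tokenizer itself splits on the separator, each token can carry its offset into the untouched original text, so an excerpt keeps its punctuation. This is plain JavaScript; `tokenizeWithPositions` and the `{ text, start, length }` shape are assumptions for illustration, not the lunr.js or LunrCore API.

```javascript
// Hypothetical tokenizer that records where each token sits in the original text.
function tokenizeWithPositions(text, separator) {
  // matchAll needs a global regex, so build a global copy of the separator.
  const flags = separator.flags.includes("g")
    ? separator.flags
    : separator.flags + "g";
  const re = new RegExp(separator.source, flags);

  const tokens = [];
  let start = 0;
  for (const m of text.matchAll(re)) {
    if (m.index > start) {
      tokens.push({ text: text.slice(start, m.index), start, length: m.index - start });
    }
    start = m.index + m[0].length; // next token begins after the separator run
  }
  if (start < text.length) {
    tokens.push({ text: text.slice(start), start, length: text.length - start });
  }
  return tokens;
}

const original = "See /doc/api/general, please.";
const positioned = tokenizeWithPositions(original, /[^\w]+/);

// Offsets point into the original string, punctuation included.
const hit = positioned.find((t) => t.text === "general");
console.log(hit); // { text: 'general', start: 13, length: 7 }
console.log(original.slice(hit.start - 9, hit.start + hit.length + 1));
// "/doc/api/general,"
```

With the strip-before-indexing workaround, these offsets would refer to the squashed string instead, which is the problem described above.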
I can do a PR if you want (my evening is coming, I have some spare time to kill 😉 )
Ok, I just made a small PR #10 so that you can get a quick glimpse of the kind of changes
Follow-up to issue #3.
Even if I'm able to work around this by stripping punctuation before adding to the document, it makes the fields barely usable after they have been squashed.
For example, in my case, I'm serializing 3 fields: `url`, `title`, `content`. For the url, I would prefer to be able to keep the real url (e.g. `/doc/api/general`) but instead, I have to remove punctuation so that it becomes `doc api general` and the tokenizer is happy (it won't tokenize on `/`).
In lunr, there is a way to override the tokenizer separator, though statically, which is not great, but I believe this could be done per builder instance in LunrCore.
In my case, I could use a more appropriate regex separator like `[^\w]+`.
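As a sketch of the per-instance idea (as opposed to a static property shared by everything), the toy class below takes the separator as a constructor argument. `SketchBuilder` and its `invertedIndex` map are hypothetical names invented for this example. It is plain JavaScript rather than C#, and it says nothing about how LunrCore is actually structured.

```javascript
// Hypothetical builder: each instance owns its separator instead of
// reading it from a shared static property.
class SketchBuilder {
  constructor(separator = /[\s\-]+/) {
    this.separator = separator;
    this.invertedIndex = new Map(); // token -> Set of document refs
  }

  add(ref, text) {
    for (const token of text.toLowerCase().split(this.separator)) {
      if (token.length === 0) continue; // skip empties from leading separators
      if (!this.invertedIndex.has(token)) {
        this.invertedIndex.set(token, new Set());
      }
      this.invertedIndex.get(token).add(ref);
    }
  }
}

// Two builders with different separators can coexist.
const defaultBuilder = new SketchBuilder();
const urlBuilder = new SketchBuilder(/[^\w]+/);

defaultBuilder.add("1", "/doc/api/general");
urlBuilder.add("1", "/doc/api/general");

console.log([...defaultBuilder.invertedIndex.keys()]); // [ '/doc/api/general' ]
console.log([...urlBuilder.invertedIndex.keys()]);     // [ 'doc', 'api', 'general' ]
```

The default stays compatible with the whitespace-and-hyphen behaviour, and only callers who opt in get different tokens.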