bleroy / lunr-core

A port of LUNR.js to .NET Core
MIT License
558 stars 24 forks source link

Error: Out of order word insertion. #2

Closed xoofx closed 4 years ago

xoofx commented 4 years ago

Hey,

So I have been trying to use the index construction from C# and use it with latest Lunr 2.3.8. I get the index serialized to JSON but when I try to initialize lunr.js with the data, I get the following error:

Error: Out of order word insertion.

The error is not particularly helpful, but modifying lunr and I got a bit more details:

Error: Out of order word insertion. The word `0` must be < `_` but is not

The inverted index in lunr must be ordered, and mine is starting like this:

    "invertedIndex": [
        [
            "_",
            {
                "_index": 417,
                "id": {},
                "title": {},
                "body": {
                    "/doc/user/syntax/": {}
                }
            }
        ],
        [
            "0",
            {
                "_index": 457,
                "id": {},
                "title": {},
                "body": {
                    "/doc/user/syntax/": {}
                }
            }
        ],

From a string point of view, it is correct, but from a JavaScript point of view, the mess is that "0" < "_" returns true... so I believe that lunr-core here is maybe indexing stuffs that it should not (the number 0 or even _)

Sorry If I can't put the json file here as it's still a private project that I can't release yet 😉

Let me know if you need more details!

xoofx commented 4 years ago

Also I can see that in the indexing words that contains / (e.g "doc/advanc") or ( (e.g "bin(out",) are matched, which seems wrong (maybe that's also done in lunr, but at least this can be fixed more easily here)

bleroy commented 4 years ago

Thanks, I'll fix those. Can you open a second issue?

bleroy commented 4 years ago

Indexing numbers (and terms with them, or with underscores) actually makes sense to me, I can see how that can be useful for some types of documents, so I'm not going to ignore them. Instead, I'll reproduce the quirky comparison behavior of JS when serializing.

bleroy commented 4 years ago

Turns out it's not ECMA that's weird in this instance, it's .NET. I think ordinal comparison should fix it. Verifying now.

xoofx commented 4 years ago

Indexing numbers (and terms with them, or with underscores) actually makes sense to me, I can see how that can be useful for some types of documents, so I'm not going to ignore them. Instead, I'll reproduce the quirky comparison behavior of JS when serializing.

Indeed. Good idea to fix it via the comparison.

It should handle Javascript numbers at large as they are parsed implicitly also via string (so 0.1 or 1e10 as well) but also underscore in numbers.

bleroy commented 4 years ago

AAAKCHUALY I checked the EcmaScript spec and .NET behavior to get to the bottom of this. The problem is not coercion of strings to numbers. JS does no such thing when comparing two strings. The problem is strictly that the default comparison in .NET is not ordinal like in JS (check if on string is a prefix of the other, then detect where they differ, then compare the UTF16 codes of the chars at that position). Note that from a char code PoV, '0'<'_', but .NET by default does a comparison that is not based on char codes but on linguistic rules of the current culture. So not only was the behavior wrong, but it was non-deterministic based on the thread's culture. Definitely a bug independently of lunr.js compatibility considerations. I'm sending a PR now with ordinal comparison, and the behavior is now consistent.

bleroy commented 4 years ago

The fix is also packaged as https://www.nuget.org/packages/LunrCore/2.3.8.1