
NGRAM_MATCH behavior does not appear to match documentation and makes it less useful for substring search #20118

Open. MrCreosote opened this issue 11 months ago

MrCreosote commented 11 months ago

My Environment

Component, Query & Data

Affected feature: ArangoSearch query using web interface

AQL query (if applicable): Query 1:

RETURN NGRAM_MATCH("flaviobacter", "obact", 1, "ngram3")

Query 2:

RETURN NGRAM_MATCH("xxobaxxxbacxxxactxxx", "obact", 1, "ngram3")

AQL explain and/or profile (if applicable): N/A

Dataset: N/A

Size of your Dataset on disk: N/A

Replication Factor & Number of Shards (Cluster only): RF: 3 Shards: N/A

Steps to reproduce

  1. Add the following analyzer to the DB (a sketch of the trigrams this pipeline emits follows these steps):

        {
          "name": "test::ngram3",
          "type": "pipeline",
          "features": [
            "frequency",
            "position"
          ],
          "properties": {
            "pipeline": [
              {
                "type": "norm",
                "properties": {
                  "locale": "en",
                  "case": "lower",
                  "accent": false
                }
              },
              {
                "type": "ngram",
                "properties": {
                  "min": 3,
                  "max": 3,
                  "preserveOriginal": false,
                  "streamType": "utf8",
                  "startMarker": "",
                  "endMarker": ""
                }
              }
            ]
          }
        }
  2. Execute query 1 and note that it returns true
  3. Execute query 2 and note that it also returns true
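
For reference, here is a minimal sketch (in Python, approximating the pipeline above rather than using ArangoDB's actual analyzer code) of the trigrams the test::ngram3 analyzer produces for the strings in the queries:

    # Rough approximation of the test::ngram3 pipeline: lowercase, then
    # fixed-size trigrams with no preserveOriginal and no start/end markers.
    def trigrams(text: str) -> list[str]:
        text = text.lower()
        return [text[i:i + 3] for i in range(len(text) - 2)]

    print(trigrams("obact"))
    # ['oba', 'bac', 'act']
    print(trigrams("flaviobacter"))
    # ['fla', 'lav', 'avi', 'vio', 'iob', 'oba', 'bac', 'act', 'cte', 'ter']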

Problem: The documentation describes the scoring of the ngram match as:

The similarity is calculated by counting how long the longest sequence of matching n-grams is, divided by the target’s total n-gram count. Only fully matching n-grams are counted. [Emphasis mine]

Based on that description I would expect query 1 to match (3 trigrams in the query, a sequence of 3 matching trigrams in the target for a score of 1) and query 2 to fail (3 trigrams in the query, the longest matching sequence is 1 trigram in the target for a score of 0.33), but both match.

I was hoping to use NGRAM_MATCH as a fast (and it's definitely really fast) substring query but given the current behavior that's not going to work, unless I'm doing something wrong.

Expected result: Query 2's result should be false if I understand the documentation correctly

Thanks for any advice you can give me if there's something incorrect with my setup

MBkkt commented 10 months ago

Hello.

query 2 to fail (3 trigrams in the query, the longest matching sequence is 1 trigram in the target for a score of 0.33)

That's not correct: the longest matching sequence is also 3 here, in xxobaxxxbacxxxactxxx

MBkkt commented 10 months ago

In general you can picture it like this:

"obact" -> target ngram array [oba bac act]
"xxobaxxxbacxxxactxxx" -> ngram array [xxo xob oba bax ....]

And we just find the LCS (longest common subsequence); the score is the LCS length divided by the target ngram array length.
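
A minimal sketch of that scoring model (this mirrors the description above, not the actual implementation inside ArangoSearch):

    # Model of NGRAM_MATCH(attribute, target, threshold, analyzer) scoring as
    # described above: LCS of the two trigram arrays / target trigram count.
    def trigrams(text: str) -> list[str]:
        text = text.lower()
        return [text[i:i + 3] for i in range(len(text) - 2)]

    def lcs_length(a: list[str], b: list[str]) -> int:
        # Classic dynamic-programming longest common subsequence;
        # matching n-grams must be in order but need not be contiguous.
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, x in enumerate(a):
            for j, y in enumerate(b):
                dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
        return dp[len(a)][len(b)]

    def similarity(attribute: str, target: str) -> float:
        target_grams = trigrams(target)
        return lcs_length(trigrams(attribute), target_grams) / len(target_grams)

    print(similarity("flaviobacter", "obact"))          # 1.0 -> query 1 matches
    print(similarity("xxobaxxxbacxxxactxxx", "obact"))  # 1.0 -> query 2 matches too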

MBkkt commented 10 months ago

If you have long tokens (which get split into ngrams), it would probably be good to increase the ngram size.

MrCreosote commented 10 months ago

Thanks for clearing up my understanding.

That's not correct: the longest matching sequence is also 3 here, in xxobaxxxbacxxxactxxx

Hmm, that's not what I would think of as a sequence. I would think they'd need to be contiguous. I understand that things are working as intended though, so at most we're talking about a terminology difference.

That being said - is there a way to use arangosearch to find an arbitrary length substring in a field without scanning every document value for the field? There's LIKE, which the documentation says is backed by indexes, but for a search like %foo% to use an index you'd need a suffix tree or something like that, not just a regular inverted index (right?). My naive assumption is that inverted indexes will only accelerate LIKE searches where the left side of the query is anchored, like foo%.

MBkkt commented 10 months ago

There's LIKE, which the documentation says is backed by indexes, but for a search like %foo% to use an index you'd need a suffix tree or something like that, not just a regular inverted index (right?). My naive assumption is that inverted indexes will only accelerate LIKE searches where the left side of the query is anchored, like foo%.

Yep, you're completely right, but we plan to add a wildcard analyzer in 3.12. That helps with leading-wildcard queries too (internally it uses ngrams plus post-filtering)
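
To illustrate the general idea (a hedged sketch of ngram prefiltering plus post-filtering, not the actual wildcard analyzer internals; the names here are illustrative):

    # Find documents matching an infix pattern like %obact% by first using the
    # indexed trigrams as a cheap prefilter, then verifying the real substring.
    def trigrams(text: str) -> set[str]:
        text = text.lower()
        return {text[i:i + 3] for i in range(len(text) - 2)}

    def like_infix(docs: list[str], needle: str) -> list[str]:
        needle_grams = trigrams(needle)
        # 1) prefilter: in a real system this would be inverted-index lookups
        candidates = [d for d in docs if needle_grams <= trigrams(d)]
        # 2) post filter: drop false positives whose trigrams match out of order
        return [d for d in candidates if needle.lower() in d.lower()]

    docs = ["flaviobacter", "xxobaxxxbacxxxactxxx", "unrelated"]
    print(like_infix(docs, "obact"))  # ['flaviobacter']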

MBkkt commented 10 months ago

As another option, you can try making the ngrams a different size and searching for your substring with an exact term search. But that can make the inverted index big, of course.
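
For example, a rough sketch of that idea, assuming the ngram size is chosen to equal the length of the substrings you search for (names and data are illustrative):

    # If the analyzer emits 5-grams, a 5-character substring is literally one of
    # the indexed terms, so an exact term lookup finds it with no post filter.
    def ngrams(text: str, n: int) -> set[str]:
        text = text.lower()
        return {text[i:i + n] for i in range(len(text) - n + 1)}

    docs = ["flaviobacter", "xxobaxxxbacxxxactxxx"]
    index: dict[str, set[int]] = {}          # toy inverted index: term -> doc ids
    for doc_id, doc in enumerate(docs):
        for gram in ngrams(doc, 5):
            index.setdefault(gram, set()).add(doc_id)

    print(index.get("obact", set()))  # {0} -> only "flaviobacter" contains "obact"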

MrCreosote commented 10 months ago

we plan to add wildcard analyzer in 3.12

Oh great, looking forward to 3.12 then. Hurry hurry hurry!