Open MrCreosote opened 11 months ago
Hello.
query 2 to fail (3 trigrams in the query, the longest matching sequence is 1 trigram in the target for a score of 0.33)
It's not correct, longest matching sequence is also 3 here xxobaxxxbacxxxactxxx
In general you can imagine it like this:
"obact" -> target ngram array [oba bac act]
"xxobaxxxbacxxxactxxx" -> ngram array [xxo xob oba bax ....]
We then just find the LCS (longest common subsequence) of the two arrays, and the score is the LCS length divided by the target array length.
If you have long tokens (which are split into ngrams), it would probably be good to increase the ngram size.
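The scoring described in this comment can be sketched in a few lines of Python. This is only an illustration of the idea as explained here, not ArangoDB's actual implementation; the function names (`ngrams`, `lcs_length`, `ngram_score`) are made up for the example.

```python
def ngrams(text, n=3):
    """Return the n-grams of text as an ordered list."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b.

    A subsequence keeps the order of elements but allows gaps.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def ngram_score(target, attribute_value, n=3):
    """LCS length divided by the number of ngrams in the target (hypothetical helper)."""
    target_grams = ngrams(target, n)
    if not target_grams:
        return 0.0
    return lcs_length(target_grams, ngrams(attribute_value, n)) / len(target_grams)


print(ngrams("obact"))                               # ['oba', 'bac', 'act']
print(ngram_score("obact", "xxobaxxxbacxxxactxxx"))  # 1.0
```

Because gaps are allowed, the three trigrams still count as a matching sequence of length 3 even though they are spread out across the value.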
Thanks for clearing up my understanding.
It's not correct, longest matching sequence is also 3 here xxobaxxxbacxxxactxxx
Hmm, that's not what I would think of as a sequence. I would think they'd need to be contiguous. I understand that things are working as intended though, so at most we're talking about a terminology difference.
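For comparison, here is a rough Python sketch of the contiguous reading, where the matching ngrams have to be adjacent in the stored value. It is not how NGRAM_MATCH behaves according to the explanation above; the helper names are hypothetical and only illustrate the terminology difference.

```python
def ngrams(text, n=3):
    """Return the n-grams of text as an ordered list."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def longest_common_run(a, b):
    """Length of the longest run of elements that is contiguous in both a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the run that ended at the previous element of both sequences.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def contiguous_score(query, value, n=3):
    """Longest contiguous ngram run divided by the number of query ngrams."""
    query_grams = ngrams(query, n)
    if not query_grams:
        return 0.0
    return longest_common_run(query_grams, ngrams(value, n)) / len(query_grams)


print(contiguous_score("obact", "xxobactxx"))             # 1.0
print(contiguous_score("obact", "xxobaxxxbacxxxactxxx"))  # one trigram out of three
```

Under this reading the second value scores 1/3, which is the 0.33 expected in the original report.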
That being said - is there a way to use arangosearch to find an arbitrary length substring in a field without scanning every document value for the field? There's LIKE, which the documentation says is backed by indexes, but for a search like %foo% to use an index you'd need a suffix tree or something like that, not just a regular inverted index (right?). My naive assumption is that inverted indexes will only accelerate LIKE searches where the left side of the query is anchored, like foo%.
There's LIKE, which the documentation says is backed by indexes, but for a search like %foo% to use an index you'd need a suffix tree or something like that, not just a regular inverted index (right?). My naive assumption is that inverted indexes will only accelerate LIKE searches where the left side of the query is anchored, like foo%.
Yep, you're completely right, but we plan to add wildcard analyzer in 3.12. That will help with leading wildcard queries too (internally it uses ngrams and post-filtering).
As another option, you can try creating ngrams of a different size and searching for your substring with an exact term search. But that can make the inverted index big, of course.
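To make the "ngram and post filtering" idea concrete, here is a toy Python sketch. It assumes a plain in-memory inverted index from ngram to document ids and says nothing about how ArangoDB implements this internally; `build_index` and `substring_search` are hypothetical names.

```python
from collections import defaultdict


def ngrams(text, n):
    """Return the set of n-grams that occur in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(docs, n=3):
    """Map each n-gram to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in ngrams(text, n):
            index[gram].add(doc_id)
    return index


def substring_search(docs, index, query, n=3):
    """Find documents containing query as a substring.

    Step 1: intersect the posting lists of all query n-grams (candidates).
    Step 2: post-filter candidates with a real substring check.
    """
    grams = ngrams(query, n)
    if not grams:
        # Query shorter than n: this index cannot help, so scan everything.
        candidates = set(docs)
    else:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    return sorted(d for d in candidates if query in docs[d])


docs = {1: "xxobactxx", 2: "xxobaxxxbacxxxactxxx", 3: "nothing here"}
index = build_index(docs)
print(substring_search(docs, index, "obact"))  # [1]
print(substring_search(docs, index, "oba"))    # [1, 2]
```

When the query is exactly n characters long, step 1 is a single exact term lookup, which is the second option mentioned above. Covering more query lengths that way means indexing more ngram sizes, and that is where the index growth comes from.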
we plan to add wildcard analyzer in 3.12
Oh great, looking forward to 3.12 then. Hurry hurry hurry!
My Environment
arangodb:3.11
Docker image from Dockerhub
Component, Query & Data
Affected feature: ArangoSearch query using web interface
AQL query (if applicable): Query 1:
Query 2:
AQL explain and/or profile (if applicable): N/A
Dataset: N/A
Size of your Dataset on disk: N/A
Replication Factor & Number of Shards (Cluster only): RF: 3 Shards: N/A
Steps to reproduce
Query 1: true result
Query 2: true result
Problem: The documentation describes the scoring of the ngram match as:
Based on that description I would expect query 1 to match (3 trigrams in the query, a sequence of 3 matching trigrams in the target for a score of 1) and query 2 to fail (3 trigrams in the query, the longest matching sequence is 1 trigram in the target for a score of 0.33), but both match.
I was hoping to use NGRAM_MATCH as a fast (and it's definitely really fast) substring query but given the current behavior that's not going to work, unless I'm doing something wrong.
Expected result: Query 2's result should be false if I understand the documentation correctly
Thanks for any advice you can give me if there's something incorrect with my setup