SDKits / ExamineX

Issue tracker for ExamineX
https://examinex.online
5 stars 0 forks source link

Examinex escape incorrectly value for -1 #100

Open bielu opened 5 months ago

bielu commented 5 months ago

Hi @Shazwazza , When working on project where examinex was requested by client, I found similar bug to https://github.com/Novicell/Novicell.Examine.ElasticSearch/issues/17 When passing my field value as -1, it is the lucene query retrieve by examinex: +(+xIndexType:content) +xPublished:y -hideFromSearch:1 +spath:1 which cause incorrect results as my spath has indexed -1! In my elastic search provider i resolved it by escaping negative ints, not sure if that is correct approach for this however.

Edit: Forgot mentioned versions: Umbraco 13.2.2 Examinex 6.0.0-beta.2

Shazwazza commented 5 months ago

The - is a reserved character in Lucene so the lucene parser will remove it unless you escape it for searching with prefixed backslashes.

Can you please advise on what field type you are searching on and what query you are using to search along with what data is in that field? I need this to try to replicate.

There are several unit tests in ExamineX for testing paths with -1 so need to know specifically what and how you are searching.

Thanks!

bielu commented 5 months ago

It is example what I do with query: image and there is field type: image and there is how it is index: image it is interesting that it works with normal examine on lucene and generate correctly -1 😂 but as soon it comes here it doesnt

Shazwazza commented 5 months ago

So spath is being indexed as a normal full text field type with a standard analyzer, so all stop words and punctuation, etc... are stripped out - is this what you are expecting?

Similarly, because the field type is full text and the analyzer is standard for that field, when it parses the query, it will strip off reserved words and punctuation (like the minus symbol). If you want to retain the minus symbol in your search query then you would need to escape the value, like request.ListingId.ToString().Escape()

But I don't know what the value for ListingId is, can you please specify? Is it always a specific number or is it a path value, etc... ? Perhaps standard analyzer isn't what you should be using on this field, it really depends on what this field is.

bielu commented 5 months ago

Listing id is always int, so values would be spath can be index as keyword, but it still not explain why switching to azure search makes difference between how query looks 🤔 i mean here, based on docs, the indexing and querying should be 1:1 with lucene examine. So i would expect both to return same lucene query :)

Shazwazza commented 5 months ago

I am assuming that the ToString() for your query, regardless of whether ExamineX is enabled or not will result in this:

+(+x__IndexType:content) +x__Published:y -hideFromSearch:1 +spath:1 ?

With the - character stripped out? It is the same query parser used in both so I would expect the query to be generated the same way.

I can help in investigating why this would work with Lucene as opposed to Azure Search if I can replicate.

Based on what your requirements are, I would suggest trying to escape the value passed to the query (i.e. request.ListingId.ToString().Escape() and see if that works in both engines, else I would think whitespace analyzer would work well. Another alternative, would be to use a keyword analyzer (i.e. Raw examine field) which is what the normal Umbraco Path field is indexed as and then use path queries that are escaped (that is ... if this field is actually like an Umbraco path).

bielu commented 5 months ago

@Shazwazza yes I didnt change anything around the code of this, when installing examinex, but I will need play around little mroe to figure out little more or change analyzer :)

Shazwazza commented 5 months ago

What is the exact value that is being sent to the index for spath? I see that it ends up as "-1 -20 1250" in the Azure Search index, but what is the value being sent there? Is it "-1,-20,1250"?

bielu commented 5 months ago

the exact submited value is "-1 -20 1250", what I need check if it is just a string or collection as I was sure I am submitting collection 🤔

Shazwazza commented 5 months ago

I put a couple of rudimentary tests together and the results are the same between Examine and ExamineX with Azure Search:

Examine Test:

[Test]
public void Can_Search_On_Negative_Numbers_With_Standard_Analyzer()
{
    var analyzer = new StandardAnalyzer(LuceneInfo.CurrentVersion);
    using (var luceneDir = new RandomIdRAMDirectory())
    using (var indexer = GetTestIndex(luceneDir, analyzer))
    {
        indexer.IndexItems(new[] {
            ValueSet.FromObject(1.ToString(), "content",
                new { spath = "-1 -20 1250" }),
            ValueSet.FromObject(2.ToString(), "content",
                new { spath = "-1 -20 1251" }),
            ValueSet.FromObject(3.ToString(), "content",
                new { spath = "-1 -20 1252" }),
            ValueSet.FromObject(4.ToString(), "content",
                new { spath = "-1 -20 1253" }),
        });

        var searcher = (BaseLuceneSearcher)indexer.Searcher;

        var query = searcher
            .CreateQuery(IndexTypes.Content)
            .Field("spath", "-1");

        Console.WriteLine(query);
        var results = query.Execute();

        Assert.AreEqual(4, results.TotalItemCount);
    }
}

ExamineX Test:

[Test]
public void Can_Search_On_Negative_Numbers_With_Standard_Analyzer()
{
    using (AzureIndex.ProcessNonAsync())
    {
        AzureIndex.IndexItems(new[] {
            ValueSet.FromObject(NewId(), "content",
                new { spath = "-1 -20 1250" }),
            ValueSet.FromObject(NewId(), "content",
                new { spath = "-1 -20 1251" }),
            ValueSet.FromObject(NewId(), "content",
                new { spath = "-1 -20 1252" }),
            ValueSet.FromObject(NewId(), "content",
                new { spath = "-1 -20 1253" }),
        });
    }

    var s = GetDelayedSearcher();

    var query = s
        .CreateQuery(IndexTypes.Content)
        .Field("spath", "-1");

    Console.WriteLine(query);
    var results = query.Execute();

    Assert.AreEqual(4, results.TotalItemCount);
}

In both cases, the query is parsed exactly the same: { Category: content, LuceneQuery: +spath:1 }

which is expected because the field is using StandardAnalyzer in both scenarios.

I'm not sure if this helps at all, but might help show what is different between these tests vs how your query/index is put together.

bielu commented 4 months ago

I debugged it and it is weird one, but found workaround so for now. It seems like for some reason it works when you index values as string, but not when you index them as collection 🤔

Shazwazza commented 4 months ago

Ok thanks, I will try to replicate indexing the values that way.