chriseldredge / Lucene.Net.Linq

LINQ provider to run native queries on a Lucene.Net index
Other
151 stars 66 forks source link

Incorrect results when querying by text containing a space #30

Closed mcintyre321 closed 10 years ago

mcintyre321 commented 11 years ago

I've been trying to do some var users = q.Where(u => u.Name == "Firstname Lastname"); queries, but am getting incorrect results.

I can replicate the issue by adding another test to CustomWhereTests.cs

        [Test]
        public void WhereUsingExpression()
        {
            AddDocument(new SampleDocument { Name = "Documents Bill", Id = "X.Y.1.2" });
            AddDocument(new SampleDocument { Name = "Bills Document", Id = "X.Z.1.3" });

            var documents = provider.AsQueryable<SampleDocument>();

            var result = documents.Where(d => d.Name == "Bills Document");

            //Fails, no results found
            Assert.That(result.Single().Name, Is.EqualTo("Bills Document"));

        }

I think this is because the query term ("Firstname Lastname") gets pulled into two parts by the . Apparently a TermQuery is the correct thing to generate. Is there some way to hint my model to do this? Maybe a QueryType.ExactMatch is required?

I'm also a second issue, getting extra results when querying on a q.Where(u => u.Name.StartsWith("Oli") || u.Name.Contains("Oli"). I get results coming back with names that don't have "Oli" in them, but if I search for "Oliv" it returns the correct results.

Is there something I am missing? thanks again in advance.

chriseldredge commented 11 years ago

I'm not sure what to do about this yet. The problem is that Lucene.Net's QueryParser is tokenizing the spaces. The query runs through QueryParser in ReflectionFieldMapper:187

This test fails as well, and doesn't use Lucene.Net.Linq at all:

    [Test]
    public void Wat()
    {
        var queryParser = new QueryParser(Version.LUCENE_30, "Name", new KeywordAnalyzer());

        Assert.That(queryParser.Parse("Bills Document").ToString(), Is.EqualTo("Name:Bills Document"));
    }

However, if you do:

        Assert.That(queryParser.Parse("\"Bills Document\"").ToString(), Is.EqualTo("Name:Bills Document"));

It works.

Perhaps Lucene.Net.Linq should not use the QueryParser at all in certain cases, such as when IndexMode is NotAnalyzed or when the analyzer does not tokenize text. The latter may be difficult to detect, however.

mcintyre321 commented 11 years ago

Would it make sense for the quotes to be automatically added to the string in the case of an == expression?

sent from my mobile On 28 Jun 2013 22:09, "Chris Eldredge" notifications@github.com wrote:

I'm not sure what to do about this yet. The problem is that Lucene.Net's QueryParser is tokenizing the spaces. The query runs through QueryParser in ReflectionFieldMapper:187https://github.com/themotleyfool/Lucene.Net.Linq/blob/master/source/Lucene.Net.Linq/Mapping/ReflectionFieldMapper.cs#L187

This test fails as well, and doesn't use Lucene.Net.Linq at all:

[Test]
public void Wat()
{
    var queryParser = new QueryParser(Version.LUCENE_30, "Name", new KeywordAnalyzer());

    Assert.That(queryParser.Parse("Bills Document").ToString(), Is.EqualTo("Name:Bills Document"));
}

However, if you do:

    Assert.That(queryParser.Parse("\"Bills Document\"").ToString(), Is.EqualTo("Name:Bills Document"));

It works.

Perhaps Lucene.Net.Linq should not use the QueryParser at all in certain cases, such as when IndexMode is NotAnalyzed or when the analyzer does not tokenize text. The latter may be difficult to detect, however.

— Reply to this email directly or view it on GitHubhttps://github.com/themotleyfool/Lucene.Net.Linq/issues/30#issuecomment-20214534 .

mcintyre321 commented 11 years ago

From http://lucene.apache.org/core/2_9_4/queryparsersyntax.html:

If you are programmatically generating a query string and then parsing it with the query parser then you should seriously consider building your queries directly with the query API. In other words, the query parser is designed for human-entered text, not for program-generated text.

Maybe there should be some extension method like .Search<T>(this IQueryable<T> q, ...) which uses the query parser, but the standard LINQ operators would use TermQuery objects?

mcintyre321 commented 11 years ago

I've managed to fix this on my fork by doing quoted searches for == Expressions, but the commits are entwined with my other pull request, and I don't have he git skills to submit them as a standalone pull request unfortunately. You might want to cherry pick them from my fork...

Here's a link to the commit https://github.com/mcintyre321/Lucene.Net.Linq/commit/ac05466cc1ca34fa1fe2058c8379c368177fcd1c

chriseldredge commented 11 years ago

Thanks. I have an idea for a better design but need to experiment a bit to see if I can get it to work. If not, your approach of quoting the pattern may also work. If I don't get to this today, it'll have to wait a while since I'm on vacation next week.

chriseldredge commented 11 years ago

efb4a92d86a4f60ece8797eac8d804e688031465 has a solution to this, but it isn't very elegant. I think it's good enough for now, but as more edge cases crop up the current design will probably limit options. I'll close this issue after releasing to nuget.org.

mcintyre321 commented 11 years ago

I've got some concerns (or a least a lack of knowledge) about how strings get parsed

e.g. var user = users.Where(u.Name == "\"Harry\"").Single(); //will this retuen users with the name "Harry" or "\"Harry\", using the out-of-the-box analyzer

As a library consumer, I would expect that by default, queries on ==, !=, Contains etc. would behave like a L2O query, without doing any processing of the special characters. Is that the case.

The reason I ask is that I've released my fork to a client so I want to make sure the behaviour is maintained if I switch back to the main code.

chriseldredge commented 11 years ago

The meaning of == is dependent on how the field is analyzed and indexed.

The default behavior for == and != is to escape the right-hand value (rhv) and pass it through QueryParser.Parse to build a Query. This is necessary in cases where the Analyzer would modify the text, such as by converting it to lower case or stemming (such as PorterStemFilter).

If you insert a document with Name = "Harry" and analyze the Name field with a lowercase analyzer, then query for Where(u.Name == "Harry"), without analysis you would get zero results because the rhv in the query also needs to be analyzed.

Using == and != in these cases is admittedly a leaky abstraction, but there are several escape hatches that give the client more control over how a query will be constructed. If the client wants exact match only, use a KeywordAnalyzer or another Analyzer type that does not tokenize on whitespace. If the client wishes to build Query instances, they can do so by using the Where(Query) extension method. This would give the ability to control phrase slop and other settings that are not (yet) exposed through the LINQ interface.

mcintyre321 commented 11 years ago

So I've been using the project to replace some slow L2O search code against an in memory collection. I now have an in-memory lucene index against that collection which is working very nicely. From what you are saying, KeywordAnalyzer should get me the results I am after, which is great.

My gut feeling as a library consumer is that, I expect the code to behave like L2O as much as possible, and I would add code to my classes and queries to access some of the more search-enginey-luceney features. Is that the design philosophy for the library (as opposed to the opposite)?

If so, does it make sense to have the KeywordAnalyzer be the default (rather than the IMO poorly named StandardAnalyzer), which would caulk the leaks in the abstraction a little?

By the way, I must say thanks again for the library, it's helped me fix a massive problem that had been looming over me for ages!

chriseldredge commented 11 years ago

LuceneDataProvider:87 uses KeywordAnalyzer by default.

In general L2O is the reference implementation for LINQ, but even the SQL providers differ greatly in capabilities and behavior. For example, L2O string equality is case sensitive and SQL is not.

chriseldredge commented 10 years ago

Released in v3.2.52 and published on NuGet.org.