apache / lucenenet

Apache Lucene.NET
https://lucenenet.apache.org/
Apache License 2.0

Lucene.net sort returns doc with NULL fields. Invalidating sort #785

Closed CrommVardek closed 1 year ago

CrommVardek commented 1 year ago

Using Lucene.Net 4.8.0-beta00016, the indexer indexes objects with a Document field used specifically for sorting, like this:

new TextField("lastname-sort", person.LastName.RemoveDiacritics(), Field.Store.NO)

However, searching with a sort on new SortField("lastname-sort", SortFieldType.STRING, false) returns some ScoreDocs (the first ones) whose fields are null, which seems odd to me.

I looked at the docs (https://lucenenet.apache.org/docs/4.8.0-beta00008/api/Lucene.Net/Lucene.Net.Search.Sort.html?q=sort), but they are not up to date: they rely on new Field and Field.Index.NOT_ANALYZED, both of which are deprecated.

Note that hits whose fields are not null are sorted correctly. It is always the same Documents that come back with null fields, so whenever they are part of the result, the sorted result is invalid.

Note: all fields are stored (Field.Store.YES) except the two used for sorting.

CrommVardek commented 1 year ago

The null fields were caused by the analyzer used for both indexing and searching:

private readonly StandardAnalyzer _standardAnalyzer = new(AppLuceneVersion);

This means some last names (Will, To, etc.) were treated as stop words. The analyzer removed them, so the field ended up with no value for those documents, and the sort ordered them as if the field were empty (null).

Modifying to private readonly StandardAnalyzer _standardAnalyzer = new(AppLuceneVersion, CharArraySet.EMPTY_SET); solved the issue.
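To make the stop-word effect concrete, here is a minimal sketch that lists the terms each analyzer would emit for a last name. The class name StopWordDemo and the helper Tokens are hypothetical, added only for illustration. AppLuceneVersion is assumed to be LuceneVersion.LUCENE_48, and CharArraySet.EMPTY_SET is the constant name used in this post for 4.8.0-beta00016 (check the name in the release you use).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Hypothetical demo class, not part of the original post.
public static class StopWordDemo
{
    // Assumption: the application targets the 4.8 compatibility version.
    private const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;

    // Returns the terms the analyzer would index for the given text.
    public static List<string> Tokens(Analyzer analyzer, string text)
    {
        var terms = new List<string>();
        using (TokenStream ts = analyzer.GetTokenStream("lastname-sort", new StringReader(text)))
        {
            ICharTermAttribute term = ts.AddAttribute<ICharTermAttribute>();
            ts.Reset();
            while (ts.IncrementToken())
            {
                terms.Add(term.ToString());
            }
            ts.End();
        }
        return terms;
    }

    public static void Main()
    {
        // Default constructor: English stop words are removed.
        using (var withStopWords = new StandardAnalyzer(AppLuceneVersion))
        // Explicit empty set: nothing is removed.
        using (var withoutStopWords = new StandardAnalyzer(AppLuceneVersion, CharArraySet.EMPTY_SET))
        {
            foreach (string name in new[] { "Will", "To", "Smith" })
            {
                Console.WriteLine(
                    $"{name}: default={Tokens(withStopWords, name).Count} term(s), " +
                    $"empty set={Tokens(withoutStopWords, name).Count} term(s)");
            }
        }
    }
}
```

A name that yields zero terms leaves the sort field without a value for that document, which matches the behaviour described above.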

IMHO, the stop word set should be opt-in, and the constructor that does not take the CharArraySet parameter should default to CharArraySet.EMPTY_SET, not ENGLISH_STOP_WORDS_SET. Mostly because not everyone is building search engines for English only, and not everyone is building search engines for common words only. Using ENGLISH_STOP_WORDS_SET by default assumes a certain business case / usage of the search engine, but really it should not interfere with the context of usage.

laimis commented 1 year ago

What you need to do here is use a StringField, not a TextField, when adding the lastname-sort field to the index. This ensures that the field is not analyzed when it is indexed.
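A minimal sketch of that suggestion, under some assumptions: the class SortFieldDemo, the method SortedLastNames, the stored "lastname" field, and the in-memory RAMDirectory are all hypothetical and exist only to keep the example self-contained. The RemoveDiacritics() extension from the original post is left out because its implementation is not shown.

```csharp
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

// Hypothetical demo class, not part of the original thread.
public static class SortFieldDemo
{
    // Assumption: the application targets the 4.8 compatibility version.
    private const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;

    // Indexes the names into an in-memory index and returns them sorted on "lastname-sort".
    public static List<string> SortedLastNames(IEnumerable<string> lastNames)
    {
        var result = new List<string>();
        using (var dir = new RAMDirectory())
        using (var analyzer = new StandardAnalyzer(AppLuceneVersion)) // default stop words on purpose
        {
            var config = new IndexWriterConfig(AppLuceneVersion, analyzer);
            using (var writer = new IndexWriter(dir, config))
            {
                foreach (string lastName in lastNames)
                {
                    var doc = new Document
                    {
                        // StringField is indexed as a single term and bypasses the analyzer.
                        new StringField("lastname-sort", lastName, Field.Store.NO),
                        // Stored copy so the hit can be displayed (illustrative field name).
                        new StoredField("lastname", lastName)
                    };
                    writer.AddDocument(doc);
                }
                writer.Commit();
            }

            using (DirectoryReader reader = DirectoryReader.Open(dir))
            {
                var searcher = new IndexSearcher(reader);
                var sort = new Sort(new SortField("lastname-sort", SortFieldType.STRING, false));
                TopDocs hits = searcher.Search(new MatchAllDocsQuery(), 10, sort);
                foreach (ScoreDoc hit in hits.ScoreDocs)
                {
                    result.Add(searcher.Doc(hit.Doc).Get("lastname"));
                }
            }
        }
        return result;
    }

    public static void Main()
    {
        // "Will" and "To" are English stop words, but they keep their sort value here.
        foreach (string name in SortedLastNames(new[] { "Will", "Adams", "To" }))
        {
            Console.WriteLine(name);
        }
    }
}
```

Because a StringField indexes the exact value as one term, the sort is case-sensitive and the analyzer's stop word set no longer affects it. Normalizing the value (for example lowercasing it) before indexing is a separate choice left to the application.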

NightOwl888 commented 1 year ago

I am closing this because it is not a bug that needs addressing in Lucene.NET. Please direct usability questions to StackOverflow or the Lucene.NET user mailing list.