Shazwazza / Examine

A .NET indexing and search engine powered by Lucene.Net
https://shazwazza.github.io/Examine/

Examine Facets proposal #310

Closed nikcio closed 6 months ago

nikcio commented 1 year ago

Examine Facets proposal

Linked PR #311
Linked PR #312
Linked PR #313

What is faceted search?

Faceted search is a technique that involves augmenting traditional search techniques with a faceted navigation system, allowing users to narrow down search results by applying multiple filters based on faceted classification of the items. It is sometimes referred to as a parametric search technique. A faceted classification system classifies each information element along multiple explicit dimensions, called facets, enabling the classifications to be accessed and ordered in multiple ways rather than in a single, pre-determined, taxonomic order. (Source)

Description

This proposal covers the implementation of faceted search in Examine. It focuses mainly on finding reasonable interface abstractions and the best approach for building the feature, and less on specific implementation details.

Previous implementation

Facets are available for Examine when targeting .NET framework via the Examine.Facets package by Callum Whyte. This proposal is based on the implementation of that package.

Motivation

I'm currently working on a project which would have a great use for Examine facets.

Approach

1. Externalize the Faceting features

The first approach is to externalize the faceting features to a separate NuGet package, in the same way Examine.Facets works. A POC of this approach can be seen here: POC: Examine.Facets

2. Internalize the Faceting features (I think this is the best approach)

The second approach is to make faceting available directly in the existing Examine package and its existing classes. This would make it possible to avoid creating the following implementations and instead add the features to the existing classes:

This approach therefore allows the default searcher to do a faceted search, which lowers the barrier to entry: a developer wouldn't have to register a separate searcher and explicitly use it for faceted searching, as is needed in these examples:

Example 1 (From the existing Examine.Facets package by Callum)

// Setup
if (_examineManager.TryGetIndex("CustomIndex", out IIndex index))
{
    if (index is LuceneIndex luceneIndex)
    {
        var searcher = new FacetSearcher(
            "FacetSearcher",
            luceneIndex.GetIndexWriter(),
            luceneIndex.DefaultAnalyzer,
            luceneIndex.FieldValueTypeCollection
        );

        _examineManager.AddSearcher(searcher);
    }
}

// Fetching a searcher
_examineManager.TryGetSearcher("FacetSearcher", out ISearcher searcher);

Example 2 (From a test in the POC)

TrackingIndexWriter writer = indexer.IndexWriter;
var searcherManager = new SearcherManager(writer.IndexWriter, true, new SearcherFactory());
var searcher = new FacetSearcher(nameof(FacetSearcher), searcherManager, analyzer, indexer.FieldValueTypeCollection);

Example source

Structure

Bases

Searching

public interface IFacetField
{
    /// <summary>
    /// The field name
    /// </summary>
    string Field { get; }

    /// <summary>
    /// The field to get the facet field from
    /// </summary>
    string FacetField { get; set; }
}

Searching results

public interface IFacetValue
{
    /// <summary>
    /// The label of the facet value
    /// </summary>
    string Label { get; }

    /// <summary>
    /// The occurrence of a facet field
    /// </summary>
    float Value { get; }
}

public interface IFacetResult : IEnumerable<IFacetValue>
{
    /// <summary>
    /// Gets the facet for a label
    /// </summary>
    /// <param name="label"></param>
    /// <returns></returns>
    IFacetValue Facet(string label);
}

Example facet result

Tags:

Software (121)
People (20)
Packages (2)

public interface IFacetResults
{
    /// <summary>
    /// Facets from the search
    /// </summary>
    IDictionary<string, IFacetResult> Facets { get; }
}
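To make the shape of these result interfaces concrete, here is a minimal sketch that rebuilds the "Tags" example with in-memory objects. The three interfaces are copied from the proposal; the FacetValue, FacetResult and FacetResults classes are hypothetical stand-ins written only for this illustration and are not part of Examine.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Rebuild the "Tags" facet from the example above using in-memory objects.
var tags = new FacetResult(new[]
{
    new FacetValue("Software", 121),
    new FacetValue("People", 20),
    new FacetValue("Packages", 2),
});

IFacetResults results = new FacetResults(
    new Dictionary<string, IFacetResult> { ["Tags"] = tags });

// Enumerate every value of one facet.
foreach (IFacetValue value in results.Facets["Tags"])
{
    Console.WriteLine($"{value.Label} ({value.Value})");
}

// Look up a single value by its label.
IFacetValue people = results.Facets["Tags"].Facet("People");
Console.WriteLine($"People only: {people.Value}");

// Interfaces as written in the proposal.
public interface IFacetValue
{
    string Label { get; }
    float Value { get; }
}

public interface IFacetResult : IEnumerable<IFacetValue>
{
    IFacetValue Facet(string label);
}

public interface IFacetResults
{
    IDictionary<string, IFacetResult> Facets { get; }
}

// Hypothetical implementations, for illustration only.
public sealed record FacetValue(string Label, float Value) : IFacetValue;

public sealed class FacetResult : IFacetResult
{
    private readonly List<IFacetValue> _values;

    public FacetResult(IEnumerable<IFacetValue> values) => _values = values.ToList();

    // Throws InvalidOperationException when the label is unknown (a choice made for this sketch).
    public IFacetValue Facet(string label) => _values.First(v => v.Label == label);

    public IEnumerator<IFacetValue> GetEnumerator() => _values.GetEnumerator();

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

public sealed class FacetResults : IFacetResults
{
    public FacetResults(IDictionary<string, IFacetResult> facets) => Facets = facets;

    public IDictionary<string, IFacetResult> Facets { get; }
}
```

Keying IFacetResults.Facets by field name means a caller can fetch one facet directly, which is the lookup the GetFacet extension in the next section is meant to provide.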

Extensions

/// <summary>
/// Get the values for a particular facet in the results
/// </summary>
public static IFacetResult GetFacet(this ISearchResults searchResults, string field)
{
    // Implementation
}

/// <summary>
/// Get all of the facets in the results
/// </summary>
public static IEnumerable<IFacetResult> GetFacets(this ISearchResults searchResults)
{
    // Implementation
}

Types of facets

Sources of information about Lucene's facet search are:

String Facet

Allows for counting the documents that share the same string value.

New FieldDefinitionTypes:

Extends the existing FullText and FullTextSortable types and adds the required SortedSetDocValuesFacetField to the indexed document. Without this field, SortedSetDocValuesFacetCounts will not work.

New query methods

On IQuery

/// <summary>
/// Add a facet string to the current query
/// </summary>
IFacetQueryField Facet(string field);

/// <summary>
/// Add a facet string to the current query, filtered by value
/// </summary>
IFacetQueryField Facet(string field, string value);

/// <summary>
/// Add a facet string to the current query, filtered by multiple values
/// </summary>
IFacetQueryField Facet(string field, string[] values);

public interface IFacetQueryField : IBooleanOperation
{
    /// <summary>
    /// Maximum number of terms to return
    /// </summary>
    IFacetQueryField MaxCount(int count);

    /// <summary>
    /// Sets the field where the facet information will be read from
    /// </summary>
    IFacetQueryField FacetField(string fieldName);
}

New IFacetField

public interface IFacetFullTextField : IFacetField
{
    /// <summary>
    /// Maximum number of terms to return
    /// </summary>
    int MaxCount { get; set; }

    /// <summary>
    /// Filter values
    /// </summary>
    string[] Values { get; set; }
}
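As a conceptual illustration of what MaxCount and Values mean for a string facet, the sketch below counts equal string values with plain LINQ. It does not use Lucene or Examine; CountFacet and the sample data are hypothetical, and the real work would be done by Lucene's facet counting over the indexed facet field.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an index: one "Tags" value per document.
string[] indexedTags = { "Software", "People", "Software", "Packages", "People", "Software" };

// No value filter, up to 10 terms.
var all = CountFacet(indexedTags, maxCount: 10, values: Array.Empty<string>());

// Restricted to two values, returning only the single most frequent one.
var filtered = CountFacet(indexedTags, maxCount: 1, values: new[] { "People", "Packages" });

foreach (var (label, count) in all)
{
    Console.WriteLine($"{label} ({count})");
}

foreach (var (label, count) in filtered)
{
    Console.WriteLine($"filtered: {label} ({count})");
}

// Counts documents sharing the same string value.
// An empty 'values' array means "do not filter"; 'maxCount' caps the number of terms returned.
static List<(string Label, int Count)> CountFacet(
    IEnumerable<string> fieldValues, int maxCount, string[] values)
{
    return fieldValues
        .Where(v => values.Length == 0 || values.Contains(v))
        .GroupBy(v => v)
        .Select(g => (Label: g.Key, Count: g.Count()))
        .OrderByDescending(f => f.Count)
        .ThenBy(f => f.Label, StringComparer.Ordinal)
        .Take(maxCount)
        .ToList();
}
```

With this sample data the unfiltered call yields Software (3), People (2) and Packages (1), and the filtered call yields only People (2).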

Facets config / New index methods - Optional addition. Probably not the most used feature

FacetsConfig allows for setting some values in the index which are useful for faceting (see the API docs).

On LuceneIndexOptions

public FacetsConfig FacetConfig { get; set; }

This will make it possible to set the facet configuration on the specific index and reuse it when searching.

Methods to change the field used when reading facets (the default is $facets, which is where all facet values are indexed if FacetsConfig.SetIndexFieldName(dimName, indexFieldName) is not called):

See IFacetQueryField (it's not possible to specify the reading field in range facets).

This will make it possible to set the faceting field per facet field, giving the most flexibility when composing a query.

Note: The FacetConfig will also need to be available at search time in the searchExecutor to be used in the constructor when using Taxonomy

Numeric Range Facet

Used with numbers to build range facets. For example, it would group documents of the same price range.
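A small sketch of the price-range idea, using plain C# rather than Lucene: each range gets a label and the facet value is the number of documents whose price falls inside it. PriceRange is a hypothetical stand-in for the DoubleRange type referenced below, and the inclusive/exclusive bounds are an assumption made for this example.

```csharp
using System;
using System.Linq;

// Hypothetical prices of six indexed documents.
double[] prices = { 4.99, 12.50, 19.99, 25.00, 74.95, 120.00 };

// Hypothetical ranges: lower bound inclusive, upper bound exclusive.
PriceRange[] ranges =
{
    new("0-20", 0, 20),
    new("20-100", 20, 100),
    new("100+", 100, double.MaxValue),
};

foreach (var range in ranges)
{
    // The facet value for a range is the number of matching documents.
    int count = prices.Count(range.Contains);
    Console.WriteLine($"{range.Label} ({count})");
}

// Stand-in range type for illustration only; not the Lucene.Net DoubleRange.
public sealed record PriceRange(string Label, double Min, double Max)
{
    public bool Contains(double value) => value >= Min && value < Max;
}
```

With these prices the three ranges contain 3, 2 and 1 documents respectively.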

Double Range

New FieldDefinitionTypes:

Extends the existing Double and Float types and adds the required DoubleDocValuesField and SingleDocValuesField respectively, as well as the SortedSetDocValuesFacetField to enable string-like faceting, to the indexed document. Without these fields, DoubleDocValuesField and SingleDocValuesField faceting will not work.

New query methods

On IQuery

/// <summary>
/// Add a range facet to the current query
/// </summary>
IFacetRangeQueryField Facet(string field, DoubleRange[] doubleRanges);

public interface IFacetDoubleRangeQueryField : IBooleanOperation
{
    /// <summary>
    /// Sets if the range query is on <see cref="float"/> values
    /// </summary>
    /// <param name="isFloat"></param>
    /// <returns></returns>
    IFacetDoubleRangeQueryField IsFloat(bool isFloat);
}

New IFacetField

public interface IFacetDoubleField : IFacetField
{
    DoubleRange[] DoubleRanges { get; set; }
}

Long Range / Numeric range

New FieldDefinitionTypes:

Extends the existing types and adds the required NumericDocValuesField, as well as the SortedSetDocValuesFacetField to enable string-like faceting, to the indexed document. Without these fields, NumericDocValuesField faceting will not work.

New query methods

On IQuery

/// <summary>
/// Add a range facet to the current query
/// </summary>
IFacetRangeQueryField Facet(string field, Int64Range[] longRanges);

public interface IFacetLongRangeQueryField : IBooleanOperation
{
}

New IFacetField

public interface IFacetLongField : IFacetField
{
    Int64Range[] LongRanges { get; set; }
}

Taxonomy Facet

Doing Taxonomy requires using a specific writer (DirectoryTaxonomyWriter) and is therefore out of the scope of this proposal.

See more at: https://norconex.com/facets-with-lucene/


What now

bergmania commented 1 year ago

Wow @nikcio 🤩

Good work on this proposal :)

From a user's point of view, I also like approach 2 the most, but it requires more of the implementation.

Sorry for my lack of knowledge, but in what case will IFacetValue.Value not be an integer/long?

nikcio commented 1 year ago

@bergmania in the Apache Lucene Faceted Search User's Guide, the following is written under 2.2 Facet Associations:

So far we've discussed categories as binary features, where a document either belongs to a category, or not.

While counts are useful in most situations, they are sometimes not sufficiently informative for the user, with respect to deciding which subcategory is more important to display.

For this, the facets package allows to associate a value with a category. The search time interpretation of the associated value is application dependent. For example, a possible interpretation is as a match level (e.g., confidence level). This value can then be used so that a document that is very weakly associated with a certain category will only contribute little to this category's aggregated weight.

So it's possible to use this feature to get floating-point values. I'm not quite sure myself exactly how this is configured and used, so I mostly just based the value type on the FacetResult type from the Lucene.Net.Facet package, because it's the output of the different facet readers in Lucene.NET.
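As a rough illustration of why a facet value need not be a whole number, the sketch below sums per-document association weights for each category next to the plain document count. It uses plain LINQ and hypothetical sample data, and only mirrors the idea described in the quoted guide; it does not show how Lucene.NET configures or computes associations.

```csharp
using System;
using System.Linq;

// Hypothetical (document, category, weight) associations.
// The weight could be read as a confidence level for the match.
var associations = new[]
{
    (Doc: 1, Category: "Software", Weight: 1.0f),
    (Doc: 2, Category: "Software", Weight: 0.25f),
    (Doc: 2, Category: "People", Weight: 0.75f),
    (Doc: 3, Category: "People", Weight: 0.25f),
};

// Aggregate per category: a plain count and a summed weight.
var aggregated = associations
    .GroupBy(a => a.Category)
    .Select(g => (Label: g.Key, Count: g.Count(), Weight: g.Sum(a => a.Weight)))
    .ToList();

foreach (var facet in aggregated)
{
    Console.WriteLine($"{facet.Label}: count={facet.Count}, weight={facet.Weight}");
}
```

Both categories have a count of 2 here, but their summed weights differ (1.25 and 1.0), which is the kind of value that needs a float rather than an integer.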

nikcio commented 1 year ago

@Shazwazza I've created some PRs that could implement this proposal and add some great functionality / documentation to Examine. Please let me know if I can do anything to help the PRs along.

PRs:

311 (Facet implementation)

312 (XML docs on facets and in the project where missing)

313 (Nullable feature for project and facets feature)

Shazwazza commented 1 year ago

@nikcio Sorry to keep you waiting on this one, I have all of these starred in my inbox and will get to them soon, just a bit swamped this week.

nzdev commented 1 year ago

Added support for efficient deep paging (SearchAfter) for faceted and non faceted search https://github.com/Shazwazza/Examine/pull/321.

Shazwazza commented 1 year ago

@nikcio I started having a look through all this yesterday and so far all I can say is what an amazing job you've done so far. I'll keep reviewing over this week and we can determine if there are any tweaks necessary. @nzdev also thanks a ton for your recent PRs and help. Hopefully can get this all merged in for xmas/NY and get what will most likely be a new major release out.

nzdev commented 1 year ago

WIP https://github.com/nzdev/Examine/tree/v3/feature/facet-taxonomy Facet Taxonomy Index support. Needed for Hierarchical facets. Also is something like 20%-25% faster according to Lucene.Net docs.

dealloc commented 1 year ago

Has there been an update on this?

nikcio commented 1 year ago

@dealloc I believe #311 is very close to being done. Just waiting for the stars to align😅🤞

(I'm a little unsure what we are waiting on to be honest 😬😅)

nikcio commented 1 year ago

Status:

Here is how I see the status of the Examine repo. What do you think @Shazwazza and @nzdev?

Merged PRs (Release/4.0)

Needs to be merged from the Release/3.0 branch - This is done in #345

Still needs to be merged into Release/4.0

Still needs to be done

[This list is based on #345 being merged into release/4.0]

The massive amount of work that has been going on has created some warnings in the project, some more important than others. Here is a rundown of the ones I think we should look at before a stable 4.0 release.

Other PRs that could probably come in another future release:

Stale PRs?

Preview/Alpha/Beta release

I think the best way to get a feel for what still needs to be done is to make an early release and ask around the community for people to test it out. This can be done once the current PRs under "Still needs to be merged into Release/4.0" and #345 are merged in. @Shazwazza

nzdev commented 1 year ago

I'm wondering if the FacetsConfig class could be abstracted away by setting hierarchy / multi-valued facets on the index fields instead.

nikcio commented 1 year ago

Just to keep this thread continuous also see https://github.com/Shazwazza/Examine/pull/345#issuecomment-1656701699 (From @nzdev )

Here's what I'm thinking. Have the next release of Examine be 4.0, but avoid breaking API changes for 3.x. This means Umbraco 10 and 12 can choose to relax the allowed Examine version to be v3 or v4.

Steps:

  1. Merge PR #346, "Record v3 shipped API using Microsoft.CodeAnalysis.PublicApiAnalyzers", which tracks the shipped API for V3.
  2. Merge PR #347, "Fix compatibility with V3 API", which merges V3 into V4 and fixes any API compatibility issues.
  3. Rebase #345, "Merges the changes from the release/3.0 branch to release/4.0 branch", to fix any nullability / XML docs issues.
  4. Add the new APIs as unshipped to the txt files and release a beta.
  5. Allow time for feedback, resolve feedback, and add the new APIs to the shipped.txt files. (Regenerate the files to include tracking of API nullability annotations, since v3 makes no nullability claims.)
  6. Release 4.0.

nzdev commented 1 year ago

https://github.com/Shazwazza/Examine/pull/347 supersedes https://github.com/Shazwazza/Examine/pull/339

bjarnef commented 1 year ago

If possible I think it would be great if the support for Spatial API https://github.com/Shazwazza/Examine/pull/328 is included in v4 as well.

Something we could have used in a recent project is faceted search where one of the facets is a search for items within a distance, e.g. 10, 20, ... or 100 km. In that case I guess facets would be combined with spatial search.

In this specific project we used something like this to combine it with the existing (filtered) query.

public LuceneSearchResults SearchByDistance(Query query, Coordinate coordinate, int distanceInKm, QueryOptions? options = null)
{
    if (Index is not LuceneIndex luceneIndex)
        throw new InvalidOperationException($"Index {Index.Name} is not a LuceneIndex");

    int maxLevels = 11;

    // Create a SpatialStrategy
    var ctx = SpatialContext.Geo;
    var strategy = new RecursivePrefixTreeStrategy(
                    new GeohashPrefixTree(ctx, maxLevels),
                    fieldName: Constants.Examine.CourseInstance.FieldNames.GeoLocation);

    var lat = coordinate.Latitude;
    var lng = coordinate.Longitude;

    var results = DoSpatialSearch(ctx, strategy, luceneIndex, query, distanceInKm, lat, lng, options ?? QueryOptions.Default);

    return results;
}

private static LuceneSearchResults DoSpatialSearch(
            SpatialContext ctx, SpatialStrategy strategy,
            LuceneIndex index, Query q, double distanceInKm, double lat, double lng,
            QueryOptions options)
  {
      var searcher = (LuceneSearcher)index.Searcher;
      var searchContext = searcher.GetSearchContext();

      using ISearcherReference searchRef = searchContext.GetSearcher();

      var indexSearcher = searchRef.IndexSearcher;

      GetXYFromCoords(lat, lng, out var x, out var y);

      var distance = DistanceUtils.Dist2Degrees(distanceInKm, DistanceUtils.EarthMeanRadiusKilometers);

      // Make a circle around the search point
      var shape = ctx.MakeCircle(x, y, distance);
      var args = new SpatialArgs(
                  SpatialOperation.Intersects, shape);

      // Create the Lucene Filter
      var filter = strategy.MakeFilter(args);

      // Create the Lucene Query
      var query = strategy.MakeQuery(args);

      var startingPoint = ctx.MakePoint(x, y);
      var valueSource = strategy.MakeDistanceValueSource(startingPoint);

      var sortByDistance = new Sort(valueSource.GetSortField(false)).Rewrite(indexSearcher);

      ValueSourceFilter vsf = new ValueSourceFilter(new QueryWrapperFilter(query), valueSource, 0, distance);
      var filteredSpatial = new FilteredQuery(new MatchAllDocsQuery(), vsf);
      var spatialRankingQuery = new FunctionQuery(valueSource);

      IList<BooleanClause> existingClauses = ((BooleanQuery)q).GetClauses();

      BooleanQuery bq = new()
      {
          { filteredSpatial, Occur.MUST },
          { spatialRankingQuery, Occur.MUST }
      };

      var includesStartDate = existingClauses.Any(x => x.Query.ToString().Contains("startDate"));
      foreach (var c in existingClauses)
      {
          var queryString = c.Query.ToString();
          if (queryString.Contains("latestValidDate"))
          {
              // Only keep the latestValidDate range when there is no startDate clause
              if (!includesStartDate)
              {
                  bq.Add(GetRangeQuery(queryString), Occur.MUST);
              }
              continue;
          }
          if (queryString.Contains("startDate"))
          {
              bq.Add(GetRangeQuery(queryString), Occur.MUST);
              continue;
          }

          bq.Add(c);
      }

      int maxDoc = indexSearcher.IndexReader.MaxDoc;

      var maxResults = Math.Min((options.Skip + 1) * options.Take, maxDoc);
      maxResults = maxResults >= 1 ? maxResults : QueryOptions.DefaultMaxResults;

      ICollector topDocsCollector = TopFieldCollector.Create(sortByDistance, maxResults, false, false, false, false);

      indexSearcher.Search(bq, filter, topDocsCollector);

      TopDocs topDocs = ((TopFieldCollector)topDocsCollector).GetTopDocs(options.Skip, options.Take);

      var totalItemCount = topDocs.TotalHits;

      var results = new List<ISearchResult>();
      for (int i = 0; i < topDocs.ScoreDocs.Length; i++)
      {
          var result = GetSearchResult(i, topDocs, indexSearcher);
          results.Add(result);
      }

      return new LuceneSearchResults(results, totalItemCount);
  }

nzdev commented 1 year ago

I think for now it would make sense to merge the pr that helps with V3 compatibility and then release a v4 as it's already a big change. After that it's possible to introduce another v4.x release with the other APIs for filtering, function queries, facet drill down and spatial.

bjarnef commented 1 year ago

Would it be possible to have a Beta or RC build of a release with Facets feature of v3/v4?

nzdev commented 1 year ago

Remaining tasks

  1. Merge https://github.com/Shazwazza/Examine/pull/349
  2. Release a beta
  3. Allow time for feedback, resolve feedback, add new api to the shipped.txt files (Cut from unshipped and add to shipped)
  4. Release 4.0

Shazwazza commented 1 year ago

@nzdev + @nikcio the build for a potential beta is here https://github.com/Shazwazza/Examine/actions/runs/6165490399

If anyone has time, the artifacts have the created Nuget package, would be awesome if someone could test consuming that locally before I publish it to nuget.org?

nzdev commented 1 year ago

Works for me

nikcio commented 1 year ago

@Shazwazza Let's get the beast out there. I don't have time myself right now to test it but if it works for @nzdev that should be good enough to release the beta 🚀

nzdev commented 1 year ago

Hi @Shazwazza . Can you please publish the beta to Nuget. Thanks

Shazwazza commented 1 year ago

Yeah for sure, sorry it's been a hectic month 😕 will get it out tomorrow. Thanks so much for pushing this along and all your support.

bjarnef commented 12 months ago

@Shazwazza any update on this? 😊 we would love to test this further and we have potential projects where facets would be useful, both in terms of commerce or regular Umbraco content.

Shazwazza commented 11 months ago

Just getting betas out now, just pushed 3.2.0-beta https://github.com/Shazwazza/Examine/releases/tag/v3.2.0-beta.9

Shazwazza commented 11 months ago

And this one out now too https://github.com/Shazwazza/Examine/releases/tag/v4.0.0-beta.1

Shazwazza commented 11 months ago

I'm just trying to get the docfx build running against the release/v4.0 branch, but it is failing. I think this is due to having attributes on things that cannot be inherited, but we have a lot of them so it's a bit hard to go through them all. I've found a few that cannot inherit, so I will keep at it. I didn't want to tweet the releases until the docs were up.

Shazwazza commented 11 months ago

Keeps failing with

[23-10-27 10:21:17.275]Error:Error extracting metadata for /github/workspace/src/Examine.Lucene/Examine.Lucene.csproj,/github/workspace/src/Examine.Core/Examine.Core.csproj,/github/workspace/src/Examine.Host/Examine.csproj: System.NullReferenceException: Object reference not set to an instance of an object
  at Microsoft.DocAsCode.Metadata.ManagedReference.CopyInherited.InheritDoc (Microsoft.DocAsCode.Metadata.ManagedReference.MetadataItem dest, Microsoft.DocAsCode.Metadata.ManagedReference.ResolverContext context) [0x0007f] in <a8c398537c454be982e8eaec7eb97dd4>:0 
  at Microsoft.DocAsCode.Metadata.ManagedReference.CopyInherited+<>c__DisplayClass0_0.<Run>b__1 (Microsoft.DocAsCode.Metadata.ManagedReference.MetadataItem current, Microsoft.DocAsCode.Metadata.ManagedReference.MetadataItem parent) [0x00008] in <a8c398537c454be982e8eaec7eb97dd4>:0 
  at Microsoft.DocAsCode.Common.TreeIterator.Preorder[T] (T current, T parent, System.Func`2[T,TResult] childrenGetter, System.Func`3[T1,T2,TResult] action) [0x0000c] in <f27dcd834d6d4f32ac0a576c1732f2f1>:0 
  at Microsoft.DocAsCode.Common.TreeIterator.Preorder[T] (T current, T parent, System.Func`2[T,TResult] childrenGetter, System.Func`3[T1,T2,TResult] action) [0x00036] in <f27dcd834d6d4f32ac0a576c1732f2f1>:0 
  at Microsoft.DocAsCode.Common.TreeIterator.Preorder[T] (T current, T parent, System.Func`2[T,TResult] childrenGetter, System.Func`3[T1,T2,TResult] action) [0x00036] in <f27dcd834d6d4f32ac0a576c1732f2f1>:0 
  at Microsoft.DocAsCode.Common.TreeIterator.Preorder[T] (T current, T parent, System.Func`2[T,TResult] childrenGetter, System.Func`3[T1,T2,TResult] action) [0x00036] in <f27dcd834d6d4f32ac0a576c1732f2f1>:0 
  at Microsoft.DocAsCode.Metadata.ManagedReference.CopyInherited.Run (Microsoft.DocAsCode.Metadata.ManagedReference.MetadataModel yaml, Microsoft.DocAsCode.Metadata.ManagedReference.ResolverContext context) [0x00013] in <a8c398537c454be982e8eaec7eb97dd4>:0 
  at Microsoft.DocAsCode.Metadata.ManagedReference.YamlMetadataResolver.ExecutePipeline (Microsoft.DocAsCode.Metadata.ManagedReference.MetadataModel yaml, Microsoft.DocAsCode.Metadata.ManagedReference.ResolverContext context) [0x00015] in <a8c398537c454be982e8eaec7eb97dd4>:0 
  at Microsoft.DocAsCode.Metadata.ManagedReference.YamlMetadataResolver.ResolveMetadata (System.Collections.Generic.Dictionary`2[TKey,TValue] allMembers, System.Collections.Generic.Dictionary`2[TKey,TValue] allReferences, System.Boolean preserveRawInlineComments) [0x00092] in <a8c398537c454be982e8eaec7eb97dd4>:0 
  at Microsoft.DocAsCode.Metadata.ManagedReference.ExtractMetadataWorker+<ResolveAndExportYamlMetadata>d__19.MoveNext () [0x0003b] in <a8c398537c454be982e8eaec7eb97dd4>:0 
  at System.Collections.Generic.List`1[T].AddEnumerable (System.Collections.Generic.IEnumerable`1[T] enumerable) [0x00059] in <533173d24dae460899d2b10975534bb0>:0 
  at System.Collections.Generic.List`1[T]..ctor (System.Collections.Generic.IEnumerable`1[T] collection) [0x00062] in <533173d24dae460899d2b10975534bb0>:0 
  at System.Linq.Enumerable.ToList[TSource] (System.Collections.Generic.IEnumerable`1[T] source) [0x00018] in <5b415632df1f4365ae2242b1a257bb5b>:0 
  at Microsoft.DocAsCode.Metadata.ManagedReference.ExtractMetadataWorker.SaveAllMembersFromCacheAsync () [0x00be7] in <a8c398537c454be982e8eaec7eb97dd4>:0 
  at Microsoft.DocAsCode.Metadata.ManagedReference.ExtractMetadataWorker.ExtractMetadataAsync () [0x000c0] in <a8c398537c454be982e8eaec7eb97dd4>:0

see https://github.com/Shazwazza/Examine/actions/runs/6672776075/job/18137309467

nzdev commented 11 months ago

Fixed on https://github.com/Shazwazza/Examine/pull/356 @Shazwazza

nzdev commented 10 months ago

I've raised a few prs that provide abstractions for the rest of the faceting feature set.

dealloc commented 6 months ago

What is blocking this feature currently from being released? We're doing a rewrite of some pretty complex software that would greatly be simplified if Examine had facets out of the box (and geospatial, but that's not in context here)

Shazwazza commented 6 months ago

Nothing is blocking this, it is already released. I will close this proposal task. There's even docs for it https://shazwazza.github.io/Examine/articles/configuration.html#facets-configuration. Use the latest version of Examine for this functionality.