Closed nikcio closed 6 months ago
Wow @nikcio 🤩
Good work on this proposal :)
From a user's point of view, I also like approach 2 the most, but it requires more from the implementations.
Sorry for my lack of knowledge, but in what case will IFacetValue.Value not be an integer/long?
@bergmania the Apache Lucene Faceted Search User's Guide says the following under 2.2 Facet Associations:
So far we've discussed categories as binary features, where a document either belongs to a category, or not.
While counts are useful in most situations, they are sometimes not sufficiently informative for the user, with respect to deciding which subcategory is more important to display.
For this, the facets package allows to associate a value with a category. The search time interpretation of the associated value is application dependent. For example, a possible interpretation is as a match level (e.g., confidence level). This value can then be used so that a document that is very weakly associated with a certain category will only contribute little to this category's aggregated weight.
So it's possible to use this feature to get floating-point values. I'm not quite sure myself exactly how this is configured and used, so I mostly just based the value type on the FacetResult type from the Lucene.Net.Facet package, because it's the output of the different facet readers in Lucene.NET.
@Shazwazza I've created some PRs that could implement this proposal and add some great functionality / documentation to Examine. Please let me know if I can do anything to help the PRs along.
PRs:
@nikcio Sorry to keep you waiting on this one, I have all of these starred in my inbox and will get to them soon, just a bit swamped this week.
Added support for efficient deep paging (SearchAfter) for faceted and non faceted search https://github.com/Shazwazza/Examine/pull/321.
@nikcio I started having a look through all this yesterday and so far all I can say is what an amazing job you've done so far. I'll keep reviewing over this week and we can determine if there are any tweaks necessary. @nzdev also thanks a ton for your recent PRs and help. Hopefully can get this all merged in for xmas/NY and get what will most likely be a new major release out.
WIP https://github.com/nzdev/Examine/tree/v3/feature/facet-taxonomy Facet Taxonomy Index support. Needed for Hierarchical facets. Also is something like 20%-25% faster according to Lucene.Net docs.
Has there been an update on this?
@dealloc I believe #311 is very close to being done. Just waiting for the stars to align😅🤞
(I'm a little unsure what we are waiting on to be honest 😬😅)
Status:
Here is how I see the status of the Examine repo. What do you think @Shazwazza and @nzdev?
[This list is based on #345 being merged into release/4.0]
The massive amount of work that has been going on has created some warnings in the project, some of which are more important than others. Here is a rundown of the ones I think we should look at before a stable 4.0 release.
LuceneIndex.cs - line 1212
LuceneIndex.cs - line 1288
I think the best way to get a feel for what still needs to be done is to make an early release and ask people in the community to test it out. This can be done when the current PRs in the "Still needs to be merged into Release/4.0" list and #345 are merged in. @Shazwazza
I'm wondering if the facetconfig class could be abstracted away by setting hierarchy/ multi facets on the index fields instead
Just to keep this thread continuous also see https://github.com/Shazwazza/Examine/pull/345#issuecomment-1656701699 (From @nzdev )
Here's what I'm thinking. Have the next release of Examine be 4.0, but avoid breaking API changes for 3.x. This means Umbraco 10 and 12 can choose to relax the allowed Examine version to be v3 or v4.
Steps:
- Merge PR Record v3 shipped API using Microsoft.CodeAnalysis.PublicApiAnalyzers #346 which tracks the shipped API for V3.
- Merge PR Fix compatibility with V3 API #347 which merges V3 into V4 and fixes any API compatibility issues.
- Rebase Merges the changes from the release/3.0 branch to release/4.0 branch #345 to fix any nullability / xml docs issues.
- Add the new APIs as unshipped to the txt files and release a beta
- Allow time for feedback, resolve feedback, add new api to the shipped.txt files. (regen the files to include tracking API nullability annotations. This is due to v3 not making nullability claims)
- Release 4.0
If possible I think it would be great if the support for Spatial API https://github.com/Shazwazza/Examine/pull/328 is included in v4 as well.
Something we could have used in a recent project is faceted search where one of the facets searches for items within a distance, e.g. 10, 20, ... or 100 km. In that case I guess facets would be combined with spatial search.
In this specific project we used something like this to combine it with the existing (filtered) query.
public LuceneSearchResults SearchByDistance(Query query, Coordinate coordinate, int distanceInKm, QueryOptions? options = null)
{
    if (Index is not LuceneIndex luceneIndex)
        throw new InvalidOperationException($"Index {Index.Name} is not a LuceneIndex");

    int maxLevels = 11;

    // Create a SpatialStrategy
    var ctx = SpatialContext.Geo;
    var strategy = new RecursivePrefixTreeStrategy(
        new GeohashPrefixTree(ctx, maxLevels),
        fieldName: Constants.Examine.CourseInstance.FieldNames.GeoLocation);

    var lat = coordinate.Latitude;
    var lng = coordinate.Longitude;

    var results = DoSpatialSearch(ctx, strategy, luceneIndex, query, distanceInKm, lat, lng, options ?? QueryOptions.Default);
    return results;
}
private static LuceneSearchResults DoSpatialSearch(
    SpatialContext ctx, SpatialStrategy strategy,
    LuceneIndex index, Query q, double distanceInKm, double lat, double lng,
    QueryOptions options)
{
    var searcher = (LuceneSearcher)index.Searcher;
    var searchContext = searcher.GetSearchContext();

    using ISearcherReference searchRef = searchContext.GetSearcher();
    var indexSearcher = searchRef.IndexSearcher;

    GetXYFromCoords(lat, lng, out var x, out var y);
    var distance = DistanceUtils.Dist2Degrees(distanceInKm, DistanceUtils.EarthMeanRadiusKilometers);

    // Make a circle around the search point
    var shape = ctx.MakeCircle(x, y, distance);
    var args = new SpatialArgs(
        SpatialOperation.Intersects, shape);

    // Create the Lucene Filter
    var filter = strategy.MakeFilter(args);

    // Create the Lucene Query
    var query = strategy.MakeQuery(args);

    // Sort by distance from the search point (ascending)
    var startingPoint = ctx.MakePoint(x, y);
    var valueSource = strategy.MakeDistanceValueSource(startingPoint);
    var sortByDistance = new Sort(valueSource.GetSortField(false)).Rewrite(indexSearcher);

    ValueSourceFilter vsf = new ValueSourceFilter(new QueryWrapperFilter(query), valueSource, 0, distance);
    var filteredSpatial = new FilteredQuery(new MatchAllDocsQuery(), vsf);
    var spatialRankingQuery = new FunctionQuery(valueSource);

    // Note: this cast assumes the incoming query is a BooleanQuery
    IList<BooleanClause> existingClauses = ((BooleanQuery)q).GetClauses();

    BooleanQuery bq = new()
    {
        { filteredSpatial, Occur.MUST },
        { spatialRankingQuery, Occur.MUST }
    };

    var includesStartDate = existingClauses.Any(clause => clause.Query.ToString().Contains("startDate"));

    // Copy the existing clauses into the combined query, rebuilding the date ranges
    foreach (var c in existingClauses)
    {
        var queryString = c.Query.ToString();

        if (queryString.Contains("latestValidDate"))
        {
            // Only keep the latestValidDate range when there is no startDate clause
            if (!includesStartDate)
            {
                bq.Add(GetRangeQuery(queryString), Occur.MUST);
            }
            continue;
        }

        if (queryString.Contains("startDate"))
        {
            bq.Add(GetRangeQuery(queryString), Occur.MUST);
            continue;
        }

        bq.Add(c);
    }

    int maxDoc = indexSearcher.IndexReader.MaxDoc;
    var maxResults = Math.Min((options.Skip + 1) * options.Take, maxDoc);
    maxResults = maxResults >= 1 ? maxResults : QueryOptions.DefaultMaxResults;

    ICollector topDocsCollector = TopFieldCollector.Create(sortByDistance, maxResults, false, false, false, false);
    indexSearcher.Search(bq, filter, topDocsCollector);

    TopDocs topDocs = ((TopFieldCollector)topDocsCollector).GetTopDocs(options.Skip, options.Take);
    var totalItemCount = topDocs.TotalHits;

    var results = new List<ISearchResult>();
    for (int i = 0; i < topDocs.ScoreDocs.Length; i++)
    {
        var result = GetSearchResult(i, topDocs, indexSearcher);
        results.Add(result);
    }

    return new LuceneSearchResults(results, totalItemCount);
}
I think for now it would make sense to merge the pr that helps with V3 compatibility and then release a v4 as it's already a big change. After that it's possible to introduce another v4.x release with the other APIs for filtering, function queries, facet drill down and spatial.
Would it be possible to have a Beta or RC build of a release with Facets feature of v3/v4?
Remaining tasks
@nzdev + @nikcio the build for a potential beta is here https://github.com/Shazwazza/Examine/actions/runs/6165490399
If anyone has time, the artifacts have the created Nuget package, would be awesome if someone could test consuming that locally before I publish it to nuget.org?
Works for me
@Shazwazza Let's get the beast out there. I don't have time myself right now to test it but if it works for @nzdev that should be good enough to release the beta 🚀
Hi @Shazwazza . Can you please publish the beta to Nuget. Thanks
Yeah for sure, sorry it's been a hectic month 😕 will get it out tomorrow. Thanks so much for pushing this along and all your support.
@Shazwazza any update on this? 😊 we would love to test this further and we have potential projects where facets would be useful, both in terms of commerce or regular Umbraco content.
Just getting betas out now, just pushed 3.2.0-beta https://github.com/Shazwazza/Examine/releases/tag/v3.2.0-beta.9
And this one out now too https://github.com/Shazwazza/Examine/releases/tag/v4.0.0-beta.1
I'm just trying to get the docfx build running against the release/v4.0 branch but it is failing which I think is due to having
Keeps failing with
[23-10-27 10:21:17.275]Error:Error extracting metadata for /github/workspace/src/Examine.Lucene/Examine.Lucene.csproj,/github/workspace/src/Examine.Core/Examine.Core.csproj,/github/workspace/src/Examine.Host/Examine.csproj: System.NullReferenceException: Object reference not set to an instance of an object
at Microsoft.DocAsCode.Metadata.ManagedReference.CopyInherited.InheritDoc (Microsoft.DocAsCode.Metadata.ManagedReference.MetadataItem dest, Microsoft.DocAsCode.Metadata.ManagedReference.ResolverContext context) [0x0007f] in <a8c398537c454be982e8eaec7eb97dd4>:0
at Microsoft.DocAsCode.Metadata.ManagedReference.CopyInherited+<>c__DisplayClass0_0.<Run>b__1 (Microsoft.DocAsCode.Metadata.ManagedReference.MetadataItem current, Microsoft.DocAsCode.Metadata.ManagedReference.MetadataItem parent) [0x00008] in <a8c398537c454be982e8eaec7eb97dd4>:0
at Microsoft.DocAsCode.Common.TreeIterator.Preorder[T] (T current, T parent, System.Func`2[T,TResult] childrenGetter, System.Func`3[T1,T2,TResult] action) [0x0000c] in <f27dcd834d6d4f32ac0a576c1732f2f1>:0
at Microsoft.DocAsCode.Common.TreeIterator.Preorder[T] (T current, T parent, System.Func`2[T,TResult] childrenGetter, System.Func`3[T1,T2,TResult] action) [0x00036] in <f27dcd834d6d4f32ac0a576c1732f2f1>:0
at Microsoft.DocAsCode.Common.TreeIterator.Preorder[T] (T current, T parent, System.Func`2[T,TResult] childrenGetter, System.Func`3[T1,T2,TResult] action) [0x00036] in <f27dcd834d6d4f32ac0a576c1732f2f1>:0
at Microsoft.DocAsCode.Common.TreeIterator.Preorder[T] (T current, T parent, System.Func`2[T,TResult] childrenGetter, System.Func`3[T1,T2,TResult] action) [0x00036] in <f27dcd834d6d4f32ac0a576c1732f2f1>:0
at Microsoft.DocAsCode.Metadata.ManagedReference.CopyInherited.Run (Microsoft.DocAsCode.Metadata.ManagedReference.MetadataModel yaml, Microsoft.DocAsCode.Metadata.ManagedReference.ResolverContext context) [0x00013] in <a8c398537c454be982e8eaec7eb97dd4>:0
at Microsoft.DocAsCode.Metadata.ManagedReference.YamlMetadataResolver.ExecutePipeline (Microsoft.DocAsCode.Metadata.ManagedReference.MetadataModel yaml, Microsoft.DocAsCode.Metadata.ManagedReference.ResolverContext context) [0x00015] in <a8c398537c454be982e8eaec7eb97dd4>:0
at Microsoft.DocAsCode.Metadata.ManagedReference.YamlMetadataResolver.ResolveMetadata (System.Collections.Generic.Dictionary`2[TKey,TValue] allMembers, System.Collections.Generic.Dictionary`2[TKey,TValue] allReferences, System.Boolean preserveRawInlineComments) [0x00092] in <a8c398537c454be982e8eaec7eb97dd4>:0
at Microsoft.DocAsCode.Metadata.ManagedReference.ExtractMetadataWorker+<ResolveAndExportYamlMetadata>d__19.MoveNext () [0x0003b] in <a8c398537c454be982e8eaec7eb97dd4>:0
at System.Collections.Generic.List`1[T].AddEnumerable (System.Collections.Generic.IEnumerable`1[T] enumerable) [0x00059] in <533173d24dae460899d2b10975534bb0>:0
at System.Collections.Generic.List`1[T]..ctor (System.Collections.Generic.IEnumerable`1[T] collection) [0x00062] in <533173d24dae460899d2b10975534bb0>:0
at System.Linq.Enumerable.ToList[TSource] (System.Collections.Generic.IEnumerable`1[T] source) [0x00018] in <5b415632df1f4365ae2242b1a257bb5b>:0
at Microsoft.DocAsCode.Metadata.ManagedReference.ExtractMetadataWorker.SaveAllMembersFromCacheAsync () [0x00be7] in <a8c398537c454be982e8eaec7eb97dd4>:0
at Microsoft.DocAsCode.Metadata.ManagedReference.ExtractMetadataWorker.ExtractMetadataAsync () [0x000c0] in <a8c398537c454be982e8eaec7eb97dd4>:0
see https://github.com/Shazwazza/Examine/actions/runs/6672776075/job/18137309467
Fixed on https://github.com/Shazwazza/Examine/pull/356 @Shazwazza
I've raised a few prs that provide abstractions for the rest of the faceting feature set.
What is blocking this feature currently from being released? We're doing a rewrite of some pretty complex software that would greatly be simplified if Examine had facets out of the box (and geospatial, but that's not in context here)
Nothing is blocking this, it is already released. I will close this proposal task. There's even docs for it https://shazwazza.github.io/Examine/articles/configuration.html#facets-configuration. Use the latest version of Examine for this functionality.
Examine Facets proposal
What is faceted search?
Faceted search is a technique that involves augmenting traditional search techniques with a faceted navigation system, allowing users to narrow down search results by applying multiple filters based on faceted classification of the items. It is sometimes referred to as a parametric search technique. A faceted classification system classifies each information element along multiple explicit dimensions, called facets, enabling the classifications to be accessed and ordered in multiple ways rather than in a single, pre-determined, taxonomic order. (Source)
Description
This proposal is about the implementation of faceted search in Examine. It focuses mostly on finding reasonable interface abstractions and the best approach for building the feature, and less on specific implementation details.
Previous implementation
Facets are available for Examine when targeting .NET framework via the Examine.Facets package by Callum Whyte. This proposal is based on the implementation of that package.
Motivation
I'm currently working on a project which would have a great use for Examine facets.
Approach
1. Externalize the Faceting features
The first approach is to externalize the Faceting features to a separate NuGet package, in the same way Examine.Facets works. A POC of this approach can be seen here: POC: Examine.Facets
2. Internalize the Faceting features (I think this is the best approach)
The second approach is to make faceting available directly in the existing Examine package and existing classes. This would make it possible to avoid creating the following implementations and instead add the features to the existing classes:
This approach, therefore, allows the default searcher to do a faceted search, and this will lower the barrier to entry because a developer wouldn't have to register a separate searcher and explicitly use this searcher for faceted searching. As seen in this example:
Example 1 (From the existing Examine.Facets package by Callum)
Example 2 (From a test in the POC)
Example source
Structure
Bases
Searching
Searching results
Example facet result
Extensions
Types of facets
Sources of information about Lucene's facet search are:
String Facet
Allows for counting the documents that share the same string value.
New FieldDefinitionTypes:
- FacetFullText
- FacetFullTextSortable
Extends the existing FullText and FullTextSortable types and adds the required SortedSetDocValuesFacetField to the indexed document. Without this field, SortedSetDocValuesFacetCounts will not work.
New query methods
On IQuery
New IFacetField
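For reference, the Lucene.NET mechanics that a string facet field type would have to wrap can be sketched roughly as below. This is a minimal sketch using Lucene.NET 4.8 directly, not the proposed Examine API: the "Author" dimension, the sample values, and the in-memory directory are assumptions made up for illustration. It shows why the SortedSetDocValuesFacetField has to be added at index time (through FacetsConfig.Build) before SortedSetDocValuesFacetCounts can count anything.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Facet;
using Lucene.Net.Facet.SortedSet;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

const LuceneVersion Version = LuceneVersion.LUCENE_48;

using var dir = new RAMDirectory();
var config = new FacetsConfig();

// Index time: every document carries a facet field for the (hypothetical) "Author" dimension.
using (var writer = new IndexWriter(dir, new IndexWriterConfig(Version, new StandardAnalyzer(Version))))
{
    foreach (var author in new[] { "Bob", "Lisa", "Lisa" })
    {
        var doc = new Document();
        doc.Add(new SortedSetDocValuesFacetField("Author", author));
        // FacetsConfig.Build translates the facet field into the fields Lucene actually indexes.
        writer.AddDocument(config.Build(doc));
    }
}

// Search time: collect the matching documents, then count them per facet label.
using var reader = DirectoryReader.Open(dir);
var searcher = new IndexSearcher(reader);
var state = new DefaultSortedSetDocValuesReaderState(reader);
var collector = new FacetsCollector();
FacetsCollector.Search(searcher, new MatchAllDocsQuery(), 10, collector);

Facets facets = new SortedSetDocValuesFacetCounts(state, collector);
FacetResult result = facets.GetTopChildren(10, "Author");

foreach (LabelAndValue labelAndValue in result.LabelValues)
{
    // LabelAndValue.Value is a float, even though it holds a count here.
    Console.WriteLine($"{labelAndValue.Label}: {labelAndValue.Value}");
}
```

The float-typed Value on the result is the reason the proposal's facet value type is not restricted to integers.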
Facets config / New index methods - Optional addition. Probably not the most used feature
FacetsConfig allows for setting some values in the index which are useful for faceting. API docs
On LuceneIndexOptions
This will make it possible to set the facet configuration on the specific index and reuse it when searching.
Methods to change the field used when reading facets (default is $facets, which is where all facet values are indexed if FacetsConfig.SetIndexFieldName(dimName, indexFieldName) is not called):
See IFacetQueryField (It's not possible to specify the reading field in range facets)
This will make it possible to set the faceting field per facet field, giving the most flexibility when composing a query.
Note: The FacetsConfig will also need to be available at search time in the searchExecutor to be used in the constructor when using Taxonomy
Numeric Range Facet
Used with numbers to build range facets. For example, it would group documents of the same price range.
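To make the idea concrete, here is a small library-free sketch of what a range facet computes: each value is counted towards every range that contains it. The PriceRange record, the RangeFacetSketch class, and the inclusive-minimum / exclusive-maximum convention are assumptions for illustration only. They are not part of Examine or Lucene.NET, where range facet counts are computed from the doc values fields described below.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical range definition: Min is inclusive, Max is exclusive.
public sealed record PriceRange(string Label, double Min, double Max)
{
    public bool Contains(double value) => value >= Min && value < Max;
}

public static class RangeFacetSketch
{
    // Returns the number of values that fall into each range, keyed by range label.
    // Ranges may overlap, in which case a value is counted once per matching range.
    public static Dictionary<string, int> Count(IEnumerable<double> values, IReadOnlyList<PriceRange> ranges)
    {
        var counts = ranges.ToDictionary(range => range.Label, _ => 0);

        foreach (var value in values)
        {
            foreach (var range in ranges)
            {
                if (range.Contains(value))
                {
                    counts[range.Label]++;
                }
            }
        }

        return counts;
    }
}
```

With ranges such as "0-10" and "10-100", a price of 5 is counted under "0-10", a price of 10 under "10-100", and a price outside every range is not counted at all.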
Double Range
New FieldDefinitionTypes:
- FacetDouble
- FacetFloat
Extends the existing Double and Float types and adds the required DoubleDocValuesField and SingleDocValuesField respectively, as well as the SortedSetDocValuesFacetField to enable string-like faceting, to the indexed document. Without the fields, DoubleDocValuesField and SingleDocValuesField faceting will not work.
New query methods
On IQuery
New IFacetField
Long Range / Numeric range
New FieldDefinitionTypes:
- FacetInt
- FacetLong
- FacetDateTime
- FacetDateYear
- FacetDateMonth
- FacetDateDay
- FacetDateHour
- FacetDateMinute
Extends the existing types and adds the required NumericDocValuesField, as well as the SortedSetDocValuesFacetField to enable string-like faceting, to the indexed document. Without the fields, NumericDocValuesField faceting will not work.
New query methods
On IQuery
New IFacetField
Taxonomy Facet
Doing Taxonomy requires using a specific writer (DirectoryTaxonomyWriter) and is therefore out of the scope of this proposal.
See more at: https://norconex.com/facets-with-lucene/
What now