Shazwazza / Examine

A .NET indexing and search engine powered by Lucene.Net
https://shazwazza.github.io/Examine/

Raw lucene query change after recycling app pool #181

Open bjarnef opened 4 years ago

bjarnef commented 4 years ago

On a project on Umbraco Cloud we have a filter dropdown to filter a product list based on sizes. Some of these values contain the Danish characters æ, ø and å.

However, we have noticed that this sometimes doesn't work after a deploy or after rebuilding ModelsBuilder. It seems to come down to recycling the app pool, as we can reproduce the issue by making a change in web.config.

The project is using Umbraco v8.6.4 and Examine v1.0.5

Here is the result before the recycle. Note that the raw Lucene query is:

+__NodeTypeAlias:productpage +categoryParents:1186 +(size:"nyfodt/50 cm")

After the application is recycled and the page is refreshed, no results are returned and the raw Lucene query is:

+__NodeTypeAlias:productpage +categoryParents:1186 +(size:"nyfødt/50 cm")

After rebuilding the external index from the Examine dashboard, it works again and generates the first Lucene query.

For now we have implemented this temporary hack:

if (!string.IsNullOrEmpty(size))
{
    var sizes = size?.Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries);

    // Hack to ensure Examine generates the same query after recycling the app pool.
    sizes = sizes.Select(x => x.Replace("æ", "ae").Replace("ø", "o").Replace("å", "a")).ToArray();

    query = query.And().GroupedOr(new [] { "size" }, sizes);
}
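
The replacement chain above covers only the lowercase letters. A slightly more general version of the same workaround could look like the sketch below. `DanishFolding` and `Fold` are hypothetical names, not part of Examine or Umbraco, and the mapping simply mirrors the æ => ae, ø => o, å => a replacements from the hack, extended to uppercase.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not part of Examine or Umbraco): folds Danish letters
// to ASCII so the search term no longer contains æ, ø or å.
public static class DanishFolding
{
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { 'æ', "ae" }, { 'ø', "o" }, { 'å', "a" },
        { 'Æ', "AE" }, { 'Ø', "O" }, { 'Å', "A" }
    };

    public static string Fold(string value)
    {
        if (string.IsNullOrEmpty(value))
        {
            return value;
        }

        var builder = new StringBuilder(value.Length);
        foreach (var c in value)
        {
            // Replace mapped letters, copy everything else unchanged.
            if (Map.TryGetValue(c, out var replacement))
            {
                builder.Append(replacement);
            }
            else
            {
                builder.Append(c);
            }
        }

        return builder.ToString();
    }
}
```

In the snippet above it would be used as `sizes = sizes.Select(x => DanishFolding.Fold(x)).ToArray();`. Like the original hack, it only helps if the terms stored in the index are folded the same way.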

Strangely enough, when there are no results, searching for "Nyfødt/50 cm" in the Examine dashboard does find a result, and the product list then returns a result in the frontend without my clicking the "Rebuild index" button. I'm not sure whether the search triggers a reindex of the found results or a full reindex.

We have also seen a similar issue when searching with the .NativeQuery() method and using æ, ø and å in the search term, but I'm not sure whether it is the same underlying issue.

Shazwazza commented 4 years ago

That's a pretty strange issue! Sounds like something to do with Thread culture or something along those lines.

Are you able to provide me with the full code that creates this query, or better yet a slimmed-down version that I can use to try to replicate it?

bjarnef commented 4 years ago

Very strange indeed! 🦒🐰

Sure, here is the entire controller:

public class ProductSurfaceController : SurfaceController
{
    private readonly IExamineManager _examineManager;

    public ProductSurfaceController(IExamineManager examineManager)
    {
        _examineManager = examineManager;
    }

    [ChildActionOnly]
    public ActionResult FeaturedProducts()
    {
        var featuredProducts = CurrentPage.GetHomePage()
            .FeaturedProducts.OfType<ProductPage>();

        return PartialView("ProductList", featuredProducts);
    }

    [ChildActionOnly]
    public ActionResult ProductListByCollection(int collectionId, int p = 1, int ps = 12)
    {
        return PartialView("PagedProductList", GetPagedProducts(collectionId, null, p, ps));
    }

    [ChildActionOnly]
    public ActionResult ProductListByCategory(string category, int p = 1, int ps = 12)
    {
        return PartialView("PagedProductList", GetPagedProducts(null, category, p, ps));
    }

    private PagedResult<ProductPage> GetPagedProducts(int? collectionId, string category, int page, int pageSize)
    {
        if (_examineManager.TryGetIndex("ExternalIndex", out var index))
        {
            var searcher = index.GetSearcher();
            var query = searcher.CreateQuery()
                .Field("__NodeTypeAlias", ProductPage.ModelTypeAlias);

            if (collectionId.HasValue)
            {
                query = query.And().Field("parentID", collectionId.Value);
            }

            if (!category.IsNullOrWhiteSpace())
            {
                query = query.And().Field("categoryAliases", category);
            }

            var results = query.OrderBy(new SortableField("name", SortType.String)).Execute(pageSize * page);
            var totalResults = results.TotalItemCount;
            var pagedResults = results.Skip(pageSize * (page - 1));

            return new PagedResult<ProductPage>(totalResults, page, pageSize)
            {
                Items = pagedResults.Select(x => UmbracoContext.Content.GetById(int.Parse(x.Id))).OfType<ProductPage>()
            };
        }

        return new PagedResult<ProductPage>(0, page, pageSize);
    }
}

This is more or less a copy from Vendr demo store with a few tweaks. https://github.com/vendrhub/vendr-demo-store/blob/main/src/Vendr.DemoStore/Web/Controllers/ProductSurfaceController.cs

I think you can reproduce the issue in the demo store, which you can clone from here: https://github.com/vendrhub/vendr-demo-store

bjarnef commented 4 years ago

@Shazwazza I could reproduce the issue in the Vendr demo store.

Steps to reproduce

  1. Add some new categories containing æ, ø or å (this might also be relevant for specific characters in other cultures). I chose "Øl" = beer 🍺

  2. Select this category on one or more products. I have chosen "Home / Products / Good and Proper / Iron Buddha"

  3. In DemoStoreComponent.cs, extend Examine to index category names as well:
if (e.ValueSet.ItemType.InvariantEquals(ProductPage.ModelTypeAlias))
{
    // Make sure some categories are defined
    if (e.ValueSet.Values.ContainsKey("categories"))
    {
        // Prepare a new collection for category aliases
        var categoryAliases = new List<string>();
        var categoryNames = new List<string>();

        // Parse the comma separated list of category UDIs
        var categoryIds = e.ValueSet.GetValue("categories").ToString().Split(',').Select(GuidUdi.Parse).ToList();

        // Fetch the category nodes and extract the category alias, adding it to the aliases collection
        using (var ctx = _umbracoContextFactory.EnsureUmbracoContext())
        {
            foreach (var categoryId in categoryIds)
            {
                var category = ctx.UmbracoContext.Content.GetById(categoryId);
                if (category != null)
                {
                    categoryAliases.Add(category.UrlSegment);
                    categoryNames.Add(category.Name);
                }
            }
        }

        // If we have some aliases, add these to the lucene index in a searchable way
        if (categoryAliases.Count > 0)
        {
            e.ValueSet.Add("categoryAliases", string.Join(" ", categoryAliases));
        }

        // Likewise, add the category names to the index
        if (categoryNames.Count > 0)
        {
            e.ValueSet.Add("categoryNames", string.Join(" ", categoryNames));
        }
    }
}
  4. Change to the following in ProductSurfaceController:
private PagedResult<ProductPage> GetPagedProducts(int? collectionId, string category, int page, int pageSize)
{
    category = Request.QueryString["category"];

    if (_examineManager.TryGetIndex("ExternalIndex", out var index))
    {
        var searcher = index.GetSearcher();
        var query = searcher.CreateQuery()
            .Field("__NodeTypeAlias", ProductPage.ModelTypeAlias);

        if (collectionId.HasValue)
        {
            query = query.And().Field("parentID", collectionId.Value);
        }

        if (!category.IsNullOrWhiteSpace())
        {
            //query = query.And().Field("categoryAliases", category);
            query = query.And().Field("categoryNames", category);
        }

        var results = query.OrderBy(new SortableField("name", SortType.String)).Execute(pageSize * page);
        var totalResults = results.TotalItemCount;
        var pagedResults = results.Skip(pageSize * (page - 1));

        return new PagedResult<ProductPage>(totalResults, page, pageSize)
        {
            Items = pagedResults.Select(x => UmbracoContext.Content.GetById(int.Parse(x.Id))).OfType<ProductPage>()
        };
    }

    return new PagedResult<ProductPage>(0, page, pageSize);
}
  5. Compile and rebuild the external index via the Examine dashboard.

  6. Set a breakpoint in GetPagedProducts(), start debugging and navigate to this URL: /products/good-and-proper/?category=øl. You should see a result, produced by the following raw Lucene query:

+__NodeTypeAlias:productpage +(parentID:[1147 TO 1147]) +categoryNames:ol

  7. Make a small change in web.config or recycle the app pool from IIS.

  8. Access the same URL as before. The raw Lucene query should now contain ø instead of o, and no result is returned in the frontend:

+__NodeTypeAlias:productpage +(parentID:[1147 TO 1147]) +categoryNames:øl

  9. After rebuilding the index from the Examine dashboard, it should return a result again.

Shazwazza commented 4 years ago

So is this an issue with the data that is stored in the index or the query that is being produced? There are of course 2 variations of this query:

+__NodeTypeAlias:productpage +(parentID:[1147 TO 1147]) +categoryNames:ol

and

+__NodeTypeAlias:productpage +(parentID:[1147 TO 1147]) +categoryNames:øl

That category string is coming from the string category parameter of the GetPagedProducts method.

What is the expectation here? Which query is correct? And where is the data coming from to populate the string category parameter?

Is the data going into the index consistently via categoryNames (var categoryNames = new List<string>(); in your DemoStoreComponent.cs)?

bjarnef commented 4 years ago

I am not sure if it is an issue with the stored data in the index. But I would expect the raw lucene query to be identical before and after recycle.

Does a raw Lucene query normally work with culture-specific characters like the Danish æ, ø and å? From what I have seen, these are replaced as follows:

æ => ae, ø => o, å => a

After rebuilding the index, both versions seem to work, but after a recycle only the first version with o works (whereas Examine then generates the query with ø).

Yes, but the value of category is overwritten here for testing: category = Request.QueryString["category"]; (you could change it where the method is called instead, if you want).

The data is coming from the querystring category in the url: /products/good-and-proper/?category=øl

This is just to test the case where ø is passed in as the value: Examine generates a raw Lucene query with o, but with ø after recycling the app pool 😊

Shazwazza commented 4 years ago

The data is coming from the querystring category in the url:

Yes exactly, it's not Examine changing this value from ø => o, this is the value that is just being passed to it via the query string. So I'm pretty sure the problem starts with how that is happening. It might not be an examine issue at all?

bjarnef commented 4 years ago

I don't understand how the querystring value would change after a recycle of the app pool. It also doesn't explain why it works again after rebuilding the index. So I am pretty sure it is an issue either in the Examine (Lucene) query or in how the data is indexed. Maybe there is a difference between rebuilding the index via the Examine dashboard and how Examine rebuilds the index in the background on app pool recycle?

Off topic: On another Umbraco Cloud project we use the querystring to do a search using Clerk.io, and I haven't noticed a term containing ø being changed to o there.

I can try with a static variable instead, but I don't think that would make a difference from my previous observations.

bjarnef commented 4 years ago

Before recycling app pool, when it works:

[screenshot]

After recycling app pool and it doesn't work:

[screenshot]

The value of the querystring category is øl in both cases, but the Lucene query generated by Examine is different. 🤔🤷‍♂️
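
To confirm this beyond what the debugger displays, the exact UTF-16 code units of the incoming value could be logged before and after the recycle. The sketch below is only an illustration; `CodeUnits` and `Describe` are hypothetical names. A precomposed ø shows up as the single unit U+00F8, while a look-alike built from o plus a combining character would show up as two units.

```csharp
using System.Linq;

// Hypothetical diagnostic helper: renders each UTF-16 code unit of a string
// as U+XXXX so that visually identical inputs can be compared exactly.
public static class CodeUnits
{
    public static string Describe(string value)
    {
        return string.Join(" ", (value ?? string.Empty).Select(c => "U+" + ((int)c).ToString("X4")));
    }
}
```

For example, `CodeUnits.Describe("øl")` returns `U+00F8 U+006C`. If the output is identical before and after the recycle, the difference has to come from how the term is analyzed rather than from the input.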

bjarnef commented 4 years ago

I have also specifically tried with string category = "øl"; in code, without using the querystring, but with the same result as mentioned here.

bjarnef commented 4 years ago

@Shazwazza did you have a chance to reproduce this, e.g. in the Vendr demo store, following these steps?

Shazwazza commented 4 years ago

No, not yet.

bielu commented 2 years ago

@bjarnef I had a similar issue lately. I found out that it is caused by how Lucene loads vectors into memory: apparently vectors loaded from file are different from those indexed in memory. I resolved the issue by replacing the Standard Analyzer with the Simple Analyzer, which is not perfect but at least does the job :)

bielu commented 2 years ago

@bjarnef I actually debugged this further, and I was wrong! The change of analyzer did not resolve it; my code contains other fixes for relevance. The issue is caused by how norms are omitted internally in Lucene. The code that resolved the issue for me is:

public class NormalizedTextFactory : IFieldValueTypeFactory
{
    public IIndexFieldValueType Create(string fieldName)
    {
        return new NormalizedFullTextType(fieldName, new StandardAnalyzer(Version.LUCENE_30), false);
    }
}

public class NormalizedFullTextType : FullTextType
{
    private readonly bool _sortable;

    public NormalizedFullTextType(string fieldName, Analyzer analyzer = null, bool sortable = false)
        : base(fieldName, analyzer, sortable)
    {
        _sortable = sortable;
    }

    protected override void AddSingleValue(Document doc, object value)
    {
        if (TryConvert<string>(value, out var str))
        {
            // Strip double quotes, then index the value as an analyzed field with norms omitted.
            var field = new Field(FieldName, str.Replace("\"", ""), Field.Store.YES, Field.Index.ANALYZED);
            field.OmitNorms = true;
            doc.Add(field);
        }
    }
}