episerver / EPiServer.Labs.Find.Toolbox

Find Toolbox features an improved synonym implementation, MinimumShouldMatch, MatchPhrase, MatchPrefixPhrase, FuzzyQuery, WildcardQuery and custom CMS search providers
Apache License 2.0

Searches with multi-word synonyms produce unexpected results when using UsingSynonymsImproved() #7

Closed mortenbouvet closed 1 year ago

mortenbouvet commented 2 years ago

Not sure if this is a bug, a feature, or us not understanding how UsingSynonymsImproved() is supposed to work.

Describe the bug/issue: Searches with multi-word synonyms produce unexpected results when using UsingSynonymsImproved(). Usually the search returns too many hits, including hits that are not relevant, compared to manually searching for the two phrases/terms without adding synonyms.

Our search code:

    query = query.For(searchTerm, q => { q.Query = searchTerm; })
        .WithAndAsDefaultOperator()
        .InField(x => x.ItemName)
        .AndInField(x => x.Code)
        .AndInField(x => x.MainCategoryName)
        .AndInField(x => x.SearchWords)
        .AndInField(x => x.PaidSearchWords)
        .AndInField(x => x.ItemPc1VendorsitemId)
        .AndInField(x => x.ItemTradeMarkName)
        .MinimumShouldMatch("2")
        .UsingSynonymsImproved(TimeSpan.Zero)
        .UsingRelevanceImproved(x => x.ItemName)
        .FuzzyMatch(x => x.ItemName)
        .WildcardMatch(x => x.ItemName);

Actual vs expected behavior: A typical example would be searches for the phrases «lett melk» and «lettmelk». Searching for these two terms individually returns 30 products in total (13 + 17). If we add a bidirectional synonym between the phrase «lett melk» and «lettmelk», we expect the result to contain all 30 products regardless of which term is used. However, searching for «lett melk» returns 187 products and searching for «lettmelk» returns 17 products.

Using the standard UsingSynonyms() somewhat resolves this issue. However, the standard UsingSynonyms() is unreliable with multi-word phrases/synonyms and will sometimes cause the search to return an empty result.

Additional information: Episerver.Find 13.4.8.0, Episerver.Labs.Find.Toolbox 1.3.1

dada81 commented 2 years ago

Hi @mortenbouvet

Thanks for writing.

Looking at your code I have my suspicions. FuzzyMatch and WildcardMatch could easily give you more results (some of them unexpected or unwanted), especially with multiple terms, i.e. lett melk instead of merely lettmelk.

But for me to understand the reason for these results, it would be great if you could capture the JSON payload of the request made against the _search API endpoint. You should be able to do this with Fiddler, filtering for URLs containing /_search. Alternatively, if you give me the index name and a timestamp, I could pull it from our logs.

Please give me some results you want to see and some that you see but don't expect.

Also, I think a one-directional synonym lett melk -> lettmelk would suffice. I assume it's not two terms in your data. Or do you want to get hits for melk and lett when you search for lettmelk as well?

mortenbouvet commented 2 years ago

Hello @dada81

Thanks for the quick reply.

Index: seasservicegrossisteneas_seas01mstr9z6a4prep

I have made multiple search requests, both with and without synonyms. Below is an overview of the timestamps and results.

I have also described the expected result for each search

Without synonyms:

Here we expect the two different searches to return different results. This works as expected and returns a very accurate result.

• «lett melk»
  o Tue, 26 Jul 2022 07:37:10 GMT
  o 14 hits

• «lettmelk»
  o Tue, 26 Jul 2022 07:39:47 GMT
  o 21 hits

With synonym «lett melk» > «lettmelk» (unidirectional)

Here we expect the search for «lett melk» to return the results for «lett melk» AND «lettmelk», giving a total of 35 hits (14 + 21). However, the term «lett melk» returns 271 hits. From what I can see, this is because the result includes products that contain the word «lett» OR «melk». Since we are using a unidirectional synonym, the term «lettmelk» returns the expected result.

• «lett melk»
  o Tue, 26 Jul 2022 07:42:23 GMT
  o 271 hits
  o Expected product: «MELK LETT 0,25L»
  o Unexpected product: «HAVREGRYN LETTK. 750G»

• «lettmelk»
  o Tue, 26 Jul 2022 07:43:06 GMT
  o 21 hits
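
To make the difference between the expected and the observed behavior concrete, here is a minimal, self-contained sketch. It does not use the Find API at all: the class and method names are hypothetical, and a case-insensitive substring test is only an assumed, rough stand-in for how terms are matched against ItemName. "Expected" models (lett AND melk) OR lettmelk; "Reported" models the reading described above, where any single word is enough.

```csharp
using System;
using System.Linq;

// Hypothetical model of the two query shapes; not the Toolbox implementation.
public static class PhraseSynonymSketch
{
    // Assumption: case-insensitive substring test as a stand-in for term matching.
    static bool Has(string name, string term) =>
        name.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0;

    // AND semantics: every word of the phrase must occur in the name.
    static bool AllWords(string name, string phrase) =>
        phrase.Split(' ').All(w => Has(name, w));

    // OR semantics: a single word of the phrase is enough.
    static bool AnyWord(string name, string phrase) =>
        phrase.Split(' ').Any(w => Has(name, w));

    // Expected: the phrase and its synonym are alternatives, each kept as an AND group.
    public static bool Expected(string name, string phrase, string synonym) =>
        AllWords(name, phrase) || AllWords(name, synonym);

    // Reported: all words of the phrase and the synonym are OR-ed together.
    public static bool Reported(string name, string phrase, string synonym) =>
        AnyWord(name, phrase) || AnyWord(name, synonym);

    public static void Main()
    {
        // Product names taken from the report above.
        var names = new[] { "MELK LETT 0,25L", "HAVREGRYN LETTK. 750G" };
        foreach (var name in names)
        {
            Console.WriteLine(
                $"{name}: expected={Expected(name, "lett melk", "lettmelk")}, " +
                $"reported={Reported(name, "lett melk", "lettmelk")}");
        }
    }
}
```

Under these assumptions «MELK LETT 0,25L» matches in both models, while «HAVREGRYN LETTK. 750G» matches only in the OR-ed model, which is consistent with it showing up as an unexpected hit.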

With synonym «lett melk» <> «lettmelk» (bidirectional)

Here we see exactly the same result as with the unidirectional synonym. We expected both search terms to return the same result, so it seems that the bidirectional synonym might not be working.

• «lett melk»
  o Tue, 26 Jul 2022 07:43:55 GMT
  o 271 hits

• «lettmelk»
  o Tue, 26 Jul 2022 07:44:28 GMT
  o 21 hits

Thank you.

dada81 commented 2 years ago

Looks like we're having some issues with getting the full JSON for these search requests. They are currently truncated and not very useful. Until that is fixed, it would be great if you could capture these requests locally with Fiddler.

mortenbouvet commented 2 years ago

Hi again, is it safe to upload the Fiddler data to this public repository? My client is unsure about the sensitivity as it's production data. Is there any other way for us to share this data with you?

dada81 commented 2 years ago

Hi Morten

You can send it directly to me if that works.

Kind regards, Daniel


mortenbouvet commented 2 years ago

Hello Daniel

We have provided request examples using fiddler as requested on https://github.com/episerver/EPiServer.Labs.Find.Toolbox/issues/7

While testing we discovered that this issue might not only be related to multi word synonyms, but also search terms that include a synonym. Example:

We added a synonym for "burger" <> "hamburger" (both unidirectional and bidirectional). We then searched for "burger brød". Since we added hamburger as a synonym for burger, we expect the result to contain products matching (burger OR hamburger) AND brød. However, it seems to us that the result contains products matching burger OR hamburger OR brød. Some products we do not expect to see here are "HAMBURGERKRYDDER 650G" and "POLARBRØD HVETE 16PK 600G".
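
The per-term case can be sketched the same way. This is only an illustrative model, not the Toolbox code: the class, the method names and the synonym dictionary are hypothetical, "HAMBURGERBRØD 4PK" is a made-up product name, and the substring test is again an assumed stand-in for real term matching. Each query word is expanded into a group (the word plus its synonyms); the expected query is OR inside a group and AND between groups.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of per-term synonym expansion; not the Toolbox implementation.
public static class TermSynonymSketch
{
    // Assumption: case-insensitive substring test as a stand-in for term matching.
    static bool Has(string name, string term) =>
        name.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0;

    // Expands each query word into a group: the word itself plus its synonyms.
    static IEnumerable<string[]> Groups(string query, IDictionary<string, string[]> synonyms) =>
        query.Split(' ').Select(w =>
            synonyms.TryGetValue(w, out var alts)
                ? new[] { w }.Concat(alts).ToArray()
                : new[] { w });

    // Expected: (burger OR hamburger) AND brød.
    public static bool Expected(string name, string query, IDictionary<string, string[]> synonyms) =>
        Groups(query, synonyms).All(g => g.Any(t => Has(name, t)));

    // Reported: burger OR hamburger OR brød.
    public static bool Reported(string name, string query, IDictionary<string, string[]> synonyms) =>
        Groups(query, synonyms).Any(g => g.Any(t => Has(name, t)));

    public static void Main()
    {
        var synonyms = new Dictionary<string, string[]> { ["burger"] = new[] { "hamburger" } };
        // The first two names are from the report; the third is hypothetical.
        var names = new[] { "HAMBURGERKRYDDER 650G", "POLARBRØD HVETE 16PK 600G", "HAMBURGERBRØD 4PK" };
        foreach (var name in names)
        {
            Console.WriteLine(
                $"{name}: expected={Expected(name, "burger brød", synonyms)}, " +
                $"reported={Reported(name, "burger brød", synonyms)}");
        }
    }
}
```

In this model only the hypothetical «HAMBURGERBRØD 4PK» satisfies the grouped query, while all three names satisfy the flat OR query, which mirrors the two unexpected products listed above.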

/Morten

dada81 commented 2 years ago

I believe I understand why this is happening, and I think I have a solution for it. I will try to hand you a package to test during next week.

mortenbouvet commented 2 years ago

Hi,

Sounds great, thanks!

Kind regards, Morten Lensberg



mortenbouvet commented 2 years ago

Hi again,

Any news on the updated package?

dada81 commented 2 years ago

Hi Morten,

Sorry for the lack of updates. It’s fixed but I need to do some more testing before I release it.

mortenbouvet commented 2 years ago

Ok, thanks for the update

dada81 commented 2 years ago

Hi Morten

Please try 1.3.2 and let me know how it looks https://github.com/episerver/EPiServer.Labs.Find.Toolbox/blob/master/EPiServer.Labs.Find.Toolbox.1.3.2.nupkg

mortenbouvet commented 2 years ago

Thanks, will try!

mortenbouvet commented 1 year ago

Hello Daniel, Sorry for the late reply.

Version 1.3.2 works a lot better than before and produces more relevant hits when using multi word synonyms. We do however see some minor differences in the results if we use synonyms compared to not using synonyms. We also see some minor differences when using a bidirectional synonym and search for the term/phrase separately (these should in theory produce the same result?)

From the small sample of products I have tested it seems to somehow be related to the position of the word in our product names.

Example:

I have created a bidirectional synonym between “lett melk” and “lettmelk”.

If I search for “lett melk” I get the following result

If I search for “lettmelk” I get the following result

It also looks like there is some issue if the search term is part of the product name. Example:

We have a product named “tinemelk lett”

If we search for “lett melk” without adding synonyms, this product is returned. However, if we add the synonyms from the previous example, this product is no longer returned.

In conclusion, this update is a major improvement for us. The issues we have discovered so far are very minor, and we should probably work on updating the product data on our end to have a more consistent naming convention, as this will most likely solve the issues we are having.