10up / ElasticPress

A fast and flexible search and query engine for WordPress.
https://elasticpress.io
GNU General Public License v2.0
1.23k stars 312 forks source link

BUG: Creating synonym sets with spaces causes indexing problems, illegal_argument_exception #2838

Closed Finnstax closed 2 years ago

Finnstax commented 2 years ago

Describe the bug If I add a synonym set containing a space. no post containing that set will be indexed

The post with id 845 has the post_title 'Internet of Things (IoT)' which I will use to demonstrate this effect.

  1. No synonym

=> No synonym set is added via the dashboard ➡️ the post is getting indexed

wp elasticpress index --include=845 --show-errors
Success: Indexing posts…
Skipping 0 posts…
Processed posts 0 - 1 of 1. Last Object ID: 845
Success: Number of posts indexed: 1
Success: Sync complete
Total time elapsed: 00:00:00.344000
Success: Done!
  1. Synonym with space

=> If i add the following synonym set via the dashboard

# Defined sets ( equivalent synonyms).
IoT, Internet of Things

➡️ the post (and every post containing the synonym, so each blog post with IoT or Internet of Things) will throw an error on indexing and can not be found

wp elasticpress index --include=845 --show-errors
Success: Indexing posts…
Skipping 0 posts…
Warning: 845 (Post): [illegal_argument_exception] startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=20,endOffset=23,lastStartOffset=22 for field 'post_title'
Processed posts 0 - 1 of 1. Last Object ID: 845
Warning: Number of posts index errors: 1
Success: Number of posts indexed: 0
Success: Sync complete
Total time elapsed: 00:00:00.314000
Success: Done!
  1. Synonym without spaces

=> If i add the following synonym set via the dashboard

# Defined sets ( equivalent synonyms).
IoT, InternetofThings

➡️ the post is getting index

wp elasticpress index --include=845 --show-errors
Indexing posts...
Processed 1/1. Last Object ID: 845
Number of posts indexed: 1
Total time elapsed: 0.713
Success: Done!

Environment information

felipeelia commented 2 years ago

Hi @Finnstax, thanks for opening the issue, I am able to reproduce it locally.

In my tests, if I keep IoT, Internet Of Things as a synonym set and change the post title to simply IoT I still get the same error. Removing the spaces in Internet of Things makes it work.

That said, it seems you found an edge case. We are currently using both word_delimiter_graph and synonym_graph during index time, and that is not fully supported. We could avoid using synonym_graph during sync time by setting a default_search analyzer as discussed in the past. We might get back to that approach.

While we discuss it, do you mind adding the following snippet to your codebase and checking if that fixes the problem?

add_filter(
    'ep_post_mapping',
    function( $mapping ) {
        try {
            $mapping['settings']['analysis']['filter']['ewp_word_delimiter']['split_on_case_change'] = false;
        } catch ( \Throwable $th ) {
            /* noop */
        }
        return $mapping;
    }
);

Thanks!

Finnstax commented 2 years ago

Hey @felipeelia - thank you so much for your reply.

Adding the snippet does unfortunately nothing. I still get the same results as described above.

Is there anything else I could try? Or should we just avoid using synonyms with spaces for now?

felipeelia commented 2 years ago

Sorry @Finnstax, I should have mentioned that you'll need to send the mapping again. Can you please run wp elasticpress index --setup --yes --show-errors? This will delete your index, send the correct mapping, and sync everything again. Thanks!

Finnstax commented 2 years ago

Hey @felipeelia, thank you. That worked flawlessly.

The synonyms with spaces are now also getting indexed and are no longer throwing errors.

I noticed that the synonyms (unlike the other search results) are case sensitive - is there a way to fix this too?

felipeelia commented 2 years ago

Hey @Finnstax, glad to read it worked! Have you already tried to set the synonyms in lower case? I think ElasticPress already indexes all the content in lowercase, so that should work. If that does not work, you can also try to lower case the search term sent by users too.

felipeelia commented 2 years ago

I'm going ahead and closing this one out in favor of #2877.