OnTheGoSystems / wpml-elasticpress

WPML ElasticPress adds support for languages in ElasticPress.
GNU General Public License v3.0
10 stars 3 forks source link

Fix post filter for lang codes with a dash (zh-hans) #20

Closed Dehax closed 1 year ago

Dehax commented 1 year ago

Added keyword for post_lang in order to fix filtering by "zh-hans".

Language codes with dashes give multiple tokens when analyzed:

POST /:index/_analyze
{
    "analyzer": "post_lang_field",
    "text": "zh-hans"
}
{
    "tokens": [
        {
            "token": "zh",
            "start_offset": 0,
            "end_offset": 2,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "hans",
            "start_offset": 3,
            "end_offset": 7,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}

And this filter always gives 0 results:

'term' => [
    'post_lang' => 'zh-hans',
],

So in order to fix the post filtering issue for such language codes we need to either use additional "keyword" field with "keyword" data type or use "keyword" analyzer for post_lang field or just map post_lang field to "keyword" data type.

decodekult commented 1 year ago

Hi there, @Dehax

Thanks for the feedback, and the time you took to craft this pull request.

I am glad to inform you that we have this already solved in https://github.com/OnTheGoSystems/wpml-elasticpress/pull/13 (and an extra, small refinement is on its way).

We will be having a new release for this plugin relatively soon, including this fix among other changes from recent pull requests, and some other surprises.

If you do not mind, I am closing this as the defect fixed here was already fixed in the link above.

Thanks again!

Dehax commented 1 year ago

Hi @decodekult ,

I'm using the newest version from master branch (commit b559d0b) which already includes changes from #13 and this issue is not resolved in it yet. This version still always returns 0 search results when filtering by "zh-hans" language although Elasticsearch index contains many documents with "post_lang": "zh-hans". I hope the small refinement you are mentioning will fix this "dash issue". Will be waiting for the fix. Thanks!

decodekult commented 1 year ago

Hi @Dehax

I have checked this deeply, and you are right! The change mentioned above just included a new analyzer for the language field, but it still uses a standard tokenizer, which splits content by dashes. In addition, another related commit about including support for posts set to display as translated when a translation is missing also introduced some unwanted outcome.

I will be submitting a pull request addressing both things later today or early tomorrow.

In any case, we are playing with alternatives for https://github.com/OnTheGoSystems/wpml-elasticpress/issues/6 that will probably result in not having, or not needing, a language field.

I will update here once I am done. Thanks a lot!

decodekult commented 1 year ago

This is gettign fixed at https://github.com/OnTheGoSystems/wpml-elasticpress/pull/22

As described in the related issue on https://github.com/OnTheGoSystems/wpml-elasticpress/issues/21 we can not turn this field into a keyword. But we can improve its analyzer.

Keeping this one open until we merge and confirm the other solution.

FYI @Dehax

Dehax commented 1 year ago

Hi @decodekult ,

Thanks for the update! It seems that char_group tokenizer was added in Elasticsearch 6.4.0 and thus it's not available in 5.2.2 (the version providen by ElasticPress). I've confirmed this at "ElasticPress Sync" page. After pressing "Delete all Data and Start a Fresh Sync" button the sync process fails with the error:

Mapping failed: Unknown tokenizer type [char_group] for [post_lang_tokenizer]

decodekult commented 1 year ago

Oh well, I just merged the change. I think it opens the issue again.

Sorry for the back-and-forth.

decodekult commented 1 year ago

This one should fix the problem for good: https://github.com/OnTheGoSystems/wpml-elasticpress/pull/25

Thanks for your patience.

Dehax commented 1 year ago

Yes, #25 works with Elasticsearch 5.2.2 and fixes #21. Thanks!

decodekult commented 1 year ago

Great indeed!

Thanks a lot.