10up / ElasticPress

A fast and flexible search and query engine for WordPress.
https://elasticpress.io
GNU General Public License v2.0

Diacritical-insensitive search #2029

Closed pjohanneson closed 3 years ago

pjohanneson commented 3 years ago

Describe the bug

I would like to be able to search for text in post & page content in an ASCII-folded manner. E.g., if I have a post containing the word piñata, I would like searches for both pinata and piñata to find the post.

Steps to Reproduce

  1. Create a post with the word piñata in it.
  2. Search the site for pinata (e.g., example.com/?s=pinata).
  3. The post is not found.

If I'm using WordPress's native search, pinata will find posts with piñata in the content.

Also, it appears that ElasticPress will return posts with piñata in the title when I search for pinata.

Environment information

version: 5.6
site_language: en_US
user_language: en_US
timezone: +00:00
permalink: /%year%/%monthnum%/%day%/%postname%/
https_status: false
multisite: true
user_registration: false
blog_public: 1
default_comment_status: open
environment_type: local
user_count: 2
site_count: 11
network_count: 1
dotorg_communication: true

wp-dropins (1)

advanced-cache.php: true

wp-active-theme

name: Twenty Twenty (twentytwenty)
version: 1.6
author: the WordPress team
author_website: https://wordpress.org/
parent_theme: none
theme_features: core-block-patterns, automatic-feed-links, custom-background, post-thumbnails, custom-logo, title-tag, html5, align-wide, responsive-embeds, customize-selective-refresh-widgets, editor-color-palette, editor-font-sizes, editor-styles, widgets, menus, editor-style
theme_path: /srv/www/vhosts/brandonu.local/wp-content/themes/twentytwenty
auto_update: Disabled

wp-themes-inactive (17)

berkley: version: 1.0.0, author: BU Web Team, Auto-updates disabled
Brandon University Lockdown: author: Brandon University web team, version: (undefined), Auto-updates disabled
Brandon University 2014: version: 1.0, author: Patrick Johanneson, Auto-updates disabled
Brandon U Store (Nozama Child): version: 1.0, author: CSSIgniter, Auto-updates disabled
Brandon University Theme 2016: version: 1.0, author: BU Web Team, Auto-updates disabled
The Canadian Journal of Native Studies: version: 1, author: Darcy Margetts, Auto-updates disabled
Corner: version: 3.1.1, author: CSSIgniter, Auto-updates disabled
eskobear: version: 1.0.0, author: Greg Misener, Auto-updates disabled
Nozama: version: 1.9.3, author: CSSIgniter, Auto-updates disabled
Corner Child: version: 1.0, author: CSSIgniter, Auto-updates disabled
SWAAC (fork of Twenty Twelve): version: 1.1, author: the WordPress team, Auto-updates disabled
Thematic: version: 1.0.4, author: Ian Stewart & Chris Goßmann, Auto-updates disabled
Twenty Fifteen: version: 2.8, author: the WordPress team, Auto-updates disabled
Twenty Nineteen: version: 1.9, author: the WordPress team, Auto-updates disabled
Twenty Seventeen: version: 2.5, author: the WordPress team, Auto-updates disabled
Twenty Sixteen: version: 2.3, author: the WordPress team, Auto-updates disabled
Twenty Twenty-One: version: 1.1, author: the WordPress team, Auto-updates disabled

wp-mu-plugins (41)

Admin Notice on Inactive Site µ: author: (undefined), version: 1.0.0 Allow CSV µ: author: (undefined), version: 1.0.0 Brandon U Admin Bar: author: (undefined), version: (undefined) Brandon U Disable Comments: author: (undefined), version: (undefined) Brandon U Menus: author: (undefined), version: (undefined) Brandon U User Lockout: author: (undefined), version: (undefined) Brandon U User Site List: author: (undefined), version: (undefined) Brandon U WP-Login Tweaks (MU): author: (undefined), version: (undefined) bu-events-settings.php: author: (undefined), version: (undefined) BU Local IP Range µ: author: (undefined), version: 1.0.0 debug.php: author: (undefined), version: (undefined) Dump (debugging tool): author: (undefined), version: (undefined) Eggplant That Up: author: (undefined), version: (undefined) ep.php: author: (undefined), version: (undefined) GF Bambora Network Settings (µ): author: (undefined), version: 1.0.0 GF Remove CC µ: author: (undefined), version: 1.0.0 GF Unique Order Number µ: author: (undefined), version: 1.0.0 Global Terms Turn On: author: (undefined), version: (undefined) Gravity Forms Total Field Conditional Logic µ: author: (undefined), version: 1.0.0 Gravity Forms Validation Message µ: author: (undefined), version: 1.0.0 Live Chat user level µ: author: (undefined), version: (undefined) MCE Lockdown µ: author: (undefined), version: 1.1.0 p3-profiler.php: author: (undefined), version: (undefined) PJ Rest Restrictions: author: (undefined), version: (undefined) Remove Quick Edit µ: author: (undefined), version: 1.0.0 restricted-admin.php: author: (undefined), version: (undefined) Restrict Env Display µ: author: (undefined), version: 1.0.0 Shortened Site Title: author: (undefined), version: (undefined) simple-history-disable-rss.php: author: (undefined), version: (undefined) Site Health local settings µ: author: (undefined), version: 1.0.0 Sitemap Shortcode µ: author: (undefined), version: 1.0.0 SSH FS checker: author: (undefined), version: 
(undefined) ssl-on-local.php: author: (undefined), version: (undefined) SSL WordPress: author: (undefined), version: (undefined) wp-mail.php: author: (undefined), version: (undefined) wp-search.php: author: (undefined), version: (undefined) µ Edit Flow Extensions: author: (undefined), version: (undefined) µ Gravity Forms CSS/HTML5 force settings: author: (undefined), version: 1.0.0 µ Gravity Forms Export Form Filename: author: (undefined), version: 1.0.0 µ MiraPay Finder: author: (undefined), version: 1.0.0 µ Sort Sites: author: (undefined), version: 1.0.0

wp-plugins-active (16)

Advanced Editor Tools (previously TinyMCE Advanced): version: 5.6.0, author: Automattic, Auto-updates disabled
Brandon U Customizer for Editors: author: (undefined), version: 1.0.0, Auto-updates disabled
Brandon U ElasticPress Extensions: author: (undefined), version: 1.0.0, Auto-updates disabled
Brandon U Navigation: version: 0.1, author: Patrick Johanneson, Auto-updates disabled
Brandon U Office Hours: author: (undefined), version: 1.0.0, Auto-updates disabled
Brandon U REST Multisite Tools: author: (undefined), version: 1.0.0, Auto-updates disabled
CMB2: version: 2.7.0, author: CMB2 team, Auto-updates disabled
Display Environment Type: version: 1.3, author: Roy Tanck, Auto-updates disabled
ElasticPress: version: 3.5.1, author: 10up, Auto-updates disabled
Gravity Forms: version: 2.4.19, author: Gravity Forms, Auto-updates disabled
User Switching: version: 1.5.6, author: John Blackbourn & contributors, Auto-updates disabled
User Switching in Admin Bar: version: 1.2, author: Dražen Bebić, Auto-updates disabled
WordPress Importer: version: 0.7, author: wordpressdotorg, Auto-updates disabled
WP-REST-API V2 Menus: version: 0.8, author: Claudio La Barbera, Auto-updates disabled
WP Gatsby: version: 0.9.1, author: GatsbyJS, Jason Bahl, Tyler Barnes, Auto-updates disabled
WP GraphQL: version: 1.0.3, author: WPGraphQL (latest version: 1.1.2), Auto-updates disabled

wp-plugins-inactive (100)

Admin Page Framework - Loader: version: 3.8.25, author: Michael Uno, Auto-updates disabled Akismet Anti-Spam: version: 4.1.7, author: Automattic (latest version: 4.1.8), Auto-updates disabled All-in-One Event Calendar by Time.ly: version: 2.6.8, author: Time.ly Network Inc., Auto-updates disabled Autoptimize: version: 2.8.1, author: Frank Goossens (futtta), Auto-updates disabled avb: author: (undefined), version: (undefined), Auto-updates disabled Blubrry PowerPress: version: 8.4.6, author: Blubrry (latest version: 8.4.7), Auto-updates disabled Books: version: 0.1.0, author: YOUR NAME HERE, Auto-updates disabled Border Control: version: 1.0.4, author: SMILE, Auto-updates disabled Brandon U - MIME Types: author: (undefined), version: 1.1.0, Auto-updates disabled Brandon U Admissions Tools: author: Greg Misener, version: (undefined), Auto-updates disabled Brandon U Asides: author: (undefined), version: 1.2.0, Auto-updates disabled Brandon U Cache Plugins: author: (undefined), version: 1.0.0, Auto-updates disabled Brandon U CJNS Issue List: author: (undefined), version: 1.0.0, Auto-updates disabled Brandon U Columns: version: 0.1, author: Brandon University (original by Konstantin Kovshenin), Auto-updates disabled Brandon U Conference Theme Extensions: version: 1.1.0, author: Greg Misener and Patrick Johanneson, Auto-updates disabled Brandon U Contact Info: author: (undefined), version: 1.0.0, Auto-updates disabled Brandon U Content Security Policy (CSP) Tools: author: (undefined), version: 1.0.0, Auto-updates disabled Brandon U Department Promotion: version: 1.0.0, author: Greg Misener, Auto-updates disabled Brandon U Event Extension: author: (undefined), version: 1.2.0, Auto-updates disabled Brandon U Events Remote Calendar: author: (undefined), version: 1.1.0, Auto-updates disabled Brandon U Event Tools: version: 1.0.1, author: Patrick Johanneson, Auto-updates disabled Brandon U FreshService: author: (undefined), version: 1.0.0, Auto-updates disabled Brandon U 
Global Sitemap: author: (undefined), version: (undefined), Auto-updates disabled Brandon U Gravity Form Extensions: author: (undefined), version: 1.0.0, Auto-updates disabled Brandon U Gutenberg Mods: author: (undefined), version: 1.0.0, Auto-updates disabled Brandon U Homecoming Tools: version: 1.1.0, author: Patrick Johanneson, Auto-updates disabled Brandon U Icon Bar: version: 1.0.0, author: Greg Misener, Auto-updates disabled Brandon U Jobs: author: (undefined), version: 1.0.0, Auto-updates disabled Brandon U LDAP Page Protection: author: (undefined), version: 1.1.0, Auto-updates disabled Brandon U Libcal Loader: version: 1.1.0, author: Patrick Johanneson, Auto-updates disabled Brandon U Library Tools: version: 1.0.0, author: Greg Misener, Auto-updates disabled Brandon U Menu Icons: version: 1.0.0, author: Greg Misener, Auto-updates disabled Brandon U Meteor Slides: author: (undefined), version: (undefined), Auto-updates disabled Brandon U Nav Walker: author: (undefined), version: 1.0.0, Auto-updates disabled Brandon U Oasis: author: (undefined), version: 1.0.0, Auto-updates disabled Brandon U Person CPT: author: (undefined), version: 1.1.0, Auto-updates disabled Brandon U Phone Numbers: author: (undefined), version: 1.0.0, Auto-updates disabled Brandon U Research Connection CPTs: version: 1.0.0, author: Greg Misener, Auto-updates disabled Brandon U Residence Calculator: author: (undefined), version: 1.0.0, Auto-updates disabled Brandon U Safety App: author: (undefined), version: 1.0.0, Auto-updates disabled Brandon U Safety Notices Display: author: (undefined), version: 1.0.0, Auto-updates disabled Brandon U Services: author: (undefined), version: 1.0.0, Auto-updates disabled Brandon U Table Magic: author: (undefined), version: (undefined), Auto-updates disabled Brandon U Taxonomy Meta (test): author: (undefined), version: 1.0.0, Auto-updates disabled Brandon U Tuition Calculator: author: (undefined), version: 2.0.0, Auto-updates disabled Brandon U Usernames: 
author: (undefined), version: 1.0.0, Auto-updates disabled Brandon U WooCommerce Bambora Fields: author: (undefined), version: 1.0.0, Auto-updates disabled Brandon U WooCommerce Modifications: author: (undefined), version: 1.0.0, Auto-updates disabled Brandon U XML Calendar Snarfer: author: (undefined), version: 1.0.0, Auto-updates disabled Broken Link Checker: version: 1.11.15, author: WPMU DEV, Auto-updates disabled BU Login (JS): author: (undefined), version: 1.0.0, Auto-updates disabled BU MiraPay: version: 2.0.1, author: Patrick Johanneson (Brandon University), Auto-updates disabled BU Parking Rates: version: 1.1.0, author: Greg Misener (and Patrick Johanneson, probably), Auto-updates disabled Canada Post Shipping For WooCommerce: version: 2.9.1, author: Small Fish Analytics Inc. (latest version: 2.9.2), Auto-updates disabled Classic Editor: version: 1.6, author: WordPress Contributors, Auto-updates disabled Custom Meta Boxes: version: 1.0.3, author: Human Made Limited, Auto-updates disabled Debug Bar: version: 1.1.2, author: wordpressdotorg, Auto-updates disabled Debug Bar ElasticPress: version: 1.4, author: 10up, Auto-updates disabled Duo Two-Factor Authentication: version: 2.5.7, author: Duo Security, Auto-updates disabled Edit Flow: version: 0.9.6, author: Daniel Bachhuber, Scott Bressler, Mohammad Jangda, Automattic, and others, Auto-updates disabled Email Address Encoder: version: 1.0.22, author: Till Krüss, Auto-updates disabled Eskobear Customizer Mods: author: (undefined), version: 1.0.0, Auto-updates disabled Fast User Switching: version: 1.4.9, author: Tikweb, Auto-updates disabled GP Conditional Pricing: version: 1.2.41, author: Gravity Wiz, Auto-updates disabled GP Limit Dates: version: 1.0.20, author: Gravity Wiz, Auto-updates disabled Gravity Forms Bambora (North America) Gateway (Advanced): version: 1.0.0, author: WP Gateways, Auto-updates disabled Gravity Forms Bambora Payments: author: (undefined), version: 1.1.0, Auto-updates disabled 
Gravity Forms CLI: version: 1.4, author: Rocketgenius, Auto-updates disabled Gravity Forms Simple Add-On: version: 2.1, author: Rocketgenius, Auto-updates disabled Gravity Perks: version: 2.1.9, author: Gravity Wiz, Auto-updates disabled Hello Dolly: version: 1.7.2, author: Matt Mullenweg, Auto-updates disabled HTTP Headers: version: 1.18.1, author: Dimitar Ivanov, Auto-updates disabled HTTP headers to improve web site security: version: 2.5.6, author: Carl Conrad, Auto-updates disabled Jetpack by WordPress.com: version: 9.2.1, author: Automattic (latest version: 9.3), Auto-updates disabled last updated: version: 2.1, author: Martin Wudenka, Auto-updates disabled Load HTML Page: author: (undefined), version: (undefined), Auto-updates disabled MapMyFitness: version: 0.1, author: MapMyFitness, Auto-updates disabled MCE Table Buttons: version: 3.3, author: Jake Goldman, 10up, Oomph, Auto-updates disabled Monkeyman Rewrite Analyzer: version: 1.0, author: Jan Fabry, Auto-updates disabled Network Plugin Auditor: version: 1.10.1, author: Katherine Semel, Auto-updates disabled Nozama Essentials: version: 1.1.0, author: The CSSIgniter Team, Auto-updates disabled Oasis Workflow Groups: version: 1.5, author: Nugget Solutions Inc., Auto-updates disabled Oasis Workflow Pro: version: 7.2, author: Nugget Solutions Inc., Auto-updates disabled Order / Coupon / Subscription Export Import Plugin for WooCommerce (BASIC): version: 1.7.2, author: WebToffee, Auto-updates disabled Order Delivery Date for WooCommerce (Lite version): version: 3.11.4, author: Tyche Softwares (latest version: 3.11.5), Auto-updates disabled Page Menu Editor: version: 3.1.0, author: Patrick Johanneson (forked from work by Sarah Anderson), Auto-updates disabled Pre-Publish Checklist: version: 1.1.1, author: Brainstorm Force, Auto-updates disabled PublishPress Checklists: version: 2.4.2, author: PublishPress, Auto-updates disabled Simple LDAP Login: version: 1.6.0, author: Clif Griffin Development Inc., 
Auto-updates disabled Simple Tableau Viz: version: 2.0, author: Gary Hukkeri, Auto-updates disabled Story Custom Post Type: version: 1.0.0, author: Patrick Johanneson, Auto-updates disabled Tabby Responsive Tabs: version: 1.2.3, author: cubecolour, Auto-updates disabled The Events Calendar: version: 5.3.1, author: Modern Tribe, Inc. (latest version: 5.3.1.1), Auto-updates disabled User Role Editor: version: 4.57.1, author: Vladimir Garagulya (latest version: 4.58.1), Auto-updates disabled WooCommerce: version: 4.8.0, author: Automattic (latest version: 4.9.0), Auto-updates disabled WooCommerce Bambora Gateway: version: 2.3.2, author: SkyVerge, Auto-updates disabled WooCommerce PDF Product Vouchers: version: 3.8.1, author: SkyVerge, Auto-updates disabled WP Google Maps: version: 8.0.23, author: WP Google Maps (latest version: 8.1.3), Auto-updates disabled WP Super Cache: version: 1.7.1, author: Automattic, Auto-updates disabled Yoast Duplicate Post: version: 3.2.6, author: Enrico Battocchi & Team Yoast (latest version: 4.0.1), Auto-updates disabled

wp-media

image_editor: WP_Image_Editor_Imagick
imagick_module_version: 1690
imagemagick_version: ImageMagick 6.9.10-23 Q16 x86_64 20190101 https://imagemagick.org
file_uploads: File uploads is turned off
post_max_size: 8M
upload_max_filesize: 2M
max_effective_size: 2 MB
max_file_uploads: 20
imagick_limits:
  imagick::RESOURCETYPE_AREA: 122 MB
  imagick::RESOURCETYPE_DISK: 1073741824
  imagick::RESOURCETYPE_FILE: 6144
  imagick::RESOURCETYPE_MAP: 512 MB
  imagick::RESOURCETYPE_MEMORY: 256 MB
  imagick::RESOURCETYPE_THREAD: 1
gd_version: 2.2.5
ghostscript_version: 9.50

wp-server

server_architecture: Linux 5.4.0-53-generic x86_64
httpd_software: Apache/2.4.41 (Ubuntu)
php_version: 7.4.3 64bit
php_sapi: apache2handler
max_input_variables: 1000
time_limit: 30
memory_limit: 512M
max_input_time: 60
upload_max_filesize: 2M
php_post_max_size: 8M
curl_version: 7.68.0 OpenSSL/1.1.1f
suhosin: false
imagick_availability: true
pretty_permalinks: true
htaccess_extra_rules: true

wp-database

extension: mysqli
server_version: 8.0.22-0ubuntu0.20.04.3
client_version: mysqlnd 7.4.3

wp-constants

WP_HOME: undefined
WP_SITEURL: undefined
WP_CONTENT_DIR: /srv/www/vhosts/brandonu.local/wp-content
WP_PLUGIN_DIR: /srv/www/vhosts/brandonu.local/wp-content/plugins
WP_MAX_MEMORY_LIMIT: 512M
WP_DEBUG: true
WP_DEBUG_DISPLAY: false
WP_DEBUG_LOG: true
SCRIPT_DEBUG: true
WP_CACHE: false
CONCATENATE_SCRIPTS: false
COMPRESS_SCRIPTS: undefined
COMPRESS_CSS: undefined
WP_LOCAL_DEV: undefined
DB_CHARSET: utf8mb4
DB_COLLATE: undefined

wp-filesystem

wordpress: not writable
wp-content: not writable
uploads: writable
plugins: not writable
themes: not writable
mu-plugins: not writable


brandwaffle commented 3 years ago

@pjohanneson it does look like you can apply the asciifolding analyzer to ignore accent marks, but that could cause confusion in certain circumstances. For example, el and él mean two different things. If you're looking to ignore the diacritical marks, I would recommend adding the analyzer via a filter. If you just want to make sure certain words match (especially ones in a language other than the language you have set for EP), you might consider just adding them as synonyms. Let me know your thoughts on either approach!
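To make the trade-off concrete, here is a tiny illustration of what folding does to a term. It is only a stand-in: the function name example_fold_ascii and its small character map are hypothetical, and this is not how Elasticsearch implements asciifolding.

```php
<?php
/**
 * Illustrative stand-in for ASCII folding (hypothetical helper, not Elasticsearch code).
 * Replaces a handful of accented lowercase letters with their unaccented forms.
 *
 * @param string $text UTF-8 text.
 * @return string Text with the mapped characters folded.
 */
function example_fold_ascii( $text ) {
    // Small, deliberately incomplete map chosen for this example only.
    $map = array(
        'á' => 'a',
        'é' => 'e',
        'í' => 'i',
        'ó' => 'o',
        'ú' => 'u',
        'ü' => 'u',
        'ñ' => 'n',
    );

    return strtr( $text, $map );
}
```

With this helper, piñata and pinata fold to the same string, which is the behavior requested in the issue. The same is true of él and el, which is the kind of collision described above.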

pjohanneson commented 3 years ago

I would recommend adding the analyzer via a filter.

I've been trying to do that, but I haven't managed to find the right filter. Is there a tutorial somewhere?

If you just want to make sure certain words match (especially ones in a language other than the language you have set for EP), you might consider just adding them as synonyms.

I'd like this to be as automated as possible, as we have a lot of different editors of varying technical ability. The vast majority of our site is in English, so other languages shouldn't pose a problem.

pjohanneson commented 3 years ago

I've found a solution:

add_filter( 'ep_config_mapping', function( $mappings ) {
    $mappings['settings']['analysis']['analyzer']['default']['filter'][] = 'asciifolding';
    return $mappings;
} );

This works, but please let me know if you think it's overly broad. (If you think it's fine, please close the issue.)

felipeelia commented 3 years ago

Hi @pjohanneson,

Actually, that is a very good solution. Thanks for sharing!

tomisko6677 commented 3 years ago

Hi @pjohanneson. I would like to ask where exactly I should add this filter definition. I have the same issue with accent-insensitive search and I cannot find a way to solve it. Thank you.

pjohanneson commented 3 years ago

Hi @tomisko6677,

I added the filter in a plugin I was developing, which is activated on the WordPress site. You can also put the code snippet into your active theme's functions.php file, but remember, if you ever switch themes, you'll need to copy that code snippet to the new theme's functions.php.
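As a rough sketch of the plugin route, the file below could be saved in its own folder under wp-content/plugins and activated. The plugin name and the function name example_ep_add_asciifolding are hypothetical; the ep_config_mapping hook and the array path are the ones used in the snippet earlier in this thread. The duplicate check and the function_exists guard are additions for this example.

```php
<?php
/**
 * Plugin Name: EP ASCII Folding (example)
 * Description: Hypothetical example plugin that adds the asciifolding token filter to ElasticPress's default analyzer.
 */

/**
 * Append 'asciifolding' to the default analyzer's filter list.
 *
 * @param array $mapping Settings and mappings passed to the ep_config_mapping filter.
 * @return array Modified settings and mappings.
 */
function example_ep_add_asciifolding( $mapping ) {
    $filters = $mapping['settings']['analysis']['analyzer']['default']['filter'] ?? array();

    // Avoid adding the filter twice if other code already registered it.
    if ( ! in_array( 'asciifolding', $filters, true ) ) {
        $filters[] = 'asciifolding';
    }

    $mapping['settings']['analysis']['analyzer']['default']['filter'] = $filters;

    return $mapping;
}

// Register the hook only when WordPress is loaded, so the file can also be run standalone.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'ep_config_mapping', 'example_ep_add_asciifolding' );
}
```

Because this changes index analysis settings, it will most likely only take effect after the index is recreated and the content is synced again.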

tomisko6677 commented 3 years ago

Thank you so much @pjohanneson, I will try it.

gregsullivan commented 3 years ago

In case this is helpful for anyone else, I found asciifolding only worked when added before the ewp_snowball filter. (The code snippet above adds it as the last filter.) I'm not sure how my client's setup varies from the ones above where the snippet worked—possibly a different Elasticsearch version?
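A minimal sketch of that ordering, assuming the default analyzer's filter list contains an entry named ewp_snowball as described above. The function name is hypothetical, and the fallback to appending when ewp_snowball is missing is a choice made for this example. Token filters run in the order listed, which is presumably why position can matter.

```php
<?php
/**
 * Insert 'asciifolding' immediately before 'ewp_snowball' in the default analyzer.
 *
 * @param array $mapping Settings and mappings passed to the ep_config_mapping filter.
 * @return array Modified settings and mappings.
 */
function example_ep_asciifolding_before_snowball( $mapping ) {
    $filters = $mapping['settings']['analysis']['analyzer']['default']['filter'] ?? array();

    // Drop any existing occurrence so the filter ends up in exactly one position.
    $filters = array_values( array_diff( $filters, array( 'asciifolding' ) ) );

    $position = array_search( 'ewp_snowball', $filters, true );

    if ( false === $position ) {
        // No snowball filter found: fall back to appending (assumption for this sketch).
        $filters[] = 'asciifolding';
    } else {
        // Insert without removing anything, shifting ewp_snowball one slot later.
        array_splice( $filters, $position, 0, array( 'asciifolding' ) );
    }

    $mapping['settings']['analysis']['analyzer']['default']['filter'] = $filters;

    return $mapping;
}

// Register the hook only when WordPress is loaded.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'ep_config_mapping', 'example_ep_asciifolding_before_snowball' );
}
```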

felipeelia commented 3 years ago

That is interesting, @gregsullivan. We published an article today that addresses the same problem in a slightly different way: https://www.elasticpress.io/documentation/article/how-to-search-for-words-with-special-characters/

It will create new fields and add them to the search, so if you have a chance to try that and let us know how it goes, I'd appreciate it. Thanks in advance!

gregsullivan commented 3 years ago

Thanks for posting that link @felipeelia! Would you recommend one approach over the other? I understand that the approach described in this issue is broader, but that works for my use case. Are there performance concerns I might not be considering?

felipeelia commented 3 years ago

If I had to decide, @gregsullivan, I'd go with the snippet from the article. The solution posted earlier in this issue is too broad and would affect all fields. Although that probably won't break anything, with the solution we have in that article, we would be (1) creating a new field (and keeping the original as is), and (2) affecting only some specific fields.
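The general shape of a narrower approach can be sketched as follows. This is not the code from the linked article. The analyzer name example_folding, the sub-field name folded, and the assumption that post_content lives at mappings.properties.post_content are all hypothetical and may not match a given ElasticPress or Elasticsearch version. The custom analyzer type, the standard tokenizer, the lowercase and asciifolding filters, and multi-fields are standard Elasticsearch features.

```php
<?php
/**
 * Add a separate, folded sub-field for post_content and leave the original field untouched.
 *
 * @param array $mapping Settings and mappings passed to the ep_config_mapping filter.
 * @return array Modified settings and mappings.
 */
function example_ep_add_folded_content_field( $mapping ) {
    // Hypothetical analyzer name, built from standard Elasticsearch components.
    $mapping['settings']['analysis']['analyzer']['example_folding'] = array(
        'type'      => 'custom',
        'tokenizer' => 'standard',
        'filter'    => array( 'lowercase', 'asciifolding' ),
    );

    // Assumption: post_content is mapped at mappings.properties.post_content.
    if ( isset( $mapping['mappings']['properties']['post_content'] ) ) {
        // Multi-field: same source text, indexed a second time with the folding analyzer.
        $mapping['mappings']['properties']['post_content']['fields']['folded'] = array(
            'type'     => 'text',
            'analyzer' => 'example_folding',
        );
    }

    return $mapping;
}

// Register the hook only when WordPress is loaded.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'ep_config_mapping', 'example_ep_add_folded_content_field' );
}
```

The sketch only covers indexing. The new sub-field would still have to be added to the fields that are searched, which is not shown here.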

gregsullivan commented 3 years ago

That makes sense, thanks! In our case, we want to apply this to every searchable field, so it's working well for now. I'll keep an eye on it and double back to today's article if we encounter any issues. Thanks again!