TYPO3-Solr / ext-solr

A TYPO3 extension that integrates the Apache Solr search server with TYPO3 CMS. dkd Internet Service GmbH is developing the extension. Community contributions are welcome. See CONTRIBUTING.md for details.
GNU General Public License v3.0

mail addresses are split by tokenizer #511

Open timohund opened 8 years ago

timohund commented 8 years ago

Mail addresses in content, such as my.name@example.org, are split by the StandardTokenizerFactory into "my.name" and "example.org". According to http://unicode.org/reports/tr29/#Word_Boundaries, certain punctuation between letters (here the ".") is not a word boundary, whereas the "@" is. As a result, "my.name" and "example.org" show up as separate entries in the autocomplete suggestions.

Alternative tokenizer: UAX29URLEmailTokenizerFactory. It is similar to the StandardTokenizerFactory, but also recognizes URLs and mail addresses as single tokens. This way the complete mail address would show up in the suggestions, which is easier for website users to understand.
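As a rough sketch of what the tokenizer swap could look like in a Solr schema: the field type name `text_with_emails` and the lowercase filter are assumptions for illustration only and are not taken from the schema shipped with EXT:solr. The tokenizer class is the one named above.

```xml
<!-- Hypothetical field type; the name "text_with_emails" is illustrative only. -->
<fieldType name="text_with_emails" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Follows the same UAX#29 word-break rules as the standard tokenizer,
         but keeps URLs and mail addresses such as my.name@example.org
         as single tokens. -->
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <!-- Assumed filter: lowercases tokens so suggestions match case-insensitively. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Any other filters the existing field type uses would presumably need to be carried over so that the rest of the analysis chain stays unchanged.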

https://forge.typo3.org/issues/48990

timohund commented 8 years ago

Discussion:

1 Updated by Ingo Renner about 3 years ago

2 Updated by Bernhard Kraft over 2 years ago

You could configure an additional solr field and make it a copy of the "content" field. Then let this solr field get tokenized by the mentioned tokenizer and configure the suggest feature to take this solr field into account.

3 Updated by Jigal van Hemert over 2 years ago

That's a workaround. There is a suitable tokenizer available, so why not use it? That way all users of EXT:solr have understandable suggestions without having to tinker with the server configuration.
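For reference, the copy-field workaround discussed in this thread might look roughly like the following schema fragment. The field name `contentEmail` and the field type name `text_with_emails` are hypothetical, and only the source field `content` comes from the comment above. How the suggest feature is pointed at the new field is not shown, because that configuration is not described here.

```xml
<!-- Hypothetical field type that keeps mail addresses and URLs as whole tokens. -->
<fieldType name="text_with_emails" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical additional field that is analyzed with the type above. -->
<field name="contentEmail" type="text_with_emails" indexed="true" stored="false"/>

<!-- Copies the raw value of "content" into the new field at index time. -->
<copyField source="content" dest="contentEmail"/>
```

This leaves the analysis of the original "content" field untouched, at the cost of indexing the same text twice and requiring every installation to change its server configuration, which is the drawback raised in the comment above.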

timohund commented 8 years ago

Estimation:

Effort: 5 Value: 5