alphagov / govuk-knowledge-graph-gcp

GOV.UK content data and cloud infrastructure for the GovSearch app.
https://docs.data-community.publishing.service.gov.uk/tools/govgraph/
MIT License

Use a different analyser for full-text indexes #287

Closed. nacnudus closed this issue 1 year ago.

nacnudus commented 1 year ago

We believe that most users expect and need a search that:

Available analysers

From `CALL db.index.fulltext.listAvailableAnalyzers;`, with the stopwords truncated.
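For reference, a query along these lines produces the listing below; the `YIELD` columns match the three headings in the output.

```cypher
// List every analyzer available to full-text indexes,
// with its description and stop-word list.
CALL db.index.fulltext.listAvailableAnalyzers
YIELD analyzer, description, stopwords
RETURN analyzer, description, stopwords
ORDER BY analyzer;
```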

| analyzer | description | stopwords |
| --- | --- | --- |
| standard-folding | Analyzer that uses ASCIIFoldingFilter to remove accents (diacritics). Otherwise behaves as standard english analyzer. Note! This Analyzer may have unexpected behaviour, such as tokenizing, for all non ASCII numbers and symbols. | [but, be, with… |
| lithuanian | Lithuanian analyzer with stemming and stop word filtering. | [judviejų, to… |
| simple | A simple analyzer that tokenizes at non-letter boundaries. No stemming or filtering. Works okay for most European languages, but is terrible for languages where words are not separated by spaces, such as many Asian languages. | [] |
| latvian | Latvian analyzer with stemming and stop word filtering. | [varēšu, pār… |
| cjk | CJK - Chinese/Japanese/Korean - analyzer. Terms are normalised and case-folded. Produces bi-grams, and filters out stop words. | [but, be, with… |
| sorani | Sorani Kurdish analyzer with stemming and stop word filtering. | [دەکات, لێ, ئە… |
| stop | Stop analyzer tokenizes at non-letter characters, and filters out English stop words. This differs from the 'classic' and 'standard' analyzers in that it makes no effort to recognize special terms, like likely product names, URLs or email addresses. | [but, be, with… |
| indonesian | Indonesian analyzer with stemming and stop word filtering. | [entahlah, ata… |
| keyword | Keyword analyzer tokenizes the text as a single term. Useful for zip-codes, ids, etc. Situations where complete and exact matches are desired. | [] |
| arabic | Arabic analyzer with light stemming, as specified by "Light Stemming for Arabic Information Retrieval". | [فان, او, اى, اي… |
| standard | The standard analyzer. Tokenizes on non-letter and filters out English stop words and punctuation. Does no stemming, but takes care to keep likely product names, URLs and email addresses as single terms. | [but, be, with… |
| galician | Galician analyzer with stemming and stop word filtering. | [deste, aos, m… |
| german | German analyzer with stemming and stop word filtering. | [denn, daß, mu… |
| unicode_whitespace | Breaks text into terms by characters that have the unicode WHITESPACE property. | [] |
| bulgarian | Bulgarian analyzer with light stemming, as specified by "Searching Strategies for the Bulgarian Language", and stop word filtering. | [както, над, дор… |
| brazilian | Brazilian Portuguese analyzer with stemming and stop word filtering. | [tua, deste, a… |
| url | Tokenizes into sequences of alpha-numeric, numeric, URL, email, southeast asian terms, and into terms of individual ideographic and hiragana characters. English stop words are filtered out. | [but, be, with… |
| basque | Basque analyzer with stemming and stop word filtering. | [beste, araber… |
| english | English analyzer with stemming and stop word filtering. | [but, be, with… |
| irish | Irish analyzer with stemming and stop word filtering. | [seacht, b', t… |
| portuguese | Portuguese analyzer with stemming and stop word filtering. | [tua, tenho, t… |
| url_or_email | Tokenizes into sequences of alpha-numeric, numeric, URL, email, southeast asian terms, and into terms of individual ideographic and hiragana characters. English stop words are filtered out. | [but, be, with… |
| finnish | Finnish analyzer with stemming and stop word filtering. | [olisimme, he… |
| hindi | Hindi analyzer with stemming, normalization, and stop word filtering. | [एस, उन्हीं, ऐसे… |
| standard-no-stop-words | The default analyzer. Similar to the 'standard' analyzer, but filters no stop words. Tokenizes on non-letter boundaries and filters out punctuation. Does no stemming, but takes care to keep likely product names, URLs and email addresses as single terms. | [] |
| catalan | Catalan analyzer with stemming and stop word filtering. | [s'han, sembla… |
| danish | Danish analyzer with stemming and stop word filtering. | [sådan, de, di… |
| norwegian | Norwegian analyzer with stemming and stop word filtering. | [hun, båe, hvi… |
| russian | Russian analyzer with stemming and stop word filtering. | [еще, него, бы… |
| thai | Thai analyzer with stop word filtering. It relies on the Java built-in localization support for the Thai locale in order to break apart and tokenize words, which might not be available depending on Java version and JRE vendor. | [ให้, แล้ว, เพรา… |
| dutch | Dutch analyzer with stemming and stop word filtering. | [] |
| spanish | Spanish analyzer with stemming and stop word filtering. | [mucho, tuvies… |
| swedish | Swedish analyzer with stemming and stop word filtering. | [då, deras, hu… |
| cypher | An analyzer that is compatible with Cypher semantics for CONTAINS and ENDS WITH statements. | [] |
| armenian | Armenian analyzer with stemming and stop word filtering. | [մի, այն, ու… |
| greek | Greek analyzer with stemming and stop word filtering. | [αλλα, ειναι… |
| romanian | Romanian analyzer with stemming and stop word filtering. | [lui, fie, fii… |
| hungarian | Hungarian analyzer with stemming and stop word filtering. | [újabb, most… |
| classic | Classic Lucene analyzer. Similar to 'standard', but with worse unicode support. | [but, be, with… |
| turkish | Turkish analyzer with stemming and stop word filtering. | [bile, birşeyi… |
| email | Tokenizes into sequences of alpha-numeric, numeric, URL, email, southeast asian terms, and into terms of individual ideographic and hiragana characters. English stop words are filtered out. | [but, be, with… |
| italian | Italian analyzer with stemming and stop word filtering. | [lui, stavi, d… |
| czech | Czech analyzer with stemming and stop word filtering. | [proč, pouze… |
| french | French analyzer with stemming and stop word filtering. | [lui, sera, da… |
| whitespace | Breaks text into terms by characters that are considered "Java whitespace". | [] |
| persian | Persian analyzer. Tokenizes with zero-width non-joiner characters in addition to whitespace. Persian-specific variants, such as the farsi 'yeh' and 'keheh', are standardized. Simple stemming is accomplished via stop words. | [هايي, قابل, ن… |
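For context, the analyzer is fixed per index at creation time, so comparing analyzers means dropping and recreating the index. A minimal sketch in Neo4j 4.4+ syntax; the index, label, and property names here are hypothetical placeholders, not the ones used in this repository:

```cypher
// Hypothetical names throughout (pageTitleIndex, Page, title, text).
// The analyzer is chosen via the index's configuration and cannot be
// changed afterwards without recreating the index.
CREATE FULLTEXT INDEX pageTitleIndex IF NOT EXISTS
FOR (p:Page) ON EACH [p.title, p.text]
OPTIONS {
  indexConfig: {
    `fulltext.analyzer`: 'english'
  }
};
```

Swapping `'english'` for another value from the table is how the alternatives were compared.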
maxf commented 1 year ago

The problem with `keyword` and `cypher` is that using them causes an error when building the index. As far as I've tested, `english` works best, but it obviously performs badly on non-English terms.

nacnudus commented 1 year ago

https://github.com/neo4j/neo4j/issues/12979

nacnudus commented 1 year ago

There isn't a suitable analyser, and we're migrating the govgraphsearch backend to BigQuery, so this is no longer important.