alphagov / govuk-knowledge-graph-gcp

GOV.UK content data and cloud infrastructure for the GovSearch app.
https://docs.data-community.publishing.service.gov.uk/tools/govgraph/
MIT License

Use a different analyser for full-text indexes #287

Closed. nacnudus closed this issue 1 year ago.

nacnudus commented 1 year ago

We believe that most users expect and need a search that:

Available analysers

From `CALL db.index.fulltext.listAvailableAnalyzers;`, with the stopwords truncated.
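For reference, a query along these lines produces the listing below; the `YIELD` columns match the three headings in the output.

```cypher
// List every analyzer available to full-text indexes,
// with its description and stop-word list.
CALL db.index.fulltext.listAvailableAnalyzers
YIELD analyzer, description, stopwords
RETURN analyzer, description, stopwords
ORDER BY analyzer;
```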

| analyzer | description | stopwords |
| --- | --- | --- |
| standard-folding | Analyzer that uses ASCIIFoldingFilter to remove accents (diacritics). Otherwise behaves as standard english analyzer. Note! This Analyzer may have unexpected behaviour, such as tokenizing, for all non ASCII numbers and symbols. | [but, be, with… |
| lithuanian | Lithuanian analyzer with stemming and stop word filtering. | [judviejų, to… |
| simple | A simple analyzer that tokenizes at non-letter boundaries. No stemming or filtering. Works okay for most European languages, but is terrible for languages where words are not separated by spaces, such as many Asian languages. | [] |
| latvian | Latvian analyzer with stemming and stop word filtering. | [varēšu, pār… |
| cjk | CJK - Chinese/Japanese/Korean - analyzer. Terms are normalised and case-folded. Produces bi-grams, and filters out stop words. | [but, be, with… |
| sorani | Sorani Kurdish analyzer with stemming and stop word filtering. | [دەکات, لێ, ئە… |
| stop | Stop analyzer tokenizes at non-letter characters, and filters out English stop words. This differs from the 'classic' and 'standard' analyzers in that it makes no effort to recognize special terms, like likely product names, URLs or email addresses. | [but, be, with… |
| indonesian | Indonesian analyzer with stemming and stop word filtering. | [entahlah, ata… |
| keyword | Keyword analyzer tokenizes the text as a single term. Useful for zip-codes, ids, etc. Situations where complete and exact matches are desired. | [] |
| arabic | Arabic analyzer with light stemming, as specified by "Light Stemming for Arabic Information Retrieval". | [فان, او, اى, اي… |
| standard | The standard analyzer. Tokenizes on non-letter and filters out English stop words and punctuation. Does no stemming, but takes care to keep likely product names, URLs and email addresses as single terms. | [but, be, with… |
| galician | Galician analyzer with stemming and stop word filtering. | [deste, aos, m… |
| german | German analyzer with stemming and stop word filtering. | [denn, daß, mu… |
| unicode_whitespace | Breaks text into terms by characters that have the unicode WHITESPACE property. | [] |
| bulgarian | Bulgarian analyzer with light stemming, as specified by "Searching Strategies for the Bulgarian Language", and stop word filtering. | [както, над, дор… |
| brazilian | Brazilian Portuguese analyzer with stemming and stop word filtering. | [tua, deste, a… |
| url | Tokenizes into sequences of alpha-numeric, numeric, URL, email, southeast asian terms, and into terms of individual ideographic and hiragana characters. English stop words are filtered out. | [but, be, with… |
| basque | Basque analyzer with stemming and stop word filtering. | [beste, araber… |
| english | English analyzer with stemming and stop word filtering. | [but, be, with… |
| irish | Irish analyzer with stemming and stop word filtering. | [seacht, b', t… |
| portuguese | Portuguese analyzer with stemming and stop word filtering. | [tua, tenho, t… |
| url_or_email | Tokenizes into sequences of alpha-numeric, numeric, URL, email, southeast asian terms, and into terms of individual ideographic and hiragana characters. English stop words are filtered out. | [but, be, with… |
| finnish | Finnish analyzer with stemming and stop word filtering. | [olisimme, he… |
| hindi | Hindi analyzer with stemming, normalization, and stop word filtering. | [एस, उन्हीं, ऐसे… |
| standard-no-stop-words | The default analyzer. Similar to the 'standard' analyzer, but filters no stop words. Tokenizes on non-letter boundaries and filters out punctuation. Does no stemming, but takes care to keep likely product names, URLs and email addresses as single terms. | [] |
| catalan | Catalan analyzer with stemming and stop word filtering. | [s'han, sembla… |
| danish | Danish analyzer with stemming and stop word filtering. | [sådan, de, di… |
| norwegian | Norwegian analyzer with stemming and stop word filtering. | [hun, båe, hvi… |
| russian | Russian analyzer with stemming and stop word filtering. | [еще, него, бы… |
| thai | Thai analyzer with stop word filtering. It relies on the Java built-in localization support for the Thai locale in order to break apart and tokenize words, which might not be available depending on Java version and JRE vendor. | [ให้, แล้ว, เพรา… |
| dutch | Dutch analyzer with stemming and stop word filtering. | [] |
| spanish | Spanish analyzer with stemming and stop word filtering. | [mucho, tuvies… |
| swedish | Swedish analyzer with stemming and stop word filtering. | [då, deras, hu… |
| cypher | An analyzer that is compatible with Cypher semantics for CONTAINS and ENDS WITH statements. | [] |
| armenian | Armenian analyzer with stemming and stop word filtering. | [մի, այն, ու… |
| greek | Greek analyzer with stemming and stop word filtering. | [αλλα, ειναι… |
| romanian | Romanian analyzer with stemming and stop word filtering. | [lui, fie, fii… |
| hungarian | Hungarian analyzer with stemming and stop word filtering. | [újabb, most… |
| classic | Classic Lucene analyzer. Similar to 'standard', but with worse unicode support. | [but, be, with… |
| turkish | Turkish analyzer with stemming and stop word filtering. | [bile, birşeyi… |
| email | Tokenizes into sequences of alpha-numeric, numeric, URL, email, southeast asian terms, and into terms of individual ideographic and hiragana characters. English stop words are filtered out. | [but, be, with… |
| italian | Italian analyzer with stemming and stop word filtering. | [lui, stavi, d… |
| czech | Czech analyzer with stemming and stop word filtering. | [proč, pouze… |
| french | French analyzer with stemming and stop word filtering. | [lui, sera, da… |
| whitespace | Breaks text into terms by characters that are considered "Java whitespace". | [] |
| persian | Persian analyzer. Tokenizes with zero-width non-joiner characters in addition to whitespace. Persian-specific variants, such as the farsi 'yeh' and 'keheh', are standardized. Simple stemming is accomplished via stop words. | [هايي, قابل, ن… |
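For context, the analyzer is fixed per index at creation time, so comparing analyzers means dropping and recreating the index. A minimal sketch in Neo4j 4.4+ syntax; the index, label, and property names here are hypothetical placeholders, not the ones used in this repository:

```cypher
// Hypothetical names throughout (pageTitleIndex, Page, title, text).
// The analyzer is chosen via the index's configuration and cannot be
// changed afterwards without recreating the index.
CREATE FULLTEXT INDEX pageTitleIndex IF NOT EXISTS
FOR (p:Page) ON EACH [p.title, p.text]
OPTIONS {
  indexConfig: {
    `fulltext.analyzer`: 'english'
  }
};
```

Swapping `'english'` for another value from the table is how the alternatives were compared.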
maxf commented 1 year ago

The problem with `keyword` and `cypher` is that using them causes an error when building the index. As far as I've tested, `english` works best, but it obviously performs badly on non-English terms.

nacnudus commented 1 year ago

https://github.com/neo4j/neo4j/issues/12979

nacnudus commented 1 year ago

There isn't a suitable analyser, and we're migrating the govgraphsearch backend to BigQuery, so this is no longer important.