Stratio / cassandra-lucene-index

Lucene based secondary indexes for Cassandra
Apache License 2.0

How to search on fields without diacritics / accent? #317

Closed: smiklosovic closed this issue 6 years ago

smiklosovic commented 7 years ago

Hey @ealonsodb

Let's say that I have my name saved in Cassandra and it contains diacritics / accents: Štefan Miklošovič

I want to write a query that finds me in the DB even when I search for "Stefan Miklosovic". It seems to me that the only viable way to get a match is to type the accented characters, but that is very cumbersome when you have an application that is used internationally.

CREATE KEYSPACE IF NOT EXISTS test_keyspace 
    WITH replication = { 'class': 'SimpleStrategy', 'replication_factor' : 1};

CREATE TABLE IF NOT EXISTS test_keyspace.accent (
    item text,
    id uuid,
    name text,
    lucene text,
    PRIMARY KEY (item)
);

DROP INDEX IF EXISTS test_keyspace.test_keyspace_accent_idx;

CREATE CUSTOM INDEX IF NOT EXISTS test_keyspace_accent_idx ON test_keyspace.accent(lucene) 
USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = {
    'refresh_seconds': '1',
    'schema': '{
        default_analyzer: "english",
        fields: {
            name: {
                type: "string",
                case_sensitive: false
            }
        }
    }'
};

INSERT INTO test_keyspace.accent (item, id, name) 
    VALUES ( '1234',uuid(), 'Štefan Miklošovič');

-- finds one result as expected

SELECT * from test_keyspace.accent WHERE expr(test_keyspace_accent_idx, '
{
  "filter":[
    {
      "type":"boolean",
      "must":[
        {
          "type":"match",
          "field":"name",
          "value": "Štefan Miklošovič"
        }
      ]
    }
  ]
}
');

-- finds nothing when the accents are left out of the query

SELECT * from test_keyspace.accent WHERE expr(test_keyspace_accent_idx, '
{
  "filter":[
    {
      "type":"boolean",
      "must":[
        {
          "type":"match",
          "field":"name",
          "value": "Stefan Miklosovic"
        }
      ]
    }
  ]
}
');

-- finds nothing

SELECT * from test_keyspace.accent WHERE expr(test_keyspace_accent_idx, '
{
  "filter":[
    {
      "type":"boolean",
      "must":[
        {
          "type":"phrase",
          "field":"name",
          "value": "Štefan Miklošovič"
        }
      ]
    }
  ]
}
');

-- finds nothing

SELECT * from test_keyspace.accent WHERE expr(test_keyspace_accent_idx, '
{
  "filter":[
    {
      "type":"boolean",
      "must":[
        {
          "type":"wildcard",
          "field":"name",
         "value": "Štefan Miklošovič"
        }
      ]
    }
  ]
}
');

ealonsodb commented 7 years ago

Hi @smiklosovic:

You are using the English analyzer for non-English text. The caron is a diacritic that is not part of the English alphabet. Use the correct analyzer for your language.

smiklosovic commented 7 years ago

@ealonsodb

I do not know in advance what letters will be used in our application. If I chose an analyzer that supports the caron, I would miss other letters from different languages it does not support. Is there any way to support "everything"? How would you solve this?

ealonsodb commented 7 years ago

Hi @smiklosovic:

You are using a string mapper, which does not analyze the input text. You should use the text mapper instead.

There is no analyzer that can do everything. Each language has its own characteristics (alphabet, stopwords, delimiters...) and these sometimes conflict with each other: the term 'a' can be a stopword in English but a valid, meaningful word in another language.

You can create a mapper for every different language and perform searches against the whole bunch of mappers:

CREATE KEYSPACE IF NOT EXISTS test_keyspace 
    WITH replication = { 'class': 'SimpleStrategy', 'replication_factor' : 1};

CREATE TABLE IF NOT EXISTS test_keyspace.accent (
    item text,
    id uuid,
    name text,
    lucene text,
    PRIMARY KEY (item)
);

DROP INDEX IF EXISTS test_keyspace.test_keyspace_accent_idx;

CREATE CUSTOM INDEX IF NOT EXISTS test_keyspace_accent_idx ON test_keyspace.accent(lucene) 
USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = {
    'refresh_seconds': '1',
    'schema': '{
        fields: {
            name_english: { type: "text", case_sensitive: false, column: "name", analyzer: "english"},
            name_brazilian: { type: "text", case_sensitive: false, column: "name", analyzer: "brazilian"},
            name_finnish: { type: "text", case_sensitive: false, column: "name", analyzer: "finnish"},
            name_hungarian: { type: "text", case_sensitive: false, column: "name", analyzer: "hungarian"}
            ...
        }
    }'
};

SELECT * from test_keyspace.accent WHERE expr(test_keyspace_accent_idx, '{
    "filter":[
        {
            "type":"boolean",
            "should": [
                {
                    "type":"match",
                    "field":"name_english",
                    "value": "Štefan Miklošovič"
                },
                {
                    "type":"match",
                    "field":"name_brazilian",
                    "value": "Štefan Miklošovič"
                },
                {
                    "type":"match",
                    "field":"name_finnish",
                    "value": "Štefan Miklošovič"
                },
                {
                    "type":"match",
                    "field":"name_hungarian",
                    "value": "Štefan Miklošovič"
                },
                ...
            ]
        }
    ]
}');

With this approach you will have false positives: each analyzer applies its own language's stemming and stopword rules to text that may not be in that language, so one of the should clauses can match rows you did not intend.

Also, you can code your own analyzer, include it in the classpath and reference it with the classpath analyzer type.

Hope this helps

smiklosovic commented 7 years ago

@ealonsodb

Thanks!

Two questions.

a) Why would I have false positives?

b) What about computational efficiency? Don't I waste a lot of resources by specifying multiple analyzers? What is the overhead?

All I am basically asking for is to be able to match results regardless of accents, so that when I search for "Stefan Miklosovic", it gives me the record with "Štefan Miklošovič". Would that be possible?

ealonsodb commented 7 years ago

Hi @smiklosovic:

There is one way you can get this working. You need to develop a custom Analyzer built from a MappingCharFilter and a WhitespaceTokenizer.

With the MappingCharFilter you can replace any character with others.

For example, the mapping table for diacritics in spanish is:

á => a, é => e, í => i, ó => o, ú => u

With the WhitespaceTokenizer, the text is split into tokens on whitespace characters.
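
A minimal sketch of such an analyzer, assuming Lucene is on the classpath (the class name and the Spanish-only mapping table are just illustrative; extend the table for your languages):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class FoldingAnalyzer extends Analyzer {

    private static final NormalizeCharMap CHAR_MAP;

    static {
        // Illustrative mapping table for Spanish diacritics, as above
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("á", "a");
        builder.add("é", "e");
        builder.add("í", "i");
        builder.add("ó", "o");
        builder.add("ú", "u");
        CHAR_MAP = builder.build();
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Replace mapped characters before tokenization
        return new MappingCharFilter(CHAR_MAP, reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Split the already char-filtered text on whitespace
        return new TokenStreamComponents(new WhitespaceTokenizer());
    }
}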

You can read further instructions on how to generate and use a custom analyzer in #231. For now, you need to write code to use this custom analyzer feature; we are working on updating it so that custom analyzers can be defined in the index creation query.

This feature is also included in Elasticsearch, so its documentation on custom analyzers may help you understand the Lucene analysis pipeline better.

Also, you can always ask for consultancy services by writing to contact@stratio.com.

Regards

smiklosovic commented 6 years ago

Any progress on this, or am I still forced to follow the path mentioned in @ealonsodb's comment?

smiklosovic commented 6 years ago

@ealonsodb

This analyzer did the trick for me. ASCIIFoldingFilter converts characters to their ASCII equivalents where possible.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardFilter;

public class AccentedAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {

        // Split the input on whitespace
        final Tokenizer source = new WhitespaceTokenizer();

        // Standard token normalization
        final TokenStream result = new StandardFilter(source);

        // Lowercase every token
        final LowerCaseFilter lowerCaseFilter = new LowerCaseFilter(result);

        // Fold accented characters to their ASCII equivalents,
        // e.g. "Štefan" -> "stefan" after lowercasing
        final ASCIIFoldingFilter asciiFilter = new ASCIIFoldingFilter(lowerCaseFilter);

        return new TokenStreamComponents(source, asciiFilter);
    }
}
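
For anyone who wants to verify the output, a quick sanity check (a hypothetical demo class, assuming the AccentedAnalyzer above is on the classpath):

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AccentedAnalyzerDemo {

    public static void main(String[] args) throws Exception {
        try (TokenStream ts = new AccentedAnalyzer().tokenStream("name", "Štefan Miklošovič")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term); // prints "stefan", then "miklosovic"
            }
            ts.end();
        }
    }
}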

gjabelAmbitas commented 5 years ago

Hi! Any news about this problem? Is it still impossible to do this without creating a custom analyzer?

smiklosovic commented 5 years ago

@gjabelAmbitas in the end, we created one more column into which we save e.g. the last name without diacritics, and we perform searches only against this "dumb" column. Whatever the user enters is dumbed down to remove all diacritics (there are libraries for that, and core Java can do it too), and we run the search with that string against the dumb column. We then return the "with diacritics" column back to the user, so even if you search without diacritics, you get them back in the results.
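
The dumbing down itself needs no extra library; a minimal sketch using only core Java (the class and method names are just illustrative):

import java.text.Normalizer;

public final class Diacritics {

    private Diacritics() {
    }

    // Decompose accented characters (NFD) and strip the combining marks,
    // e.g. "Štefan Miklošovič" -> "Stefan Miklosovic"
    public static String strip(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }
}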

With this approach you can do wildcard queries too, using ?, *, . and so on. The only case where it "does not work" is when you want to search with diacritics, for example when you want to return ONLY people named "Miklošovic" and not "Miklošovič": the dumb column cannot tell them apart, so my full name would be returned as well. But you have to ask yourself whether that scenario is ever going to happen; everybody enters your name without diacritics when searching anyway.

To transform all existing names into this dumb column, we used Apache Spark and its Cassandra connector.
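
A rough sketch of such a backfill job with the Spark Cassandra connector (untested; the name_ascii column and the keyspace/table names are placeholders for whatever your schema uses):

import java.text.Normalizer;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class DiacriticsBackfill {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("diacritics-backfill").getOrCreate();

        // Strip diacritics the same way the application does at query time
        spark.udf().register("strip_diacritics", (UDF1<String, String>) s ->
                s == null ? null : Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", ""),
                DataTypes.StringType);

        // Read the existing rows from Cassandra
        Dataset<Row> accents = spark.read()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "test_keyspace")
                .option("table", "accent")
                .load();

        // Compute the "dumb" column and append it back to the same table
        accents.withColumn("name_ascii", callUDF("strip_diacritics", col("name")))
                .write()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "test_keyspace")
                .option("table", "accent")
                .mode("append")
                .save();

        spark.stop();
    }
}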