searching for a string should ingore case and diacritic accents

Zendro-dev / graphql-server

Skeleton NodeJS project for a graphQL server.

GNU General Public License v3.0

0 stars 1 forks source link

searching for a string should ingore case and diacritic accents #73

Open AliciaMstt opened 3 years ago

AliciaMstt commented 3 years ago

Ignore case

Currently searching for a string is case sensitive, so that in the example below in the SPA searching for "chaya" yields no results:

Whilst "Chaya" does:

The desired behavior is that "chaya" "Chaya", "chAya" or any similar form should bring the same results than "Chaya".

Accents We have a similar issue with the special character ´ and others, which in Spanish are used all the time. Unfortunately even if we know that the correct spelling is maíz, users commonly would look for maiz. Therefore we need that searching for a string to ignore accents of anykind, something like:

a == âàáäÂÀÂÄ 
e == êèéëÊÈÉË 
i == îïíÎÏ 
o == ôöóÔÖ 
u == ûùüÛÜÙ 
c == çÇ 
n == ñ

So that if maíz is present in the db, the query maiz and maíz should yield the same results.

asishallab commented 3 years ago

A problem with a general solution is that the like operator is not implemented in all storage types. MongoDb and Neo4J only support regular expressions, which, of course, could be used to implement that behavior in these storage techs. Additionally the ilike operator which is case in-sensitive is only implemented in Postgres. Furthermore, we could not use regular expressions in all storages, because regexs are not implemented in Presto / Trino and Amazon S3 tables.

Thus, we'd need to first identify

What could we use in Presto / Trino?
What could we use in Amazon S3?
What could we use in all relational DBs?
What could we use in MongoDb and Neo4J, probably regexs?

Then we implement a helper module that constructs the respective storage type correct search and executes it in its respective data model layer. We could even (thanks @coeit ) implement a new operator like zilike (Zendro - ilike) and implement that for each storage separately.

The main question is to identify a common denominator of search function that can be implemented in all storage techs, but Cassandra. This question can be answered using something like the following hypothetical example table, inspired by Wikipedia's tech-tables:	Storage-Technology	Regular Expression Operator	regexp supports case-insensitive	like Operator	ilike Operator
Sequelize	`regexp`	?	true	only postgres

coeit commented 3 years ago

Storage-Technology	Regular Expression Operator	regexp supports case-insensitive	like Operator	ilike Operator	resource
Sequelize	`regexp` (~, PG/MySQL only)	`iRegexp` (~*, PG only)	`like`	`ilike` (PG only)	https://sequelize.org/master/manual/model-querying-basics.html#operators
MongoDB	`$regex`	`$options: 'i'`	can be implemented with regex	can be implemented with regex	https://docs.mongodb.com/manual/reference/operator/query-evaluation/
neo4j	`=~`	`(?i)<regexp>`	can be implemented with regex	can be implemented with regex	https://neo4j.com/docs/cypher-manual/current/clauses/where/#query-where-string
presto / trino	`regexp_like(<regexp>, <string>)`	`regexp_like((?i)<regexp>, <string>)`	`LIKE`	`LOWER(<string>) LIKE` OR can be implemented with regex	https://trino.io/docs/current/functions.html
amazonS3	-	-	`LIKE`	`LOWER(<string>) LIKE`	https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-operators.html#s3-glacier-select-sql-reference-pattern

coeit commented 3 years ago

Since regular expressions have quite a different syntax depending on the storage type when it comes to setting flags (for Zendro we need case insensitivity), a consistent way to define those flags in Zendro has to be defined.

Sequelize: iRegexp
mongoDB: $options: 'i'
neo4j / presto / trino: (?i)

Solutions

`iRegexp` operator

In the spirit of Sequelize, define a iRegexp operator that implements the i flag depending on the storage type.

Do we need notiRegexp ?

Define a Zendro specific regexp syntax

We can use e,g, Java regular expressions syntax like neo4j/ presto and trino do. We would have to translate this for mongoDB and sequelize to their specific implementations

asishallab commented 3 years ago

Response to comment about `iRegexp`

Implement the Zendro operator iRegexp and notiRegexp. It seems to be the best solution

asishallab commented 3 years ago

In order to resolve this issue we should first implement a basic solution and use the ilike operator in all storages, Zendro currently supports. This will avoid having errors if one Zendro instance uses Amozon S3. In a later iteration we could think about implementing a new zilike operator that actually uses regular expressions to match e.g. "a" with all variations of the letter with accents and such, in Amazon S3 these searches would be done with its internal ilike and the _ wild-card. zilike would thus have different behavior in different storage types.

Zendro-dev / graphql-server

searching for a string should ingore case and diacritic accents #73

Solutions

iRegexp operator

Define a Zendro specific regexp syntax

Response to comment about iRegexp

`iRegexp` operator

Response to comment about `iRegexp`