EndPointCorp / end-point-blog

End Point Dev blog
https://www.endpointdev.com/blog/
17 stars 65 forks source link

Comments for Case Sensitive MySQL Searches #710

Open phinjensen opened 6 years ago

phinjensen commented 6 years ago

Comments for https://www.endpointdev.com/blog/2012/10/case-sensitive-mysql-searches/ By Brian Buchalter

To enter a comment:

  1. Log in to GitHub
  2. Leave a comment on this issue.
phinjensen commented 6 years ago
original author: Anonymous
date: 2012-10-18T14:36:00-04:00

Here's the documentation you're looking for: http://dev.mysql.com/doc/refman/5.5/en/charset-general.html

The sort order you're looking for is referred to as 'collation' in the manual.

If you use a case sensitive collation, you can get your first query to run without any modifications:

mysql [localhost] {msandbox} (test) > show create table contacts\G 1. row Table: contacts Create Table: CREATE TABLE contacts ( id int(11) NOT NULL AUTO_INCREMENT, first_name char(40) COLLATE latin1_general_cs DEFAULT NULL, PRIMARY KEY (id) ) ENGINE=InnoDB AUTO_INCREMENT=8 DEFAULT CHARSET=latin1 COLLATE=latin1_general_cs 1 row in set (0.04 sec)

mysql [localhost] {msandbox} (test) > select * from contacts; +----+------------+ | id | first_name | +----+------------+ | 5 | aaron | | 6 | Alfred | +----+------------+ 2 rows in set (0.00 sec)

mysql [localhost] {msandbox} (test) > SELECT first_name FROM contacts WHERE first_name REGEXP '^[a-z]'; +------------+ | first_name | +------------+ | aaron | +------------+ 1 row in set (0.00 sec)

If you pick the collation for each table based on the queries you expect to execute more often, this shouldn't present a significant challenge.

Regards,

Fernando.

phinjensen commented 6 years ago
original author: Gregory Haase
date: 2012-10-18T16:59:38-04:00

I've seen too many apps trying to recover from case-sensitive search issues in the past to want to go there by default.

I've seen apps storing values wrapped in UPPER() so that they can wrap search terms in UPPER() and get a match

I've seen apps with function based indexes on UPPER(column_name).

And I've seen plenty of instances of: WHERE upper(search_term) = upper(column_name) where the developers can't understand why an index on column_name isn't being used.

I've also had to run my share of looking for records where UPPER(column_name) <> column_name to fix records that were stored incorrectly.

I'd rather have a collation I can set on the one table where I actually need case sensitive search - and have everything else work correctly by default.

phinjensen commented 6 years ago
original author: Brian Buchalter
date: 2012-10-18T17:35:18-04:00

@Fernando: Thanks for pointing out that documentation. It's certainly much clearly than some of the other options out there. I agree that hopefully, you are able to create tables with the character set and collate options you want, but often times we inherit systems which just went with MySQL's defaults. Such is life! But as I said, it is possible to change it. I'm sure there are some automated tools available for such things.

phinjensen commented 6 years ago
original author: Brian Buchalter
date: 2012-10-18T17:37:22-04:00

@Gregory: Yes, things can get ugly quick. There are also cases when developers bring massive amounts of data into application memory to do sorting and other manipulation best handled by the database. It's important to know the right tool for the job and how to use it.

phinjensen commented 6 years ago
original author: Anonymous
date: 2012-10-18T20:39:21-04:00

Brian,

To answer 'That was one of the few authoritative references I could find to a character set's "sort value", so perhaps there is a more technical term for it. I'd love to hear it.'

First try the PostgreSQL manual at http://www.postgresql.org/docs/9.2/static/collation.html then move on to http://dev.mysql.com/doc/refman/5.6/en/charset-general.html which explains:

"A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.

Suppose that we have an alphabet with four letters: “A”, “B”, “a”, “b”. We give each letter a number: “A” = 0, “B” = 1, “a” = 2, “b” = 3. The letter “A” is a symbol, the number 0 is the encoding for “A”, and the combination of all four letters and their encodings is a character set.

Suppose that we want to compare two string values, “A” and “B”. The simplest way to do this is to look at the encodings: 0 for “A” and 1 for “B”. Because 0 is less than 1, we say “A” is less than “B”. What we've just done is apply a collation to our character set. The collation is a set of rules (only one rule in this case): “compare the encodings.” We call this simplest of all possible collations a binary collation.

But what if we want to say that the lowercase and uppercase letters are equivalent? Then we would have at least two rules: (1) treat the lowercase letters “a” and “b” as equivalent to “A” and “B”; (2) then compare the encodings. We call this a case-insensitive collation. It is a little more complex than a binary collation. "

You might then move on to http://dev.mysql.com/doc/refman/5.6/en/adding-collation.html and finally http://dev.mysql.com/doc/refman/5.6/en/adding-collation-simple-8bit.html for a much more in-depth description of how to create the internals of a collation.

You can get no cost help from people familiar with MySQL in the #mysql channel. Similarly for the PG equivalent. People in either would have been able to help you quickly.

Your company may have its own chat channel somewhere and asking an expert there would probably also have got you a fast answer, because it's a normal part of introductory courses to cover this.

You might also consider taking a course or two, since it can save you a lot of time later. Or buy the MySQL Certification Study Guide and read it from cover to cover. It provides an excellent introduction to a broad range of topics and you might then consider going on to get certified.

Views are my own, for my employer's opinion, consult a PR person.

James Day, MySQL Senior Principal Support Engineer, Oracle

phinjensen commented 6 years ago
original author: Brian Buchalter
date: 2012-10-18T21:08:01-04:00

@James: Thanks for the links, I've updated my post. I used the phrase "sort value" because that's the language found in the Case Sensitivity in String Searches documentation.

Perhaps it would be helpful to update the document to provide the references to collation?