cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.1k stars 3.81k forks source link

sql: default string collations of PG vs CockroachDB is different #102861

Open otan opened 1 year ago

otan commented 1 year ago

in pg, ordering seems to be case insensitive by default:

psql (14.6 (Homebrew), server 13.8)
Type "help" for help.

rdsdb=> select 'PH3710230411ax24302d-8a29-42' > 'PH3710230411TFR2304102009571';
 ?column?
----------
 f
(1 row)

but not in cockroach:

demo@127.0.0.1:26257/defaultdb> select 'PH3710230411ax24302d-8a29-42' > 'PH3710230411TFR2304102009571';
  ?column?
------------
     t
(1 row)

Time: 1ms total (execution 0ms / network 0ms)

but if you make it collate C in PG, it's "cockroach like":

replicationunload=> select 'PH3710230411ax24302d-8a29-42'::varchar(32) collate "C" > 'PH3710230411TFR2304102009571'::varchar(32) collate "C";
 ?column?
----------
 t
(1 row)

NOTE: on OSX, PG's default is C though, just to fool you - could only repro this in a linux env

This breaks verification tooling between PG <-> CockroachDB if we compare/iterate PKs in ascending order, such as AWS DMS's. The workaround is to make sure the collations match between the two sources.

Slack thread: https://cockroachlabs.slack.com/archives/CV581CE78/p1681424078522449

Jira issue: CRDB-27696

rafiss commented 1 year ago

on OSX, PG's default is C though, just to fool you - could only repro this in a linux env

If this is platform dependent in PG, should we be treating this as a CRDB bug? Specifically, I wonder if there are some Linux environments where the default collation also is C.

This is what PG docs say (https://www.postgresql.org/docs/current/collation.html):

24.2.2. Managing Collations

A collation is an SQL schema object that maps an SQL name to locales provided by libraries installed in the operating system. A collation definition has a provider that specifies which library supplies the locale data. One standard provider name is libc, which uses the locales provided by the operating system C library. These are the locales used by most tools provided by the operating system. Another provider is icu, which uses the external ICU library. ICU locales can only be used if support for ICU was configured when PostgreSQL was built.

A collation object provided by libc maps to a combination of LC_COLLATE and LC_CTYPE settings, as accepted by the setlocale() system library call. (As the name would suggest, the main purpose of a collation is to set LC_COLLATE, which controls the sort order. But it is rarely necessary in practice to have an LC_CTYPE setting that is different from LC_COLLATE, so it is more convenient to collect these under one concept than to create another infrastructure for setting LC_CTYPE per expression.) Also, a libc collation is tied to a character set encoding (see Section 24.3). The same collation name may exist for different encodings.

A collation object provided by icu maps to a named collator provided by the ICU library. ICU does not support separate “collate” and “ctype” settings, so they are always the same. Also, ICU collations are independent of the encoding, so there is always only one ICU collation of a given name in a database.

24.2.2.1. Standard Collations

On all platforms, the collations named default, C, and POSIX are available. Additional collations may be available depending on operating system support. The default collation selects the LC_COLLATE and LC_CTYPE values specified at database creation time. The C and POSIX collations both specify “traditional C” behavior, in which only the ASCII letters “A” through “Z” are treated as letters, and sorting is done strictly by character code byte values.

ryanluu12345 commented 1 month ago

Hi @rafiss , was just reading through the internal thread where you discussed this issue previously with Oliver and just wanted to see if there is any more context or investigation related to this matter? The reason I'm asking is that I think we are seeing another case of this issue, but it's related to the presence of special characters in the VARCHAR primary string (i.e. + is specified at multiple parts of the string). We're not exactly sure how these special characters are handled under the hood on the PG and CDRB sides.

Based on some local testing, I can see that on both Mac and Linux, setting PG source to collation "C" and leaving the default Cockroach UTF-8 en-US setting works fine and the data is sorted in the correct order, even when we do string key comparisons (i.e. < or >). But even using the default PG collation on Linux or Mac (not "C"), leads to comparison issues -- they're ordered differently even when I order by this PK ASC.