Open otan opened 1 year ago
on OSX, PG's default is C though, just to fool you - could only repro this in a linux env
If this is platform dependent in PG, should we be treating this as a CRDB bug? Specifically, I wonder if there are some Linux environments where the default collation also is C
.
This is what PG docs say (https://www.postgresql.org/docs/current/collation.html):
24.2.2. Managing Collations
A collation is an SQL schema object that maps an SQL name to locales provided by libraries installed in the operating system. A collation definition has a provider that specifies which library supplies the locale data. One standard provider name is libc, which uses the locales provided by the operating system C library. These are the locales used by most tools provided by the operating system. Another provider is icu, which uses the external ICU library. ICU locales can only be used if support for ICU was configured when PostgreSQL was built.
A collation object provided by libc maps to a combination of LC_COLLATE and LC_CTYPE settings, as accepted by the setlocale() system library call. (As the name would suggest, the main purpose of a collation is to set LC_COLLATE, which controls the sort order. But it is rarely necessary in practice to have an LC_CTYPE setting that is different from LC_COLLATE, so it is more convenient to collect these under one concept than to create another infrastructure for setting LC_CTYPE per expression.) Also, a libc collation is tied to a character set encoding (see Section 24.3). The same collation name may exist for different encodings.
A collation object provided by icu maps to a named collator provided by the ICU library. ICU does not support separate “collate” and “ctype” settings, so they are always the same. Also, ICU collations are independent of the encoding, so there is always only one ICU collation of a given name in a database.
24.2.2.1. Standard Collations
On all platforms, the collations named default, C, and POSIX are available. Additional collations may be available depending on operating system support. The default collation selects the LC_COLLATE and LC_CTYPE values specified at database creation time. The C and POSIX collations both specify “traditional C” behavior, in which only the ASCII letters “A” through “Z” are treated as letters, and sorting is done strictly by character code byte values.
Hi @rafiss , was just reading through the internal thread where you discussed this issue previously with Oliver and just wanted to see if there is any more context or investigation related to this matter? The reason I'm asking is that I think we are seeing another case of this issue, but it's related to the presence of special characters in the VARCHAR primary string (i.e. +
is specified at multiple parts of the string). We're not exactly sure how these special characters are handled under the hood on the PG and CDRB sides.
Based on some local testing, I can see that on both Mac and Linux, setting PG source to collation "C" and leaving the default Cockroach UTF-8 en-US setting works fine and the data is sorted in the correct order, even when we do string key comparisons (i.e. < or >
). But even using the default PG collation on Linux or Mac (not "C"), leads to comparison issues -- they're ordered differently even when I order by this PK ASC.
in pg, ordering seems to be case insensitive by default:
but not in cockroach:
but if you make it
collate C
in PG, it's "cockroach like":NOTE: on OSX, PG's default is C though, just to fool you - could only repro this in a linux env
This breaks verification tooling between PG <-> CockroachDB if we compare/iterate PKs in ascending order, such as AWS DMS's. The workaround is to make sure the collations match between the two sources.
Slack thread: https://cockroachlabs.slack.com/archives/CV581CE78/p1681424078522449
Jira issue: CRDB-27696