Problems with sort order (UTF8 locales don't work)

carlosfrodriguez commented 10 years ago

Hello,

The pg_config points to a local directory that does not exists

LOCALEDIR = /Applications/Postgres.app/Contents/Versions/9.3/share/locale

Can you include it on further versions?

((enjoy)) cr

jakob commented 10 years ago

Could you elaborate what exactly you expect to find in this directory? Do you want localized server messages?

carlosfrodriguez commented 10 years ago

Hi @jakob

I'm sorry I'm not sure, Im having problems with locales in PostgreSQL and i saw that this directory is missing, so i tough that could be the reason.

((enjoy)) cr

jakob commented 10 years ago

I think this directory is only used for localized server messages.

Locales are provided by the system. PostgreSQL usually chooses your default locale when initialising the database cluster. You can choose a different locale by manually calling initdb:

initdb -D DATA_DIRECTORY -EUTF-8 --locale=XX_XX.UTF-8

jakob commented 10 years ago

You can type locale -a to list available locales on your system.

tbussmann commented 10 years ago

Im having problems with locales in PostgreSQL

what kind of problems do you have? If it's about a strange ordering, like I encountered, you will have to realize that collations will not work on any BSD-ish OS (incl. OSX) for an UTF8 encoding. You will never get the sorting right, unfortunately.

jakob commented 10 years ago

@tbussmann I just tried this, and I am shocked, but it seems you are right. German strings are sorted incorrectly with a locale setting of de_DE.UTF-8. The only way I could get PostgreSQL to sort the strings correctly, was to create a database that uses latin1 as character encoding:

create database latin1_test WITH encoding 'LATIN1' LC_COLLATE='de_DE.ISO8859-1' LC_CTYPE='de_DE.ISO8859-1' TEMPLATE=template0

Then the strings are sorted correctly. However, using latin1, it is no longer possible to include foreign characters in the table... this is ridiculous. Is this behaviour documented anywhere?

jakob commented 10 years ago

And I also just found out that the sort command in the Terminal has the same problem. Does this mean that BSD systems don't support sorting UTF-8 strings at all?

jakob commented 10 years ago

The solution seems to be that PostgreSQL should use ICU rather the BSD functions for sorting. See this wiki page: https://wiki.postgresql.org/wiki/Todo:ICU

Possibly I can build PostgreSQL using a patch made by Palle Girgenson for FreeBSD, but I'm not sure how well-supported this patch is: http://people.freebsd.org/~girgen/postgresql-icu/README.html

tbussmann commented 10 years ago

you can see the reason for this with ls -l /usr/share/locale/de_DE.UTF-8 you see that LC_COLLATE only symlinks to la_LN.US-ASCII. You get the same if you sort something on the shell, so it's a OS specific, not PG specific problem. AFAIR this affects all BSD OS. Some ML posts of Tom Lane in that topic: http://www.postgresql.org/message-id/16510.1263450305@sss.pgh.pa.us http://www.postgresql.org/message-id/22721.1264203310@sss.pgh.pa.us http://www.postgresql.org/message-id/23053.1337036410@sss.pgh.pa.us It seems we will have to wait for http://wiki.postgresql.org/wiki/Todo:ICU , use a different OS or use a function for sorting, that can internally use ICU form pl/perl, unaccent contrib, or your own implementation.

To me this is the biggest drawback of using PostgresApp (PostgreSQL on OSX in general) and something that should be clearly highlighted in the documentation.

tbussmann commented 10 years ago

funny, you were quicker while I was still googleing and typing :) didn't know about the FreeBSD patch existing. Would be great if you can give it a try!

jakob commented 10 years ago

Thanks for your explanations. I'll try building with the patch. Don't have time right now, but I'll post here when I know more.

macarthy commented 9 years ago

So if I want to add thai TH locale to do Thai string sorting in postgresapp, I need to get the TH locale installed to OSX, so that is listed in locale -a

Is that correct?

jakob commented 9 years ago

@macarthy in principle, yes. You'd just need to create a new database with the collation. However, as a word of caution, UTF-8 locales seem to be fundamentally broken on OSX. Postgres uses the strcoll API, which unfortunately does not support multibyte encodings on OSX.

macarthy commented 9 years ago

Looks like strcoll is a mess on OsX. Thanks for your work.

jakob commented 9 years ago

I have built the 9.4 release candidate with the patch from Palle Girgensohn. I also had to bundle ICU, which is rather large, but now sorting seems to work as expected!

If you encountered issues with text sorting, please download the latest prerelease and see if it works: https://github.com/PostgresApp/PostgresApp/releases/tag/9.4rc1

I'd really appreciate feedback; if everything works I'll include the patch in the final release of Postgresapp 9.4.

thaJeztah commented 9 years ago

Wondering; is the built-in version of PostgreSQL on OS X also patched with ICU? Perhaps a ticket should be created in the OS X issue tracker?

jakob commented 9 years ago

After reading this blog post i came to the conclusion that Apple doesn't care the least about their PostgreSQL installation. Feel free to file a radar, but I don't feel like wasting my time reporting issues that will probably be ignored.

You can check if PostgreSQL is build with the ICU patch by running pg_config (look for the --with-icu flag in the configure parameters).

Or you can try sorting text. With my locale (de_AT.UTF-8) I run the following query:

SELECT unnest(array['a','ä','b']) ORDER BY 1;

On a broken system, this will return

a
b
ä

With ICU it will be sorted correctly:

a
ä
b

macarthy commented 9 years ago

Quick question. if I rename this build PostgresWithICU.app, and run it it will use the previous I was using with Postgres.app DBs etc ? Or do I need to move the data dirs etc.

I'd like to try this out with Thai and Lao sorting.

Thanks

On 24 November 2014 at 14:09, Jakob Egger notifications@github.com wrote:

After reading this blog post http://blog.2ndquadrant.com/ware-yosemite-possible-postgresql-upgrade-issues-os-x-10-10/ i coame to the conclusion that Apple doesn't care the least about their PostgreSQL installation. Feel free to file a radar, but I don't feel like wasting my time reporting issues that will probably be ignored.

You can check if PostgreSQL is build with the ICU patch by running pg_config (look for the --with-icu flag in the configure parameters).

Or you can try sorting text. With my locale (de_AT.UTF-8) I run the following query:

SELECT unnest(array['a','ä','b']) order by 1;

On a broken system, this will return

a b ä

With ICU it will be sorted correctly:

a ä b

Reply to this email directly or view it on GitHub https://github.com/PostgresApp/PostgresApp/issues/216#issuecomment-64157627 .

jakob commented 9 years ago

Postgres.app always needs to be named "Postgres.app" otherwise it won't find the shared libraries. However, if you start it, it will offer to rename the old app automatically.

You should really dump & restore the database because all the indices will be incorrect due to the changed sort order. If you've previously been using 9.3, you have to do that anyway, and you can use both versions in parallel, the new version will create a new database for 9.4 in a separate directory.

If you've already been on one of the 9.4 betas, you'll have to rename the data directory before switching Postgres.app.

thaJeztah commented 9 years ago

After reading this blog post i came to the conclusion that Apple doesn't care the least about their PostgreSQL installation. Feel free to file a radar, but I don't feel like wasting my time reporting issues that will probably be ignored.

Thanks for that link, that's quite nasty (even though it wasn't a "supported" approach). And agreed, activity on the developer-forums is quite low and little to no response from Apple employees.

jakob commented 9 years ago

Unfortunately including this patch isn't as straightforward a decision as I thought. I'd like to invite anybody affected by this issue to discuss whether to include the patch in the 9.4 release in issue #233.

@macarthy any feedback yet on how the new build works with thai sorting?

Chocksy commented 9 years ago

@jakob any solution to this using the stable version instead of the release candidate??

jakob commented 9 years ago

I'm just preparing the 9.4.2 builds. I'll see if I can get a build with ICU working.

Chocksy commented 9 years ago

Thanks @jakob

orbatec commented 8 years ago

I don't quite understand why the correct order should be a ä b .... when I look at the codepoint values in UTF-8 for those characters, I see the values 97 228 98 .... Sorting them in ascending order results in 97 98 228 or a b ä

What am I missing?

--Quote-- Or you can try sorting text. With my locale (de_AT.UTF-8) I run the following query:

SELECT unnest(array['a','ä','b']) ORDER BY 1; On a broken system, this will return

a b ä With ICU it will be sorted correctly:

a ä b --End Quote--

jakob commented 8 years ago

@orbatec What you describe is the expected behavior when using the POSIX locale. The locale defines how text should be sorted. In the Austrian locale, the correct sort order is a ä b. However, on OS X, strcoll is broken and therefore PostgreSQL ignores the locale and just sorts the strings by comparing their UTF-8 representation (ie. it effectively always uses the POSIX locale).

Also, the codepoint 228 is just one possible representation of ä. You could also represent it as 97 776, which is an a followed by the combining character ¨. These two representations would be sorted differently in the POSIX locale, but for the Austrian locale it shouldn't matter which codepoints are chosen to represent a letter.

Please also note that there is a difference between unicode code points and their encoding (eg. UTF-8).

Text handling is a very complex topic. If you are interested, I recommend to spend a few days reading about Unicode, code points, characters, graphemes, encodings, normal forms and collations. It will change the way you think about text.

orbatec commented 8 years ago

@jakob Thanks! Yep...just spent an hour reading up on this and by god...why do we always make things so complex :-) Unfortunately I am working on Mac with Postgres 9.4 ... my colleague has found a "patch" for 9.4 (acually rc01 or rc02) that seems to fix the behaviour. How come an rcxx fixes something that doesn't seem fixed in the final postgres? (sorry, when I see fix, I mean workaround for a mac os problem)

jakob commented 8 years ago

@orbatec There has been a lot of discussion about the ICU patch here on Github and on the PostgreSQL hackers mailing list. The conclusion was that I didn't want the default build of PostgreSQL to use an unofficial patch. Including separate builds of PostgreSQL with ICU is a lot of work, and also dangerous, since people switching between the two builds might corrupt their database.

There has been discussion on the mailing list about adding official support for ICU in 9.6; I hope this will fix the issue for good.

tbussmann commented 7 years ago

now that PostgreSQL 10 is approaching, I'd like to get back to this topic.

In the recent cycle patches landed upstream which support icu as an alternative collation provider (thus different from the FreeBSD patch discussed and experimented with previously in 9.4rc01/02). I have done continuous tests with Git master and beta builds with active ICU support on macOS. As far as I see Postgres.app 2.1beta1 does also include this. My tests were quite promising. Unfortunately at the current state the icu-provided collations cannot be used as cluster or database default collation (LC_COLLATE). So, to make use of it, users of PostgresApp either need to attach a COLLATE clause on the fly to each comparison and order statement, or for a easier use to columns, indexes or domains. To my tests this works nicely. With the example from above:

# SELECT unnest(array['a','ä','b']) COLLATE "de-AT-u-co-phonebk-x-icu" ORDER BY 1;
┌────────┐
│ unnest │
├────────┤
│ a      │
│ ä      │
│ b      │
└────────┘

To make this a bit more clear, I'd recommend to add a note in documentation and I'd consider to explicitly stating the missing default collation with a --lc-collate=C in initdb in initDatabaseSync which is actually is:

$ ls -l /usr/share/locale/de_AT.UTF-8/LC_COLLATE
lrwxr-xr-x  1 root  wheel  28 27 Jan 22:48 /usr/share/locale/de_AT.UTF-8/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE

this would likely avoid further confusion and would provide a consistent behaviour again in case a database is moved to a different system supporting UTF8 libc collations.

jakob commented 7 years ago

Ugh — I didn’t know that using ICU collations as default isn’t possible yet – that’s a bummer.

Using –lc-collate=C is a nice idea to get consistent behavior when migrating the db to a different system. And maybe when people look at that setting, it becomes a bit mire obvious that they need to change something if they want sensible sorting.

tbussmann commented 2 years ago

actually, Apple did further complicate the situation with a change of the symlinks of collations between macOS 10.15 and 11: /usr/share/locale/en_US.UTF-8/LC_COLLATE now links to ../la_LN.ISO8859-1/LC_COLLATE instead of ../la_LN.US-ASCII/LC_COLLATE which allows some basic collation support within the range of ISO8859-1 characters. The underlaying collation files in la_LN did not change but the OS is now using a different one.

On up to macOS 10.15:

$ ls -l /usr/share/locale/en_US.UTF-8/LC_COLLATE 
lrwxr-xr-x  1 root  wheel  28  4 Jul  2018 /usr/share/locale/en_US.UTF-8/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE
$ md5 /usr/share/locale/en_US.UTF-8/LC_COLLATE 
MD5 (/usr/share/locale/en_US.UTF-8/LC_COLLATE) = 7b008442fbaf5dfe7a10fb3d82a634ab

on macOS 11 and newer:

$ ls -l /usr/share/locale/en_US.UTF-8/LC_COLLATE 
lrwxr-xr-x  1 root  wheel  29  1 Jan  2020 /usr/share/locale/en_US.UTF-8/LC_COLLATE -> ../la_LN.ISO8859-1/LC_COLLATE
$ md5 /usr/share/locale/en_US.UTF-8/LC_COLLATE 
MD5 (/usr/share/locale/en_US.UTF-8/LC_COLLATE) = 8f25380a3395ebbb913fcfa9f51382b7

While this improves the user experience to certain degree, it introduces a risk of index corruption when updating the OS across this boundary (see #665)

Some tests to show the new behaviour on macOS 11:

$ printf "ä\na\nb" | sort
a
ä
b
$ psql -c "SELECT unnest(array['ä','a','b']) ORDER BY 1;"
 unnest 
--------
 a
 ä
 b
(3 rows)

In the old macOS, PostgresApp's default collation (en_US.UTF-8) behaved identically to 'C' (Query to verify: WITH char AS (SELECT chr(i) c, i FROM generate_series(1, 255) i) SELECT c1.c, c1.i, c2.c, c2.i FROM char c1 FULL OUTER JOIN char c2 ON (true) WHERE c1.c < c2.c COLLATE "en_US" AND c1.c >= c2.c COLLATE "C"; returns empty set) which allowed PostgreSQL to use some optimisations ("abbreviated keys" in btree, pattern matching with index support). In the new macOS versions you can still get the latter but need to specify the xxx_pattern_ops opclasses when creating the index.

Some tests to show the potential problems:

on a mac running macOS 10.15 or earlier, create a fresh database cluster / server and connect to it. Then:

CREATE TABLE colltest (i SERIAL, c TEXT UNIQUE);
INSERT INTO colltest (c) (SELECT chr(s) FROM generate_series(1, 255) s);
SELECT count(*) FROM colltest WHERE c < 'a'; --returns 96
INSERT INTO colltest (c) VALUES ('.'); --duplicate key violation, as expected

stop the cluster, copy the data directory over to a mac running macOS 11 or newer, start and connect

SELECT count(*) FROM colltest WHERE c < 'a'; --now returns 185 ('different result' anomaly)
SET enable_seqscan = off; --force use of index
SELECT count(*) FROM colltest WHERE c < 'a'; --returns 96 as is is using the index
SELECT * FROM colltest WHERE c = '.'; -- we see no record ('missing data' anomaly)
INSERT INTO colltest (c) VALUES ('.'); --no duplicate key violation!
SELECT * FROM colltest WHERE c = '.'; -- we see one record with id 257
RESET enable_seqscan;
SELECT * FROM colltest WHERE c = '.'; -- we see two records with id 46 and 257 ('duplicate key' anomaly)
REINDEX TABLE colltest; -- we get an error on the duplicate key

PostgresApp / PostgresApp

Problems with sort order (UTF8 locales don't work) #216