EndPointCorp / end-point-blog

End Point Dev blog
https://www.endpointdev.com/blog/

Comments for DBD::Pg, UTF-8, and Postgres client_encoding #395

Open phinjensen opened 6 years ago

phinjensen commented 6 years ago

Comments for https://www.endpointdev.com/blog/2011/01/dbdpg-utf-8-and-postgres-clientencoding/ By Greg Sabino Mullane

To enter a comment:

  1. Log in to GitHub
  2. Leave a comment on this issue.
phinjensen commented 6 years ago
original author: Theory
date: 2011-01-13T22:34:05-05:00

Is there any reason it couldn't support other client_encodings? If I set it to Latin-1, then DBD::Pg could just decode it to utf8, right?

I know, if it's going to be decoded to Perl's internal utf8 format, one might as well set client_encoding to UTF-8. So maybe it's not worth it to support other encodings?

phinjensen commented 6 years ago
original author: David Christensen
date: 2011-01-13T22:53:04-05:00

Theory,

As I see it, there's really no reason to do anything more than set the client_encoding to 'utf-8' in the PQconnect() call; since Postgres will support converting any server_encoding to UTF-8, this is an easy way to avoid needing to maintain some sort of mapping between Postgres' concept of the encoding names and Encode.pm's naming of them. Anything other than SQL_ASCII (aka byte_soup or pg_enable_utf8) can be sensibly converted with minimal changes to DBD::Pg.
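A minimal sketch of the connect-time approach described above, assuming DBI and DBD::Pg with a reachable server (the DSN and credentials are placeholders). libpq also honors the PGCLIENTENCODING environment variable, which is one way to fix the encoding before PQconnect() runs:

```perl
use strict;
use warnings;
use DBI;

# Placeholder DSN/credentials; requires a running Postgres server.
# Setting PGCLIENTENCODING fixes the encoding at connect time rather
# than with a follow-up command.
local $ENV{PGCLIENTENCODING} = 'UTF8';

my $dbh = DBI->connect(
    'dbi:Pg:dbname=test', 'user', 'password',
    { RaiseError => 1, pg_enable_utf8 => 1 },
);

# The equivalent after connecting, valid for any server_encoding
# except SQL_ASCII:
$dbh->do(q{SET client_encoding = 'UTF8'});
```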

My personal concern is that applications would be unprepared to deal with data that has, to this point, been returned raw. I think that naïve applications will work fine, but those that implement application-level workarounds to convert to Perl's internal format would be affected the most by the change in behavior. I also think this is too useful a change not to be included and enabled by default, so perhaps a major version bump of DBD::Pg would help indicate that something fairly substantial is changing in the interface.

Cheers,

David

phinjensen commented 6 years ago
original author: Jon Jensen
date: 2011-01-14T03:13:45-05:00

Greg, you said someone "raised the issue of marking ASCII-only strings as utf8".

What is the issue there? If the client_encoding is set to UTF-8, then what comes out of the database should be marked as UTF-8 in Perl, even if it happens to be using only the ASCII subset of UTF-8.

Or am I missing something there?

phinjensen commented 6 years ago
original author: Greg Sabino Mullane
date: 2011-01-14T10:29:53-05:00

An issue with setting client_encoding ourselves is: when do we do it? And what if the client does not want us to (a reasonable request)? If we set it on startup, and then the client requests pg_enable_utf8 = 0, we should really revert to the default client_encoding (which we'd have to look up and then apply). We could connect, check what it is, and then change it to UTF-8 if needed, after storing the old value, but that would be a separate transaction/command on every startup. Perhaps neither is a big deal.

But I like the idea of "forcing" UTF-8 and then marking everything as utf8. If someone really wants a separate non-ASCII and non-UTF-8 encoding returned, they can set pg_enable_utf8; we revert the client_encoding and return the raw bytes to them so they can do what they want with it. Of course, we'll also have to check the server_encoding and simply do nothing if it's SQL_ASCII: no client_encoding setting, and no utf8 string setting.

David C, interesting point about the version number bump; please remind me of that when we get ready to release whatever we finally come up with. :)
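The check-then-set dance described above might look like the following (a sketch only; the DSN and credentials are placeholders and a live Postgres server is assumed):

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=test', 'user', 'password',
                       { RaiseError => 1 });

# Store the existing value so it can be restored later.
my ($old_enc) = $dbh->selectrow_array('SHOW client_encoding');

# Do nothing at all if the server can't convert (SQL_ASCII).
my ($server_enc) = $dbh->selectrow_array('SHOW server_encoding');
if ($server_enc ne 'SQL_ASCII') {
    $dbh->do(q{SET client_encoding = 'UTF8'});
}

# Later, if the app asks for raw bytes (pg_enable_utf8 = 0), revert:
$dbh->do('SET client_encoding = ' . $dbh->quote($old_enc));
```

Note the extra round trips on every startup that the comment above worries about: one to read the old value, one to check server_encoding, one to set.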

phinjensen commented 6 years ago
original author: Greg Sabino Mullane
date: 2011-01-14T10:31:55-05:00

Jon, I agree, it's almost a silly concern, but it was raised on the bug. Personally, I think scanning for high bits is pointless: it's still UTF-8 (and utf8), even if it's only using a small subset of the available characters (e.g. ASCII).

phinjensen commented 6 years ago
original author: David Christensen
date: 2011-01-15T01:37:01-05:00

This comment has been removed by the author.

phinjensen commented 6 years ago
original author: David Christensen
date: 2011-01-15T01:37:20-05:00

Greg,

"If we set it on startup, and then the client requests pg_enable_utf8 = 0, we should really revert to the default client_encoding (which we'd have to look up and then apply)."

So is the issue here that pg_enable_utf8 is currently a handle-level attribute, which can be set at any time, not just as a connection parameter? In my personal use, I've always set it up with the ->connect() options hashref, not changed it dynamically.

Perhaps the answer here is to allow the specification of the client_encoding as the connection option/handle attribute, which when set would then issue the 'SET client_encoding' (when used as a connected $dbh attribute) or the appropriate "options=..." call to PQconnect() if specified in the ->connect call. However, I'm not sure of the general reason why you would want to support the specific client_encoding output. It seems to me that the only reason to care about the database encoding would be that you (think) you know what the encoding is but the database doesn't (aka SQL_ASCII). If you care otherwise, just ask for raw output and handle it yourself at the app-level using Encode.pm.

I'm fairly confident that we could get things working relatively seamlessly with the data coming back from the database; however, have we given consideration to what happens on input? If client_encoding is UTF-8, we'd just need to check the utf8 flag on the way in and/or set :utf8 on the filehandle's binmode (assuming, of course, that the app is going to be sending us Perl internal character data). However, it seems that there would be a window for raw hi-bit data (particularly in apps which haven't given particular consideration to character encoding) to sneak in on input, and users would presumably get unexpected "invalid encoding for UTF8" errors when they were not explicitly working with Unicode data.

(Maybe the answer here is to call utf8::upgrade on any input string, or to append it to an empty string with the is_utf8 flag set.) The first option would taint the data from the caller's perspective, so it seems to be out as an option; the second would incur (I believe) a copy/possible re-encode for the concatenation, which obviously hurts performance. Maybe we'd have to resort to saving the utf8 flag and restoring it when processing a string; I dunno, not a lot of great options here wrt backwards-compatibility that don't tank performance. I'll mull it over a bit more.
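For illustration, utf8::upgrade changes only a string's internal representation and flag, never its character content. A standalone sketch (no DBD::Pg involved; the utf8:: functions are builtins and need no pragma):

```perl
use strict;
use warnings;

# A byte string holding hi-bit data: Latin-1 "é" as a single octet.
my $s = "caf\x{e9}";
print utf8::is_utf8($s) ? "flagged\n" : "unflagged\n";   # unflagged

# upgrade() reinterprets the octets as Latin-1 characters and stores
# them internally as UTF-8; the string still has 4 characters.
utf8::upgrade($s);
print utf8::is_utf8($s) ? "flagged\n" : "unflagged\n";   # flagged
print length($s), "\n";                                  # 4
```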

David

phinjensen commented 6 years ago
original author: David Christensen
date: 2011-01-15T01:55:23-05:00

Jon,

"What is the issue there? If the client_encoding is set to UTF-8, then what comes out of the database should be marked as UTF-8 in Perl, even if it happens to be using only the ASCII subset of UTF-8.

Or am I missing something there?"

This is one of the more confusing parts of Perl's Unicode handling, IMHO; the UTF-8 flag should have been named something else entirely, as it really just indicates that Perl sees the data as internally encoded as UTF-8, specifically with regard to the handling of hi-bit characters in the data. The flag does not actually indicate that the data itself is valid UTF-8, as it can be set independently of the data (not recommended unless you know what you're doing, as you can flag an SV containing arbitrary data as utf8, which does not automatically convert the octets to a valid UTF-8 representation).

String concatenation takes the UTF-8 flag into consideration: if both strings have the same state for the UTF-8 flag, it's essentially just a copy of the underlying data, but if one of the strings has the flag set and the other does not, the concatenation has to first upgrade the non-utf8-marked string to actual UTF-8, then do the concatenation, with the result tagged with the UTF-8 flag.
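The concatenation behavior described above can be observed directly (a standalone sketch using the builtin utf8::is_utf8):

```perl
use strict;
use warnings;

my $bytes = 'abc';          # plain octets, UTF-8 flag off
my $wide  = "\x{263A}";     # code point > 0xFF forces the flag on

# Joining a flagged and an unflagged string: Perl upgrades a copy of
# the unflagged operand first; the result carries the UTF-8 flag,
# while $bytes itself is left untouched.
my $joined = $bytes . $wide;

printf "bytes: %d, wide: %d, joined: %d\n",
    utf8::is_utf8($bytes)  ? 1 : 0,
    utf8::is_utf8($wide)   ? 1 : 0,
    utf8::is_utf8($joined) ? 1 : 0;    # bytes: 0, wide: 1, joined: 1
```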

I suspect that for most use cases it would be fine to have the flag set on pure ASCII (encoding-wise, there's no issue); however, there are some modules that change their behavior depending on the state of the flag, so there could be different code paths taken that wouldn't strictly be needed when dealing with ASCII-only (or legacy 8-bit) data, and there are modules that may refuse to process data with the utf8 flag set (ISTR some issues in the past with Digest::SHA1 as an example; since the algorithm is defined only on bytes and not characters, it's an error to pass wide-character data in). It may be that these modules have been updated to not care about the state of the flag, or to ignore it and only throw an error when encountering a character with code point > 0xFF, so it may not be an actual issue, but at least the potential exists for some unexpected behavior.

David

phinjensen commented 6 years ago
original author: Jon Jensen
date: 2011-01-15T11:59:06-05:00

David, to me the argument that some Perl modules (still?) can't handle UTF-8 data is all the more reason why the flag should be consistently set.

If you don't set the UTF-8 flag when only ASCII-subset data is present, you're likely to have code that works most of the time, but when the occasion arises that the database returns some more-than-ASCII string, the code will fail.

Better to fail early and recognize that the module you're depending on isn't suitable, or else switch to always requesting ASCII encoding, isn't it?

phinjensen commented 6 years ago
original author: Darren Duncan
date: 2011-01-16T17:19:02-05:00

I'm with those who believe it is best to always set the UTF-8 flag to true when UTF-8 data is requested, regardless of whether only the ASCII subset of the repertoire is used.

phinjensen commented 6 years ago
original author: Darren Duncan
date: 2011-01-16T17:30:06-05:00

As an addendum (and I could suggest this to p5p): if the issue with marking ASCII as UTF-8 is about performance (optimized single-byte code paths would be skipped), one possibility that might work is adding another flag for Perl strings that says it is known that only the ASCII subset is in use. This extra flag would be false by default, but if some operation in Perl decides to go through the work of checking that no high bits are set, it can set the flag to true ... or it could be 3-valued: known-high-set, known-high-not-set, not-known. Then code deciding to use an optimized path later can just look at the flag to help it decide what to do. Presumably such a change may not be binary compatible, so it would only come in a major release, or it might not be that useful.

But in principle, for a type system, I think it would be useful for implementations to be able to mark a value as being known to be of a particular subset of its otherwise declared type, which could help optimization greatly.

phinjensen commented 6 years ago
original author: Greg Sabino Mullane
date: 2011-01-18T10:17:48-05:00

David: Looks like the canonical way is to transform the data before hitting Digest::SHA1; see the recipe here: http://my.opera.com/cstrep/blog/2010/09/24/survival-guide-to-utf-8
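The recipe boils down to encoding characters to octets before digesting. A sketch using the core Digest::SHA (Digest::SHA1 behaves the same way here):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use Digest::SHA qw(sha1_hex);

my $str = "r\x{e9}sum\x{e9} \x{263A}";   # contains a code point > 0xFF

# Digests are defined on bytes, so wide characters are refused
# with a "Wide character" error.
my $ok = eval { sha1_hex($str); 1 };
print $ok ? "digested raw\n" : "refused wide characters\n";

# The fix: encode characters to UTF-8 octets first, then digest.
my $hex = sha1_hex(encode_utf8($str));
print length($hex), "\n";    # 40
```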

phinjensen commented 6 years ago
original author: Greg Sabino Mullane
date: 2011-01-18T10:20:13-05:00

Darren: interesting idea, but I doubt it would go over well for Perl 5. I wonder if Perl 6 handles utf8 the same way as Perl 5?