FirebirdSQL / firebird

Firebird server, client and tools
https://firebirdsql.org

Convert non-ascii names of trigger and exception (+ non-ascii exception message) to UTF-8 before saving them in firebird.log or showing in a trace log #8271

Open pavel-zotov opened 1 month ago

pavel-zotov commented 1 month ago

When a trigger on DISCONNECT raises an exception, the name of the trigger, the name of the exception and the exception message are saved in firebird.log. They can also be seen in the trace log if we turn on the 'log_errors' config parameter.

If any of them contain non-ascii characters, then firebird.log / the trace log may be unreadable: non-ascii characters will be displayed as mojibake or may be missing ("swallowed"). This occurs only when we use a single-byte character set for the connection. There is no such problem if the connection charset = utf8.

It would be good if any non-ascii text were always converted to utf8 before being saved in firebird.log or displayed in the trace log. The attached .zip contains .sql scripts and results for cp1250, cp1251 and cp1252 - all of them are similar (and it makes no difference whether I set font = 'consolas' or 'lucida console' in my cmd.exe).

The Excel file "non-ascii-characters-in-firebird-and-trace-log_-_overall-outcome.xlsx" shows all results as a table for convenience.

non-ascii-characters-in-firebird-and-trace-log.zip
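The symptom described above can be reproduced outside Firebird. The following is a minimal Python sketch, not Firebird code; the message text is a hypothetical example and is not taken from the attached scripts. It shows what happens when bytes written in a single-byte connection charset are later read with a different encoding.

```python
# Hypothetical exception message (an assumption for illustration only).
message = "Ошибка в триггере"

# Bytes as a client with connection charset win1251 would send them.
raw = message.encode("cp1251")

# A reader that assumes the right code page sees the original text.
as_cp1251 = raw.decode("cp1251")

# A reader that assumes a Western code page sees mojibake.
as_cp1252 = raw.decode("cp1252", errors="replace")

# A reader that assumes UTF-8 sees replacement characters ("swallowed" text).
as_utf8 = raw.decode("utf-8", errors="replace")

print(as_cp1251)
print(as_cp1252)
print(as_utf8)
```

The same bytes are readable, garbled, or lost depending only on the encoding the reader assumes, which matches the observation that the problem does not appear when the connection charset is utf8 and the reader also assumes UTF-8.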

AlexPeshkoff commented 1 month ago

Doubt that's a good idea - in most cases one reads the log in the native OS charset.

mrotteveel commented 1 month ago

> Doubt that's a good idea - in most cases one reads the log in the native OS charset.

Most tools these days default to using UTF-8 for reading text files.

aafemt commented 1 month ago

A BOM in firebird.log can help these tools.

mrotteveel commented 1 month ago

UTF-8 should not use a BOM.

aafemt commented 1 month ago

On the contrary: this is the only way to distinguish between ANSI and UTF-8 text files. ANSI files cannot use a BOM.

mrotteveel commented 1 month ago

The bytes of a UTF-8 BOM could occur in ANSI files, as they might be mapped to normal characters; it is just unlikely that they would be the first three bytes of the file.

Historically, the Unicode standard said that the BOM should not be used in UTF-8, but that they may occur when converting from other encodings. However, it seems they have relaxed that stance, since the Unicode 16 standard says:

> Use of a BOM is not required for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.

In older versions, e.g. Unicode 5, it said: "Use of a BOM is neither required nor recommended for UTF-8".
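The remark that the UTF-8 BOM bytes map to normal characters in ANSI code pages can be checked with a short Python sketch. The helper name `starts_with_utf8_bom` is hypothetical and only illustrates how a reader might sniff the signature.

```python
import codecs

bom = codecs.BOM_UTF8  # the three bytes EF BB BF

# In single-byte code pages these bytes are ordinary printable characters.
as_cp1252 = bom.decode("cp1252")
as_cp1251 = bom.decode("cp1251")


def starts_with_utf8_bom(data: bytes) -> bool:
    """Hypothetical helper: report whether data begins with the UTF-8 signature."""
    return data.startswith(codecs.BOM_UTF8)


# Python's "utf-8-sig" codec strips a leading BOM if present.
with_bom = bom + "log line".encode("utf-8")
text = with_bom.decode("utf-8-sig")

print(repr(as_cp1252), repr(as_cp1251), starts_with_utf8_bom(with_bom), text)
```

So a BOM check is a heuristic: an ANSI file could legitimately begin with those three characters, although that is unlikely in practice.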

aafemt commented 1 month ago

> In older versions, e.g. Unicode 5, it said: "Use of a BOM is neither required nor recommended for UTF-8".

I read it as "there is no requirement or recommendation about using a BOM" instead of "there is a requirement and recommendation not to use a BOM".

mrotteveel commented 1 month ago

No, that is not how you should read the words required or recommended or their negation in a standard, but in any case, my objection is not founded in the current standard.

That said, in my experience modern tools always use UTF-8 when writing logs, and text editors have zero problems with reading UTF-8 files without a BOM, so I don't see the need to add a BOM, nor do I see a need to write a mishmash of encodings in a single file that depends on the connection character set, as we currently seem to do.
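The alternative to a mix of encodings in one file can be sketched in Python as follows. This is an illustration under stated assumptions, not how Firebird writes its log: the function `to_utf8_log_line` and the mapping `CODEC_FOR_CHARSET` are hypothetical, and the sample messages are invented.

```python
# Illustrative mapping from Firebird charset names to Python codec names (assumption).
CODEC_FOR_CHARSET = {
    "WIN1250": "cp1250",
    "WIN1251": "cp1251",
    "WIN1252": "cp1252",
    "UTF8": "utf-8",
}


def to_utf8_log_line(raw: bytes, conn_charset: str) -> bytes:
    """Transcode text received in the connection charset to one UTF-8 log line."""
    codec = CODEC_FOR_CHARSET[conn_charset.upper()]
    # Undecodable bytes become U+FFFD instead of aborting the log write.
    text = raw.decode(codec, errors="replace")
    return text.encode("utf-8") + b"\n"


# Messages arriving over connections with different charsets (invented samples).
entries = [
    ("Ошибка".encode("cp1251"), "WIN1251"),
    ("Ошибка".encode("utf-8"), "UTF8"),
    ("Błąd".encode("cp1250"), "WIN1250"),
]

log = b"".join(to_utf8_log_line(raw, charset) for raw, charset in entries)

# The whole file now decodes with a single encoding.
lines = log.decode("utf-8").splitlines()
print(lines)
```

Because every entry is transcoded at write time, a reader needs to know only one encoding regardless of which connection produced the entry.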

asfernandes commented 1 month ago

AFAIR we read config files in Windows using the system code page. Output of log should match input of config files IMO.

aafemt commented 1 month ago

> AFAIR we read config files in Windows using the system code page.

That's why I opened #5479 during my attempt to make Firebird Unicode. Configs that are read only in the ANSI codepage disallow the use of non-ANSI paths and file names.

pavel-zotov commented 1 month ago

> Doubt that's a good idea - in most cases one reads the log in the native OS charset.

On Linux the default OS charset is UTF-8, right? But this is not so on Windows. Even if I have the FB server on a machine with a Russian system locale, it is still possible to see unreadable results for scripts encoded in cp1251 with charset = win1251: (image)

PS. Host used: Windows 10, ver 22H2, build 19045.3086.

systeminfo | findstr /b /i /c:"System Locale"
System Locale:             ru;Russian

mrotteveel commented 1 month ago

> AFAIR we read config files in Windows using the system code page. Output of log should match input of config files IMO.

And that is not a good option, because then you cannot properly render identifiers or statement texts that are sent with UTF-8, or in a connection character set other than the system code page. We no longer live in the '90s.
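The rendering problem with a fixed system code page can be shown with a small Python sketch. The identifier below is a hypothetical trigger name, and cp1252 stands in for an assumed Western system code page.

```python
# Hypothetical trigger name sent by a client (an assumption for illustration).
identifier = "ТРИГГЕР_ВЫХОДА"

# Writing it in a Western system code page fails outright in strict mode...
try:
    identifier.encode("cp1252")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...or loses every Cyrillic letter when errors are replaced.
lossy = identifier.encode("cp1252", errors="replace")

# UTF-8 can represent the identifier and round-trips it unchanged.
round_trip = identifier.encode("utf-8").decode("utf-8")

print(strict_ok, lossy, round_trip)
```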

mrotteveel commented 1 month ago

> > AFAIR we read config files in Windows using the system code page.
>
> That's why I opened #5479 during my attempt to make Firebird Unicode. Configs that are read only in the ANSI codepage disallow the use of non-ANSI paths and file names.

As long as Firebird mangles filenames of databases on Windows by uppercasing them, that will not help a lot as uppercasing sometimes results in the wrong uppercase variant for non-ASCII characters. (Separately, this uppercasing also breaks if an NTFS folder is set to case-sensitive.)
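Why uppercasing is unsafe for non-ASCII file names can be illustrated with Python's built-in Unicode case mapping. This is only a sketch of the general problem; it does not reproduce Firebird's own uppercasing routine, and the file names are hypothetical.

```python
# Hypothetical database file names (assumptions for illustration).
german = "straße.fdb"
turkish = "yıldız.fdb"  # contains the dotless i, U+0131

# Uppercasing can change the length of the name.
german_upper = german.upper()

# Uppercasing and lowercasing again does not give back the original name.
german_round_trip = german_upper.lower()
turkish_round_trip = turkish.upper().lower()

print(german_upper, german_round_trip, turkish_round_trip)
```

In both cases the name after case conversion no longer identifies the original file, which is the kind of mismatch described above.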

aafemt commented 1 month ago

Yes, and there are a lot of other places where Unicode in Firebird isn't handled at all. Any attempt to fix all of them at once will fail. This elephant must be eaten piece by piece.

asfernandes commented 1 month ago

> Yes, and there are a lot of other places where Unicode in Firebird isn't handled at all. Any attempt to fix all of them at once will fail. This elephant must be eaten piece by piece.

That's why branches exist: to do things in pieces and merge them all at once.

aafemt commented 1 month ago

> That's why branches exist.

The problem is that at some point merges become too hard. You should know it with your schemas branch.

asfernandes commented 1 month ago

> The problem is that at some point merges become too hard. You should know it with your schemas branch.

Of course, but this is software development.

The INTL refactor and the introduction of Unicode in v2.5 were done in a long-term, massive-change branch in CVS, and it worked.

aafemt commented 1 month ago

> Of course, but this is software development.

This is a bad style of development. 60+ waiting pull requests and almost no activity in the master branch can make outsiders think that the project is dead.