darold / pgbadger

A fast PostgreSQL Log Analyzer
http://pgbadger.darold.net/
PostgreSQL License

fix input according to locale rather than perlio default #667

Open jcarnu opened 3 years ago

jcarnu commented 3 years ago

If the logs contain UTF-8 characters, the JSON output embeds "badly converted" characters:

echo "2021-06-08 13:51:04.966 UTC [22436-2] postgres@postgres LOG:  duration: 5025.921 ms  statement: select ' € € ', pg_sleep(5);"|perl pgbadger -x json -f stderr  -p '%m [%p-%l] %q%u@%d ' -o /tmp/test.json - && cat /tmp/test.json | jq . |grep sleep
[+] Parsed 129 bytes, queries: 1, events: 0
LOG: Ok, generating json report...
      "select ?, pg_sleep(?);": {
            "query": "select ' ⬠⬠', pg_sleep(5);",
        "select ' ⬠⬠', pg_sleep(5);",

This patch produces correct query output with the given locale (the locale should be adapted to the encoding of the input file); a sketch of the change follows the output below:

echo "2021-06-08 13:51:04.966 UTC [22436-2] postgres@postgres LOG:  duration: 5025.921 ms  statement: select ' € € ', pg_sleep(5);"|LANG=en_US.UTF-8 perl pgbadger -x json -f stderr  -p '%m [%p-%l] %q%u@%d ' -o /tmp/test.json - && cat /tmp/test.json | jq . |grep sleep
[+] Parsed 125 bytes, queries: 1, events: 0
LOG: Ok, generating json report...
        "select ' € € ', pg_sleep(5);",
      "select ?, pg_sleep(?);": {
            "query": "select ' € € ', pg_sleep(5);",

I didn't run extended tests on the other output formats, though it would be useful to make sure there are no regressions.

Here are the simple tests I ran:

for fmt in 'html' 'tsung' 'text'; do
    echo "***** ${fmt} *****"
    echo "2021-06-08 13:51:04.966 UTC [22436-2] postgres@postgres LOG:  duration: 5025.921 ms  statement: select ' € € ', pg_sleep(5);"|perl pgbadger -x ${fmt} -f stderr  -p '%m [%p-%l] %q%u@%d ' -o /tmp/test - && cat /tmp/test  |grep sleep
done

and the corresponding outputs:

***** html *****
[+] Parsed 125 bytes, queries: 1, events: 0
LOG: Ok, generating html report...
Wide character in print at pgbadger line 11618, <DATA> line 122.
Wide character in print at pgbadger line 11865, <DATA> line 122.
Wide character in print at pgbadger line 12120, <DATA> line 122.
Wide character in print at pgbadger line 12375, <DATA> line 122.
    <span class="kw2">pg_sleep</span><span class="br0">(</span><span class="nu0">5</span><span class="br0">)</span>;
    <span class="kw2">pg_sleep</span><span class="br0">(</span>?<span class="br0">)</span>;
    <span class="kw2">pg_sleep</span><span class="br0">(</span><span class="nu0">5</span><span class="br0">)</span>;
    <span class="kw2">pg_sleep</span><span class="br0">(</span>?<span class="br0">)</span>;
    <span class="kw2">pg_sleep</span><span class="br0">(</span><span class="nu0">5</span><span class="br0">)</span>;
    <span class="kw2">pg_sleep</span><span class="br0">(</span>?<span class="br0">)</span>;
    <span class="kw2">pg_sleep</span><span class="br0">(</span><span class="nu0">5</span><span class="br0">)</span>;
***** tsung *****
Wide character in print at pgbadger line 16443.
Illegal division by zero at pgbadger line 17512, <GEN3> line 1.
LOG: Ok, generating tsung report...
    <request><pgsql type='sql'><![CDATA[select ' € € ', pg_sleep(5);]]></pgsql></request>
***** text *****
[+] Parsed 125 bytes, queries: 1, events: 0
LOG: Ok, generating text report...
Wide character in print at pgbadger line 5783, <DATA> line 122.
Wide character in print at pgbadger line 5808, <DATA> line 122.
Wide character in print at pgbadger line 5820, <DATA> line 122.
Wide character in print at pgbadger line 5848, <DATA> line 122.
Wide character in print at pgbadger line 5860, <DATA> line 122.
Wide character in print at pgbadger line 5888, <DATA> line 122.
Wide character in print at pgbadger line 5900, <DATA> line 122.
1) 5s25ms database: postgres, user: postgres - select ' € € ', pg_sleep(5);
1) 5s25ms - 1 - 5s25ms/5s25ms/5s25ms - select ' € € ', pg_sleep(5);
    - Example 1: 5s25ms - database: postgres, user: postgres - select ' € € ', pg_sleep(5);
1) 1 - 5s25ms - 5s25ms/5s25ms/5s25ms - select ' € € ', pg_sleep(5);
    Example 1: 5s25ms - database: postgres, user: postgres - select ' € € ', pg_sleep(5);
1) 5s25ms/5s25ms/5s25ms - 1 - 5s25ms - select ' € € ', pg_sleep(5);
    Example 1: 5s25ms - database: postgres, user: postgres - select ' € € ', pg_sleep(5);

It's quite basic (rough), but it does surface the "Wide character in print" warnings.

I'd like to get rid of these warning messages; could you give me some guidance (your thoughts) on how to finish this PR without warnings? If that's not possible, I'll try to find a correct way on my own.

Please also take note of the "Illegal division by zero" in the tsung output; it comes from this line:

# Excerpt from pgbadger's progress reporting; this dies when $total is 0:
sprintf(
    "[%-${width}s] Parsed %${num_width}s bytes of %s (%.2f%%), $lbl: %d\r",
    $char x $nchars . '>',
    $got, $total, 100 * $got / +$total, ($queries || $overall_stat{'queries_number'})
);

where $total seems to be zero. The other outputs guard against this by dividing by (+$total || 1), as sketched below.
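
A minimal sketch of what the guarded call could look like, assuming the same variables are in scope (this just mirrors the (+$total || 1) idiom used by the other outputs; it is not the actual fix):

sprintf(
    "[%-${width}s] Parsed %${num_width}s bytes of %s (%.2f%%), $lbl: %d\r",
    $char x $nchars . '>',
    # dividing by (+$total || 1) avoids "Illegal division by zero"
    # when $total is 0
    $got, $total, 100 * $got / (+$total || 1),
    ($queries || $overall_stat{'queries_number'})
);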

If this fix is correct, I'll include it in another PR.

Thanks for creating pgbadger, it's an everyday friend for me!

darold commented 3 years ago

Thanks @jcarnu for the test to reproduce the issue. Actually, I would prefer to force the output to UTF-8 using:

use utf8;                  # declare that the script source itself is UTF-8

binmode STDIN,  ':utf8';   # treat standard input as UTF-8
binmode STDOUT, ':utf8';   # emit UTF-8 on standard output (avoids "Wide character in print")
binmode STDERR, ':utf8';   # same for standard error

With this change, the test reports the right characters whatever the locale is:

$ echo "2021-06-08 13:51:04.966 UTC [22436-2] postgres@postgres LOG:  duration: 5025.921 ms  statement: select ' € € ', pg_sleep(5);"|perl pgbadger -x json -f stderr  -p '%m [%p-%l] %q%u@%d ' -o /tmp/test.json - && cat /tmp/test.json | jq . |grep sleep
[+] Parsed 125 bytes, queries: 1, events: 0
LOG: Ok, generating json report...
        "select ' € € ', pg_sleep(5);",
      "select ?, pg_sleep(?);": {
            "query": "select ' € € ', pg_sleep(5);",

The problem with the use open pragma is that it doesn't cover all cases. For example, set your locale as follows:

$ export LANG=C
$ locale
LANG=C
LANGUAGE=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=

Then, with your patch applied, run the test and you will get this output:

$ echo "2021-06-08 13:51:04.966 UTC [22436-2] postgres@postgres LOG:  duration: 5025test.json - && cat /tmp/test.json | jq . |grep sleeprl pgbadger -x json -f stderr  -p '%m [%p-%l] %q%u@%d ' -o /tmp/t
[+] Parsed 147 bytes, queries: 1, events: 0
LOG: Ok, generating json report...
        "select ' \\xE2\\x82\\xAC \\xE2\\x82\\xAC ', pg_sleep(5);",

Now, if you use :utf8 in the pragma instead of the :locale you are using (use open ':locale';), you will get the following output:

$ echo "2021-06-08 13:51:04.966 UTC [22436-2] postgres@postgres LOG:  duration: 5025test.json - && cat /tmp/test.json | jq . |grep sleeprl pgbadger -x json -f stderr  -p '%m [%p-%l] %q%u@%d ' -o /tmp/t
[+] Parsed 129 bytes, queries: 1, events: 0
LOG: Ok, generating json report...
      "select ?, pg_sleep(?);": {
            "query": "select ' ⬠⬠', pg_sleep(5);"

If you use my patch above, which sets binmode on all the standard I/O handles, you will get the right output:

$ echo "2021-06-08 13:51:04.966 UTC [22436-2] postgres@postgres LOG:  duration: 5025test.json - && cat /tmp/test.json | jq . |grep sleeprl pgbadger -x json -f stderr  -p '%m [%p-%l] %q%u@%d ' -o /tmp/t
[+] Parsed 125 bytes, queries: 1, events: 0
LOG: Ok, generating json report...
      "select ?, pg_sleep(?);": {
            "query": "select ' € € ', pg_sleep(5);",
        "select ' € € ', pg_sleep(5);",

I have noted the "Illegal division by zero" issue in the tsung output. I can fix it, but I'm not sure the tsung output is actually used; maybe it can be removed to simplify the code.

jcarnu commented 3 years ago

Hello Gilles,

Thanks for your reply.

My follow-up here is only meant to broaden the discussion on input/output encodings.

I agree that use utf8; may be a good choice; it may handle a large number of use cases.

I do not have your background of user feedback, so I was wondering how things should work for queries in other encodings that UTF-8 does not handle (Asian languages in particular).

My point was: if we know the logged queries use a specific locale, there should be a "nice" way to handle them through :locale. The main point of :locale is its encoding part. But I agree this should be a side use case.

OTOH, we could think of a feature to force the input encoding (not the locale), such as a dedicated switch; a rough sketch is below...
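
Purely as an illustration of what such a switch might look like (nothing below exists in pgbadger; the option name --input-encoding and the loop are hypothetical), the input could be decoded explicitly with Encode:

use strict;
use warnings;
use Encode qw(decode);
use Getopt::Long;

# Hypothetical option: force the charset used to decode the log input.
my $input_encoding = 'UTF-8';
GetOptions('input-encoding=s' => \$input_encoding);

binmode STDOUT, ':utf8';    # still emit the report as UTF-8

while (my $line = <STDIN>) {
    # Decode the raw bytes with the requested charset instead of relying
    # on the locale or on a :utf8 default.
    my $decoded = decode($input_encoding, $line);
    # ... existing pgbadger parsing would continue with $decoded ...
    print $decoded;
}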

Thanks.