Why Latin1 for report tables?

gggeek commented 1 year ago

Is it ok to use eg. utf8 charset when defining the reports tables? It would seem to make sense, given the script/hostname data present in them

anton-povarov commented 1 year ago

Shouldn't be an issue to use whatever charset you prefer. Have you tried? The engine itself does not interpret strings, just stores them as a bunch of bytes.

gggeek commented 1 year ago

Thx for the info. I will give it a try and come back with the results

anton-povarov commented 1 year ago

One potential issue i can think of is protobuf-c implementation on the client side. In case you use it from php - it might store them as c-strings (i.e. terminating null byte) - shouldn't be a problem for utf8, but maybe other charsets. Engine itself, parses protobuf as data+length, so null bytes should not be a problem there.

In any case - curious to see if it works for you, if it doesn't - we'll investigate.

gggeek commented 1 year ago

I did test this locally with "simple" utf8 script such as 'κόσμε', and it works.

The Pinba2 server in use was the official container from this project. The client-side were the php pinba extension (ver. 1.1.1) from Ubuntu Focal, and one pure-php protobuf implementation.

I created the reports table as

CREATE TABLE `report_by_script_name` (
    `script` varchar(64) NOT NULL,
    ...
) ENGINE=PINBA DEFAULT CHARSET=utf8 COMMENT='v2/request/60/~script/no_percentiles/no_filters';

And I had to add in the code querying its data: $db->set_charset('utf8');

gggeek commented 1 year ago

ps: about NULL chars: indeed, if I add chr(0) in the middle of the script_name, the corresponding line does not show up any more in the db after calling flush on the client side. I did not investigate the raw protobuf packet to find out what was being sent (or not) in that case...

anton-povarov commented 1 year ago

Ok, great to know. I'll keep this one open for a bit longer, please let me know if you get to check out the null chars - is it a client php extension issue or an engine issue.

gggeek commented 1 year ago

WDYT about changing the sql files which are part of this repo? At the very least we could add a comment in them mentioning Latin1 is not a requirement...

anton-povarov commented 1 year ago

Sure, happy to accept a PR with utf8 everywhere in sql files.

gggeek commented 1 year ago

Sure, will do.

Btw, I checked the "raw message" being sent when there is a NUL byte in the middle of the php string used to indicate the script_name, using Wireshark as protobuf dissector:

both the php extension and my pure-php polyfill do send a message in that scenario
the packet sent by the php extension is dissected as having a string with a value made of all chars up to the null one, excluding it
the packet sent by the polyfill apparently sends "garbled" data

So it seems that the code dropping those packets might be on the server-side (I will test that again with Pinba2, last tests were ran against Pinba1)

gggeek commented 1 year ago

if I add chr(0) in the middle of the script_name, the corresponding line does not show up any more in the db after calling flush on the client side

Further testing disproved the above. I ran fresh tests with a NUL char in the middle of hostname, server_name and script_name, after having patched my own lib to work the same way as the php extension (ie. truncate strings at the first NUL char found), and both pinba1 and pinba2 engines are happy with what they get. I wonder why I got no-data results in the first rounds of testing... probably because sending a single NUL char results in an empty string, which might not be considered a valid script_name/server_name/hostname. In any case: sending of NUL chars does not seem to have much to do with the chosen character set encoding for retrieving the data once it's in the DB (at least as long we are talking about UTF8 vs Latin1 - they both use the same encoding for NUL). I will send a PR as discussed above.

anton-povarov commented 1 year ago

I kind of assumed that pinba and pinba2 engines should treat strings with null-bytes in the middle differently, attributing to slightly different ways they unpack incoming protobuf packets. Pinba1 - truncating, pinba2 - preserving. For uts8 this point is moot since there is you're "not supposed to" havre null-bytes inside strings.

Thank you for the PR's, have been very busy lately, will review asap.

gggeek commented 1 year ago

I kind of assumed that pinba and pinba2 engines should treat strings with null-bytes in the middle differently, attributing to slightly different ways they unpack incoming protobuf packets.

This is possible.

What I did verify is that, for the pinba php extension, the truncation happens on the client side - possibly within code generated by the protoc compiler - so in those tests the NUL char was not sent to the engine at all.

As for being able to send a NUL char as part of the protobuffer packet: I did manage to do that by using my own php implementation of the client, but the results were mixed. I tried to decode the raw protobuf packet to check its contents: using protoc --decode_raw, the string was decoded correctly, and NUL displayed as \u0000; using wireshark the string was highlighted as being malformed. This prompted me to alter the behaviour of the php code to behave the same way as the php extension, as there seems to be little point in being able to send NUL chars if in any case all it would be helpful for is to expose weird corner cases in the receiver side. For all I know, it could even be used for some hacking attempt :-) I can not report reliably of behaviour differences between pinba1 and pinba2.

For uts8 this point is moot since there is you're "not supposed to" have null-bytes inside strings.

Not to be pedantic, but the NUL character is valid in utf8, just as it is valid in ascii. It is just C strings that have issues with that ;-)

badoo / pinba2

Why Latin1 for report tables? #28