benbernard / RecordStream

commandline tools for slicing and dicing JSON records.

Unicode and newline characters break totable #71

Open benbernard opened 8 years ago

benbernard commented 8 years ago

When characters like '…' or newlines appear in an output column, the columns no longer line up. We should probably transform the output, and also leave a --no-escape option or similar...

Probably also affects toptable. I think those are the only two that require text alignment...

copy from irc discussion:

10:20 <ben_bernard> ugh, just found an annoying bug with totable
10:21 <ben_bernard> if you have a newline in one of the fields, everything goes to hell
10:23 <amling> if you have a lot of things I think you can arrive at hell
10:23 <amling> any interesting unicode characters e.g.
10:23 <ben_bernard> ugh
10:23 <ben_bernard> also
10:23 <ben_bernard> yeah
10:23 <ben_bernard> elipsis
10:23 <ben_bernard> character
10:24 <ben_bernard> https://www.dropbox.com/s/8ea5w1of8g1sm7z/Screenshot%202015-11-24%2010.24.12.png?dl=0
10:24 <ben_bernard> I feel like this is a conquerable problem
10:25 <amling> oh the twittening
10:25 <amling> you can probably super jackass it with pre-xform '{{text}} = some_perl_module_text_escaper({{text}})'
10:25 <amling> ultimately we might like very clear text output sinks to do that internally
10:25 <amling> just like table already automatically encodes non-strings into JSON (IIRC)
10:26 <ben_bernard> yeah, that is what I was thinking
10:26 <amling> uh
10:26 <amling> there's also color
10:26 <amling> and we may want to leave a super insane --no-encode or something around
10:27 <amling> actually color is also fucked by --no-encode
10:27 <amling> hmm
10:27 <amling> okay, maybe --no-encode is useless
10:27 <amling> anyway, should maybe think about color at the same time
10:27 <ben_bernard> yeah
10:28 <amling> some combination of encoding for text output (I think '\n' is unavoidable) and deciding effective width (color is zero, weird unicode stuff is ... something else)
benbernard commented 8 years ago

Some additional investigation by @amling and me has yielded this:

use open (':std', ':utf8');

(also maybe use utf8;).

That first one will make the input streams be read as UTF-8, which should fix the ellipsis encoding issue (we still need to handle newlines).

tsibley commented 8 years ago

Yeah, I've worked around UTF-8 safety in recs before using env PERL_UNICODE=SAD, which is roughly equivalent to your use open line.

Note that :encoding(UTF-8) is preferred to :utf8 for the open pragma since :utf8 means Perl's internal Unicode representation (lax, not strict) and :encoding(UTF-8) means the actual UTF-8 standard.

use utf8 only serves to inform perl that your file is encoded as UTF-8 and therefore may contain string literals with UTF-8 characters. It doesn't affect I/O directly.

bokutin commented 8 years ago

Hello.

I am a Japanese user.

The other day, I ran into what might be the same root problem.

screenshot

The value returned by CORE::length can differ from the display width.

You can get the display width of UTF-8 text with the Text::VisualWidth::PP module: http://search.cpan.org/perldoc?Text::VisualWidth::PP

I believe the patch below works well.

# diff -u /usr/local/lib/perl5/site_perl/App/RecordStream/Operation/totable.pm.orig /usr/local/lib/perl5/site_perl/App/RecordStream/Operation/totable.pm
--- /usr/local/lib/perl5/site_perl/App/RecordStream/Operation/totable.pm.orig   2016-04-16 16:13:42.948118000 +0900
+++ /usr/local/lib/perl5/site_perl/App/RecordStream/Operation/totable.pm        2016-04-16 16:20:05.791268000 +0900
@@ -41,6 +41,12 @@
   $this->{'CLEAR'}         = $clear;
 }
+use Encode;
+use Text::VisualWidth::PP;
+sub _length {
+  Text::VisualWidth::PP::width( utf8::valid($_[0]) ? decode_utf8($_[0]) : $_[0] );
+}
+
 sub stream_done {
   my $this = shift;
@@ -58,14 +64,14 @@
         push @$fields, $field;
       }
-      $widths{$field} = max($widths{$field}, length($this->extract_field($record, $field)));
+      $widths{$field} = max($widths{$field}, _length($this->extract_field($record, $field)));
     }
   }
   my $no_header = $this->{'NO_HEADER'};
   if(!$no_header) {
     foreach my $field (keys(%widths)) {
-      $widths{$field} = max($widths{$field}, length($field));
+      $widths{$field} = max($widths{$field}, _length($field));
     }
   }
@@ -164,9 +170,9 @@
     }
     if ( (! $this->{'SPREADSHEET'}) &&
-      (length($field_string) < $widthsr->{$field})) {
+      (_length($field_string) < $widthsr->{$field})) {
-      $field_string .= " " x ($widthsr->{$field} - length($field_string));
+      $field_string .= " " x ($widthsr->{$field} - _length($field_string));
     }
     if($first) {

Please use it if you find it useful.

benbernard commented 8 years ago

Awesome! Thanks for the patch @bokutin. I'll try to get it into master soon (it needs tests too), or feel free to open a pull request if you'd like.

Thanks very much for the contribution.

benbernard commented 8 years ago

A note here for later: this certainly also affects toptable

tsibley commented 8 years ago

Unfortunately the patch above is fixing a symptom, not the root cause. The root cause is that recs doesn't decode/encode its input/output streams and instead treats them as bytes. This is why length() is returning the number of bytes instead of the number of characters.

Additionally, using the utility functions inside the utf8 pragma is generally considered bad practice. Reading the documentation for the utf8::valid() function (and the related function utf8::is_utf8()) should make it clear that they test the internal-to-Perl state of a string and therefore do not mean what it may seem they mean.

The proper solution is to consistently decode/encode I/O within all of recs, most likely in one of the top-level stream classes. Requiring I/O to be UTF-8 is not unreasonable this day and age, and folks can either explicitly convert when passing to recs, or recs could gain a global --encoding option.

To learn more about Unicode and Perl, I highly suggest reading Tom Christiansen's Perl Unicode Essentials talk.

benbernard commented 8 years ago

Yeah, good point. I was going to try your suggested route first, and it seems like doing that in InputStream would be the best approach... I'm also not opposed to just getting a fix out there, but yeah, agreed on the right way to approach this.

-Ben


tsibley commented 8 years ago

Nod. The patch may fix the specific case presented, but it introduces other bugs due to treating Perl's internal SV state as a "UTF-8 sniffer." I don't recommend applying it even for the short term.

tsibley commented 8 years ago

However! Text::VisualWidth::PP may still be useful, and I'm glad @bokutin mentioned it. :-)
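For reference, here is the distinction the module accounts for: character count versus terminal columns, which differ for East Asian wide characters. This assumes the Text::VisualWidth::PP CPAN module is installed and that the input has already been decoded to characters.

```shell
perl '-Mopen=:std,:encoding(UTF-8)' -MText::VisualWidth::PP -E '
  use utf8;
  my $s = "日本語";                          # three characters...
  say length($s);                            # prints 3
  say Text::VisualWidth::PP::width($s);      # ...but six terminal columns: prints 6
'
```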

benbernard commented 8 years ago

Yep, agreed on both points, you're obviously the encoding expert here :)


bokutin commented 8 years ago

Yes, I think my patch is shortsighted.

On the other hand, resolving this correctly may be difficult.

I may prepare an App::RecordStream::Operation::totable::UTF8 for myself.

tsibley commented 8 years ago

Your screenshot with Encode misunderstands how string literals work in Perl. You need to use utf8 if you're going to put UTF-8 characters in string literals. Otherwise, the strings are treated as bytes/ASCII characters. In your last example, the one which prints 0, you decode your byte-literal into a character string and then compare it to the byte-literal. Perl is properly saying they don't match.

Look at this example which compares characters to characters: it decodes a byte-string on the left hand side and declares string literals as UTF-8 characters on the right hand side:

$ perl -MEncode -E 'say decode_utf8("♥") eq do { use utf8; "♥" } ? 1 : 0'
1

Encodings must always be handled explicitly. Handling them implicitly is a recipe for bugs and confusion. APIs must declare if they deal in characters (i.e. decoded bytes) or in bytes (i.e. encoded characters).

For recs and other tools, the sanest choice is picking a default that will cause the least amount of surprise for as many users and allowing users who need something other than the default to change it. In this case, I'm suggesting recs default to doing UTF-8 I/O and provide CLI options (+ environment vars?) to change that default.

bokutin commented 8 years ago

Thank you for pointing that out.

I agree with tsibley's suggestions. (^ ^)

If the encoding change also affects recs-grep and recs-eval, you might need to include use utf8 in the eval'd code.

tsibley commented 8 years ago

Any Perl snippets that recs evaluates will likely get an implicit use utf8 added by recs. This will make the default cases Just Work, while still allowing people to declare their snippets as non-UTF-8 with no utf8.

bokutin commented 7 years ago

http://search.cpan.org/~exodist/Term-Table-0.009-TRIAL/lib/Term/Table.pm#NOTE_ON_UNICODE/WIDE_CHARACTERS