Enums types are not properly create when unicode character is used [rt.cpan.org #123698]

Migrated from rt.cpan.org#123698 (status was 'open')

Requestors:

felix.ostmann@gmail.com

From felix.ostmann@gmail.com on 2017-11-21 09:54:01 :

The {extra}{list} enum values are not correctly encoded. I use the same connection settings for the app itself, and all data from the database is correctly encoded except this enum.

> \dT+
...
 steinhaus_main | enum_tasks_status   | enum_tasks_status   | 4     | offen         +| 
                |                     |                     |       | erledigt      +| 
                |                     |                     |       | zurückgestellt | 
...

$ grep status -C5 Tasks.pm
...
  "status",
  {
    data_type => "enum",
    default_value => "offen",
    extra => {
      custom_type_name => "enum_tasks_status",
      list => ["offen", "erledigt", "zur\xFCckgestellt"],
    },
    is_nullable => 0,
  },
...

The file is UTF-8 encoded and has use utf8; at the beginning, so I expected:

      list => ["offen", "erledigt", "zurückgestellt"],

From ilmari+cpan@ilmari.org on 2017-11-21 11:08:27 :

These representations of the string are equivalent:

    $ perl -Mutf8 -E 'say "zur\xFCckgestellt" eq "zurückgestellt"'
    1

Schema::Loader uses Data::Dump to serialise method call arguments in the generated files, and it encodes all non-ASCII (and non-printable) characters using \x notation.

For aesthetic reasons it might be desirable to output Unicode word characters literally too, but the current output is not incorrect.
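
The equivalence described above can be checked character by character. The following is a minimal sketch, assuming the script itself is saved as UTF-8; the variable names are illustrative only and do not come from Schema::Loader.

```perl
use strict;
use warnings;
use utf8;    # the source is UTF-8, so the literal ü below is one character

my $escaped = "zur\xFCckgestellt";   # the form found in the generated file
my $literal = "zurückgestellt";      # the form one would type by hand

# Compare the strings three ways: length, the code point of the fourth
# character (U+00FC), and plain string equality.
my $same_length = length($escaped) == length($literal);
my $same_ord    = ord(substr($escaped, 3, 1)) == ord(substr($literal, 3, 1));
my $equal       = $escaped eq $literal;

print "equal\n" if $equal && $same_length && $same_ord;
```

Only ASCII is printed, so the result does not depend on how the terminal is configured.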

- ilmari

From felix.ostmann@gmail.com on 2017-11-21 11:43:13 :

It is not really the same ...

In the real code I have to call Encode::decode('ISO-8859-15', $enum) as a quick fix.

$ cat ticket123698.pl 
use utf8;
use 5.20.0;
use Data::Dumper;
say "zur\xFCckgestellt" eq "zurÃ¼ckgestellt";
print Dumper("zur\xFCckgestellt","zurÃ¼ckgestellt");
$ perl ticket123698.pl 
1
$VAR1 = 'zurï¿½ckgestellt';
$VAR2 = "zur\x{fc}ckgestellt";

From ilmari@ilmari.org on 2017-11-21 12:07:59 :

"Felix Antonius Wilhelm Ostmann via RT"
<bug-DBIx-Class-Schema-Loader@rt.cpan.org> writes:

> It is not really the same ...

The _internal_ representation is not the same; the \x form will be
represented internally as one byte per code point ("downgraded"), while
the literal form will be utf-8-encoded ("upgraded"). Semantically they
are the same, as evidenced by "eq" returning true.
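
A minimal sketch of that distinction, using only the core utf8:: functions: utf8::is_utf8 reports the internal storage format, and utf8::upgrade changes the storage without changing the value. The variable names are illustrative.

```perl
use strict;
use warnings;
use utf8;    # the source is UTF-8, so the literal ü is one character

my $escaped = "zur\xFCckgestellt";   # stored one byte per character
my $literal = "zurückgestellt";      # stored as UTF-8 internally

# utf8::is_utf8 looks at the storage format, not at the string's value.
my $escaped_flag = utf8::is_utf8($escaped) ? 1 : 0;
my $literal_flag = utf8::is_utf8($literal) ? 1 : 0;

# Upgrading a copy changes how it is stored, but not what it contains.
my $copy = $escaped;
utf8::upgrade($copy);
my $copy_flag   = utf8::is_utf8($copy) ? 1 : 0;
my $still_equal = ($copy eq $escaped && $copy eq $literal) ? 1 : 0;

print "escaped=$escaped_flag literal=$literal_flag ",
      "copy=$copy_flag still_equal=$still_equal\n";
# prints: escaped=0 literal=1 copy=1 still_equal=1
```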

> In the real code i have to make a Encode::decode('ISO-8859-15', $enum) as a quickfix. 

Please show where in the real code you have to do this.  It smells like
something you're passing it to is suffering from the Unicode Bug,
i.e. treating the characters in the 128..255 range differently depending
on the internal representation (see
https://metacpan.org/pod/perlunicode#The-%22Unicode-Bug%22 for details).
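
A small sketch of the Unicode Bug itself, assuming a script that enables neither the unicode_strings feature nor a "use v5.12" (or later) line: the same character matches \w only when the string is stored upgraded.

```perl
use strict;
use warnings;
# Deliberately no "use feature 'unicode_strings'" and no "use v5.12" or
# later: either one enables Unicode semantics and hides the effect.

my $downgraded = "\xFC";      # ü stored as a single byte
my $upgraded   = "\xFC";
utf8::upgrade($upgraded);     # same character, stored as UTF-8 internally

my $equal           = $downgraded eq $upgraded ? 1 : 0;
my $word_downgraded = $downgraded =~ /\A\w\z/  ? 1 : 0;
my $word_upgraded   = $upgraded   =~ /\A\w\z/  ? 1 : 0;

print "equal=$equal word_downgraded=$word_downgraded ",
      "word_upgraded=$word_upgraded\n";
# prints: equal=1 word_downgraded=0 word_upgraded=1
```

Adding the /u flag to the pattern, or enabling unicode_strings, makes both matches succeed.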

> $ cat ticket123698.pl 
> use utf8;
> use 5.20.0;
> use Data::Dumper;
> say "zur\xFCckgestellt" eq "zurÃ¼ckgestellt";
> print Dumper("zur\xFCckgestellt","zurÃ¼ckgestellt");
> $ perl ticket123698.pl 
> 1
> $VAR1 = 'zurï¿½ckgestellt';
> $VAR2 = "zur\x{fc}ckgestellt";

The different outputs here are a quirk of how Data::Dumper deals with
downgraded vs. upgraded strings (which could be viewed as an instance of
the Unicode Bug, but doesn't actually affect semantics).  The first one
is showing as ï¿½ because you haven't thold perl that your terminal
expects UTF-8-encoded strings.  Adding

    use open qw(:std :utf8);

to the script will make it apply a UTF-8 encoding layer to the standard
input/output/error filehandles, so non-ASCII characters show correctly.
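
The effect of an encoding layer can be seen without a terminal. This is a minimal sketch that writes to in-memory filehandles, using :encoding(UTF-8) as a stand-in for the layer that the open pragma would put on STDOUT; the variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "zur\xFCckgestellt";

# No encoding layer: U+00FC goes out as the single byte 0xFC, which a
# UTF-8 terminal cannot display.
open my $raw, '>', \my $raw_bytes
    or die "cannot open in-memory handle: $!";
print {$raw} $text;
close $raw;

# UTF-8 layer: the same character goes out as the two bytes 0xC3 0xBC.
open my $enc, '>:encoding(UTF-8)', \my $utf8_bytes
    or die "cannot open in-memory handle: $!";
print {$enc} $text;
close $enc;

printf "raw:     %d bytes, fourth byte %s\n",
    length($raw_bytes), unpack('H*', substr($raw_bytes, 3, 1));
printf "encoded: %d bytes, fourth and fifth bytes %s\n",
    length($utf8_bytes), unpack('H*', substr($utf8_bytes, 3, 2));
# prints: raw:     14 bytes, fourth byte fc
#         encoded: 15 bytes, fourth and fifth bytes c3bc
```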

- ilmari

From felix.ostmann@gmail.com on 2017-11-21 13:35:39 :

OK, here is the real-world scenario in pseudocode. I am using DBIx::Class + Catalyst + Template Toolkit.

ResultSet:
sub enum_status {
    my ($self) = @_;
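    # FIXME see https://rt.cpan.org/Public/Bug/Update.html?id=123698
    # Workaround: decode each enum value before returning it.
    return map { Encode::decode("ISO-8859-15", $_) } @{ $self->result_source->column_info('status')->{extra}->{list} };
    # Original version, unreachable while the workaround above is in place:
    return @{ $self->result_source->column_info('status')->{extra}->{list} };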
}

Catalyst-Controller:
$c->stash->{status_order} = [ $rs->enum_status ];

Template:
[% FOREACH status IN status_order %]
<a href="[% c.request.uri_with({status => status}) %]">
[% END %]

Without the FIXME workaround, the links are encoded as ISO-8859-15.
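
One way such a difference could arise is sketched below. This is an assumption for illustration: escape_value is a hypothetical helper, not code taken from Catalyst, URI, or Template Toolkit. It converts a string to UTF-8 bytes only when the string is stored upgraded, so the same value is percent-escaped differently depending on its internal representation.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: it encodes to UTF-8 bytes
# only when the string is flagged as UTF-8 internally.
sub escape_value {
    my ($value) = @_;
    utf8::encode($value) if utf8::is_utf8($value);
    # Percent-escape everything outside the URI "unreserved" set.
    $value =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $value;
}

my $downgraded = "zur\xFCckgestellt";
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);

print escape_value($downgraded), "\n";   # zur%FCckgestellt
print escape_value($upgraded),   "\n";   # zur%C3%BCckgestellt
```

The first result is what an ISO-8859-15 (or Latin-1) link would look like, and the second is the UTF-8 form.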

After reading your reply and the docs about the Unicode Bug, I changed the code to the following:

__PACKAGE__->add_columns(
...
  {         
    data_type => "enum",
    default_value => "offen",  
    extra => {
      custom_type_name => "enum_tasks_status",
      list => ["offen", "erledigt", "zur\xFCckgestellt"],
    },      
    is_nullable => 0,          
  },
...
);
...
# DO NOT MODIFY THIS OR ANYTHING ABOVE! md5sum:W4KhHAXiEW35h5XWiZwhFg
utf8::upgrade($_) for @{ __PACKAGE__->column_info('status')->{extra}->{list} };

But in my opinion this is kind of a bug. Why are all other strings coming from the database already upgraded, but not this one?

Enums types are not properly create when unicode character is used [rt.cpan.org #123698] #52