crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0
19.47k stars 1.62k forks source link

invalid encoding: UTF-8 (ArgumentError) on FreeBSD #3271

Closed zarianu closed 7 years ago

zarianu commented 8 years ago

Hi,

thanks for your effort creating this clean and lean language.

Here's the problem code:

"12345".encode "UTF-8"

here's the exception:

invalid encoding: UTF-8 (ArgumentError)
[4351927] *CallStack::unwind:Array(Pointer(Void)) +87
[4351818] *CallStack#initialize:Array(Pointer(Void)) +10
[4351770] *CallStack::new:CallStack +42
[4335000] *raise<ArgumentError>:NoReturn +24
[4369181] *Iconv#initialize<String, String, Nil>:Nil +349
[4368810] *Iconv::new<String, String, Nil>:Iconv +58
[4364498] *String::encode<Slice(UInt8), String, String, MemoryIO, Nil>:Nil +114
[4364277] *String#encode<String>:Slice(UInt8) +85
[4333210] _start +1658
[4345673] main +41
[4331919] _start +367
[34366562304] ???

UTF-8 is in 'iconv -l' output on my box.

I'm using Crystal 0.18.7 (2016-07-25) from this repo on FreeBSD 10.3-RELEASE-p7 x86_64.

Is it me or some package problems?

Sija commented 8 years ago

Seems like package problems, see https://carc.in/#/r/18sy

ysbaddaden commented 8 years ago

It's certainly related to the FreeBSD port. The iconv integration may require some FreeBSD specific fixes. I believe encoding specs didn't pass, but I didn't verify much further, believing it failed because of the unsupported legacy multibytes encodings used in specs.

If a FreeBSD user can take a better look?

zarianu commented 8 years ago

@ysbaddaden Yes, sure, what should I do?

ysbaddaden commented 8 years ago

Try to further down where the error is actually raised and if it fails with other encodings too. It seems to be an ArgumentError so maybe listing available encodings isn't working correctly?

zarianu commented 8 years ago

It fails with other encodings too:

"12345".encode "CP1251"
invalid encoding: CP1251 (ArgumentError)
[4351927] *CallStack::unwind:Array(Pointer(Void)) +87
[4351818] *CallStack#initialize:Array(Pointer(Void)) +10
[4351770] *CallStack::new:CallStack +42
[4335000] *raise<ArgumentError>:NoReturn +24
[4369181] *Iconv#initialize<String, String, Nil>:Nil +349
[4368810] *Iconv::new<String, String, Nil>:Iconv +58
[4364498] *String::encode<Slice(UInt8), String, String, MemoryIO, Nil>:Nil +114
[4364277] *String#encode<String>:Slice(UInt8) +85
[4333210] _start +1658
[4345673] main +41
[4331919] _start +367
[34366562304] ???

As I can see, it fails in iconv.cr here:

    Errno.value = 0
    @iconv = LibC.iconv_open(to, from)
    if Errno.value != 0
      if original_from == "UTF-8"
        raise ArgumentError.new("invalid encoding: #{original_to}")

So it looks like some problems with libiconv bindings.

Is there any way to run specs myself on my box to gather more info?

ysbaddaden commented 8 years ago

Thank you!

There should be enough resources from the docs or our CONTRIBUTE.md to compile Crystal from sources, then running crystal spec spec/std/...

zarianu commented 8 years ago

Had a little time tinkering with that problem, here's my findings.

In fact, iconv_open libc call is working fine, it returns conversion descriptor as it should according to man iconv_open. The problem is with error handling code, looks like on FreeBSD errno is never set to zero after successful call.

man errno excerpt:

... When a system call detects an error, it returns an integer value indicating failure (usually -1) and sets the variable errno accordingly. (This allows interpretation of the failure on receiving a -1 and to take action accordingly.) Successful calls never set errno; once set, it remains until another error occurs. It should only be examined after an error. ...

I commented out exception raising code after iconv_open in src/iconv.cr for test and it worked like a charm - string was converted!

So, I think error handling code has to be corrected to check for LibC.iconv_open return value first.

man iconv_open excerpt:

... Upon successful completion of iconv_open(), it returns a conversion descriptor. Otherwise, iconv_open() returns (iconv_t)-1 and sets errno to indicate the error. ...

And of course, problem scope is much wider than iconv, as it relates to error handling code.

zarianu commented 8 years ago

Here's the output of specs run:

10993 examples, 6 failures, 41 errors, 11 pending

Failed examples:

crystal spec ./spec/std/string_spec.cr:1900 # String encode raises if illegal byte sequence
crystal spec ./spec/std/string_spec.cr:1910 # String encode raises if incomplete byte sequence
crystal spec ./spec/std/io/io_spec.cr:522 # IO encoding decode raises on incomplete byte sequence
crystal spec ./spec/std/io/io_spec.cr:530 # IO encoding decode says invalid byte sequence
crystal spec ./spec/std/io/io_spec.cr:658 # IO encoding encode raises on invalid byte sequence
crystal spec ./spec/std/io/io_spec.cr:673 # IO encoding encode raises on incomplete byte sequence
crystal spec ./spec/std/file_spec.cr:696 # File encoding writes with encoding
crystal spec ./spec/std/file_spec.cr:703 # File encoding reads with encoding
crystal spec ./spec/std/file_spec.cr:710 # File encoding opens with encoding
crystal spec ./spec/std/file_spec.cr:719 # File encoding does each line with encoding
crystal spec ./spec/std/file_spec.cr:728 # File encoding reads lines with encoding
crystal spec ./spec/std/string_spec.cr:1883 # String encode encodes
crystal spec ./spec/std/string_spec.cr:1906 # String encode doesn't raise on invalid byte sequence
crystal spec ./spec/std/string_spec.cr:1916 # String encode doesn't raise if incomplete byte sequence
crystal spec ./spec/std/string_spec.cr:1920 # String encode decodes
crystal spec ./spec/std/string_spec.cr:1925 # String encode decodes with skip
crystal spec ./spec/std/io/buffered_spec.cr:285 # IO::Buffered encoding decode gets_to_end
crystal spec ./spec/std/io/buffered_spec.cr:293 # IO::Buffered encoding decode gets
crystal spec ./spec/std/io/buffered_spec.cr:303 # IO::Buffered encoding decode gets big string
crystal spec ./spec/std/io/buffered_spec.cr:314 # IO::Buffered encoding decode gets big GB2312 string
crystal spec ./spec/std/io/buffered_spec.cr:324 # IO::Buffered encoding decode reads char
crystal spec ./spec/std/io/io_spec.cr:423 # IO encoding decode gets_to_end
crystal spec ./spec/std/io/io_spec.cr:430 # IO encoding decode gets
crystal spec ./spec/std/io/io_spec.cr:440 # IO encoding decode gets big string
crystal spec ./spec/std/io/io_spec.cr:450 # IO encoding decode gets big GB2312 string
crystal spec ./spec/std/io/io_spec.cr:461 # IO encoding decode gets with limit
crystal spec ./spec/std/io/io_spec.cr:468 # IO encoding decode gets with limit (small, no newline)
crystal spec ./spec/std/io/io_spec.cr:475 # IO encoding decode gets with limit (big)
crystal spec ./spec/std/io/io_spec.cr:482 # IO encoding decode gets with string delimiter
crystal spec ./spec/std/io/io_spec.cr:492 # IO encoding decode reads char
crystal spec ./spec/std/io/io_spec.cr:502 # IO encoding decode reads utf8 byte
crystal spec ./spec/std/io/io_spec.cr:512 # IO encoding decode reads utf8
crystal spec ./spec/std/io/io_spec.cr:538 # IO encoding decode skips invalid byte sequences
crystal spec ./spec/std/io/io_spec.cr:566 # IO encoding decode does skips when converting to UTF-8
crystal spec ./spec/std/io/io_spec.cr:572 # IO encoding decode decodes incomplete multibyte sequence with skip (#3285)
crystal spec ./spec/std/io/io_spec.cr:581 # IO encoding encode prints a string
crystal spec ./spec/std/io/io_spec.cr:590 # IO encoding encode prints numbers
crystal spec ./spec/std/io/io_spec.cr:608 # IO encoding encode prints bool
crystal spec ./spec/std/io/io_spec.cr:617 # IO encoding encode prints char
crystal spec ./spec/std/io/io_spec.cr:625 # IO encoding encode prints symbol
crystal spec ./spec/std/io/io_spec.cr:633 # IO encoding encode prints big int
crystal spec ./spec/std/io/io_spec.cr:641 # IO encoding encode puts
crystal spec ./spec/std/io/io_spec.cr:650 # IO encoding encode printf
crystal spec ./spec/std/io/io_spec.cr:666 # IO encoding encode skips on invalid byte sequence
crystal spec ./spec/std/io/memory_io_spec.cr:287 # MemoryIO encoding decode gets_to_end
crystal spec ./spec/std/io/memory_io_spec.cr:294 # MemoryIO encoding decode gets
crystal spec ./spec/std/io/memory_io_spec.cr:303 # MemoryIO encoding decode reads char

Specs failed with the same exception raised (ArgumentError in Iconv#initialize).

ysbaddaden commented 8 years ago

We (almost) always check the return value, but it appears we don't for for iconv. Thanks for reducing the issue, I'm fixing the errno handling for iconv_open, iconv_close and iconv.

zarianu commented 8 years ago

@ysbaddaden Just tried 0.19.3 on FreeBSD 10 and bug is not gone :( Here's code from 0.19.3 FreeBSD package stdlib iconv.cr (with old buggy-ish error handling):

    Errno.value = 0
    @iconv = LibC.iconv_open(to, from)
    if Errno.value != 0
      if original_from == "UTF-8"
        raise ArgumentError.new("invalid encoding: #{original_to}")
      elsif original_to == "UTF-8"
        raise ArgumentError.new("invalid encoding: #{original_from}")
      else
        raise ArgumentError.new("invalid encoding: #{original_from} -> #{original_to}")
      end
    end

Do you build FreeBSD package manually or have this automated via your CI? Thanks for great job, anyway.

ysbaddaden commented 8 years ago

Oh. Something wrong, indeed. I'll rebuild the tarball!

ysbaddaden commented 8 years ago

@zarianu I rebuilt and reuploaded the tarballs for 0.19.2 and 0.19.3 that were wrong. Tell me if it works for you (eventually). I still have failing specs related to specs, but they seem related to the non supported GB2312 encoding.

zarianu commented 8 years ago

@ysbaddaden Everything works for initial code I attached to this issue, thanks, but I'm getting same exception when I'm trying to use HTTP::Client:

invalid encoding: utf-8 (ArgumentError)
[4551911] *CallStack::unwind:Array(Pointer(Void)) +87
[4551802] *CallStack#initialize:Array(Pointer(Void)) +10
[4551754] *CallStack::new:CallStack +42
[4507144] *raise<ArgumentError>:NoReturn +24
[4592827] *Iconv#initialize<String, String, (Symbol | Nil)>:Nil +571
[4592236] *Iconv::new<String, String, (Symbol | Nil)>:Iconv +92
[4785775] *IO::Decoder#initialize<IO::EncodingOptions>:Bool +95
[4785650] *IO::Decoder::new<IO::EncodingOptions>:IO::Decoder +210
[4777988] *Zlib::Inflate +164
[4781757] *Zlib::Inflate +93
[4914291] *HTTP::Client::Response#consume_body_io:Nil +787
[4912076] *HTTP::Client::Response::from_io<(OpenSSL::SSL::Socket+ | TCPSocket+), Bool, Bool>:HTTP::Client::Response +1788
[4828142] *HTTP::Client#exec_internal<HTTP::Request>:HTTP::Client::Response +174
[4827702] *HTTP::Client#exec<HTTP::Request>:HTTP::Client::Response +38
[4826982] *HTTP::Client#exec<String, String, Nil, Nil>:HTTP::Client::Response +22
[4826951] *HTTP::Client#get<String>:HTTP::Client::Response +39
[4465804] _start +2332
[4521961] main +41
[4463839] _start +367
[34367119360] ???

I will try to dig down on this one now ;)

BTW, FreeBSD's iconv supports GB2312 encoding, so maybe failed specs are associated with some other issues (invalid byte count or something):

[kost@chef]% iconv -l | grep 2312
EUC-CN CN-GB CSGB3212 EUCCN GB2312
HZ HZ-GB-2312 HZ-GB2312 HZ8
zarianu commented 8 years ago

Looks like FreeBSD libc doesn't support GNU extensions to iconv interface (//IGNORE & //TRANSLIT), so there need to be some workaround (see __iconv() in FreeBSD man, or maybe link with GNU libiconv on FreeBSD platform?).

FreeBSD iconv man Linux iconv man

Too bad my days of low level/C programming has gone - I should have found this out much earlier.

sdogruyol commented 7 years ago

Since https://github.com/crystal-lang/crystal/pull/3379 is merged, can we close this?