caveman-dick / rubyripper

Automatically exported from code.google.com/p/rubyripper

incompatible character encodings: UTF-8 and ASCII-8BIT #449

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
1) Please describe the steps to reproduce the situation:
a. Start rrip_gui from terminal
b. Rip a Chinese CD encoding with flac, with UTF-8 Chinese tags and filename.

2) What is the expected output? What do you see instead?
I see errors :P

3) What version of rubyripper are you using? On what operating system? The
gtk2 or commandline interface?
rubyripper version: 0.6.0 latest from git
Operating System: Arch Linux
interface: gtk2

4) Is this not already fixed with the latest & greatest code? See for
instructions the Source tab above.
Nope.

5) Does the problem happen with all discs? If not, please attach
the output of cdparanoia -Q with a disc that gives trouble.
Nope.

6) Please explain why this change is important for you. Also, how many
users would benefit from this change?
It's a bug. Anyone ripping a CJK CD will probably face this problem. So yeah, 
it's very important.

--------------------------------------------------------------------

I was ripping a Chinese CD and got some problems regarding the string encoding.
Let me start with what I did to debug the problem:

Replace line 2334 in rr_lib.rb:

    command +="flac #{@settings['flacsettings']} -o \"#{filename}\" #{tags}\
\"#{@out.getTempFile(track, 1)}\""

with this block of code (doing the += step by step and printing the encoding after each step):

    command +="flac "
    command +="#{@settings['flacsettings']} "
    STDERR.printf("debug: %s :: %s\n", command, command.encoding)
    command += "-o \"#{filename}\" "
    STDERR.printf("debug: %s :: %s\n", command, command.encoding)
    command += "#{tags} "
    STDERR.printf("debug: %s :: %s\n", command, command.encoding)
    s = "\"#{@out.getTempFile(track, 1)}\"" # NOTE
    STDERR.printf("debug: %s :: %s\n", s, s.encoding.to_s) # NOTE
    command += "\"#{@out.getTempFile(track, 1)}\"" # Error here
    STDERR.printf("debug: %s :: %s\n", command, command.encoding)
    command += " 2>&1" unless @settings['verbose']
    STDERR.printf("debug: %s :: %s\n", command, command.encoding)

And I got some debug info:

debug: flac --best -V  :: US-ASCII
debug: flac --best -V -o "/tmp/flac/周杰伦/[2005] 十一月的肖邦/01. 夜曲.flac"  :: ASCII-8BIT
debug: flac --best -V -o "/tmp/flac/周杰伦/[2005] 十一月的肖邦/01. 夜曲.flac" --tag ALBUM="十一月的肖邦" --tag DATE="2005" --tag GENRE="Pop" --tag DISCID="A70C310C" --tag ARTIST="周杰伦" --tag TITLE="夜曲" --tag TRACKNUMBER=1 --tag TRACKTOTAL=12  :: ASCII-8BIT
debug: "/tmp/flac/周杰伦/temp_sr0/track1_1.wav" :: UTF-8
/usr/lib/ruby/site_ruby/1.9.1/rr_lib.rb:2344:in `flac': incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
    from /usr/lib/ruby/site_ruby/1.9.1/rr_lib.rb:2233:in `doFlac'
    from /usr/lib/ruby/site_ruby/1.9.1/rr_lib.rb:2177:in `encodeTrack'
    from /usr/lib/ruby/site_ruby/1.9.1/rr_lib.rb:2136:in `block in startEncoding'
    from /usr/lib/ruby/site_ruby/1.9.1/rr_lib.rb:2133:in `each'
    from /usr/lib/ruby/site_ruby/1.9.1/rr_lib.rb:2133:in `startEncoding'
    from /usr/lib/ruby/site_ruby/1.9.1/rr_lib.rb:2125:in `addTrack'
    from /usr/lib/ruby/site_ruby/1.9.1/rr_lib.rb:1825:in `ripTrack'
    from /usr/lib/ruby/site_ruby/1.9.1/rr_lib.rb:1795:in `block in ripTracks'
    from /usr/lib/ruby/site_ruby/1.9.1/rr_lib.rb:1791:in `each'
    from /usr/lib/ruby/site_ruby/1.9.1/rr_lib.rb:1791:in `ripTracks'
    from /usr/lib/ruby/site_ruby/1.9.1/rr_lib.rb:1785:in `initialize'
    from /usr/lib/ruby/site_ruby/1.9.1/rr_lib.rb:2490:in `new'
    from /usr/lib/ruby/site_ruby/1.9.1/rr_lib.rb:2490:in `startRip'
    from /usr/bin/rrip_gui:305:in `block in do_rip'

I noticed two things here:

    1. command.force_encoding("UTF-8") does not keep command in UTF-8. Its encoding can change after each += operation, because the result takes its encoding from both operands.
    2. The getTempFile method seems to always return a UTF-8 string (in my case anyway; my system locale is en_US.utf8), while filename and tags here are ASCII-8BIT. Joining the two raises the error above.

    Here is a script that better demonstrates my point:

        #!/usr/bin/env ruby
        a = "\xe5\x91\xa8".force_encoding("UTF-8")
        b = "\xe6\x9d\xb0\xe4\xbc\xa6".force_encoding("UTF-8")
        c = String.new.force_encoding("UTF-8")
        c += "\xe5\x91\xa8"
        printf("a: %s ::: %s\n", a, a.encoding)
        printf("b: %s ::: %s\n", b, b.encoding)
        printf("c: %s ::: %s\n", c, c.encoding)
        a += b
        printf("a+b: %s ::: %s\n", a, a.encoding)
        a += "#{b} #{c}" # Error here
        printf("a: %s ::: %s\n", a, a.encoding)

    The output:

        $ ruby foo.rb
        a: 周 ::: UTF-8
        b: 杰伦 ::: UTF-8
        c: 周 ::: ASCII-8BIT
        a+b: 周杰伦 ::: UTF-8
        foo.rb:11:in `<main>': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
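
On Ruby 2.0 and later the default source encoding is UTF-8, so the script above no longer yields an ASCII-8BIT `c` on its own. A minimal sketch of the same two observations for newer rubies, assuming the binary string is created explicitly with `String#b` (the helper name `demo_concat` is hypothetical and not part of rubyripper):

```ruby
# Hypothetical helper: shows how String#+ negotiates the result encoding.
def demo_concat
  ascii  = "flac --best ".dup.force_encoding("UTF-8") # ASCII-only, tagged UTF-8
  binary = "\xE5\x91\xA8".b                            # bytes of "周", tagged ASCII-8BIT
  utf8   = "\xE6\x9D\xB0".dup.force_encoding("UTF-8")  # "杰", tagged UTF-8

  # ASCII-only + non-ASCII binary: no error, but the result is ASCII-8BIT.
  mixed = ascii + binary

  # Non-ASCII binary + non-ASCII UTF-8: Ruby cannot pick a common encoding.
  error = begin
    mixed + utf8
    nil
  rescue Encoding::CompatibilityError => e
    e.class
  end

  [ascii.encoding, mixed.encoding, error]
end

p demo_concat
```

The first concatenation succeeds and silently changes the encoding; only the second one, where both sides contain non-ASCII bytes, raises.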

I've noticed that string encoding issues keep coming up every now and then.
Maybe I'm being totally ignorant, but my question is: why not use "# encoding: 
utf-8" in the first place? Won't this "magic comment" make both ruby and 
everyone happy?
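
For context on that question: the magic comment sets the encoding of string literals in the file that carries it. Strings that arrive from elsewhere (files, sockets, subprocess output) keep whatever tag the I/O layer gives them. A small sketch, using a hypothetical helper name and a temporary file standing in for external data:

```ruby
require "tempfile"

# Hypothetical helper: compares a literal with the same bytes read from disk.
def literal_vs_external
  literal = "\u5468" # tagged with the source encoding of this file

  Tempfile.create("tag") do |f|
    f.binmode
    f.write(literal)
    f.flush

    as_binary = File.binread(f.path)                 # tagged ASCII-8BIT
    as_utf8   = File.read(f.path, encoding: "UTF-8") # tagged UTF-8 on read

    # Same bytes, different tags: the two strings do not compare equal.
    [literal.encoding, as_binary.encoding, as_utf8.encoding, as_binary == as_utf8]
  end
end

p literal_vs_external
```

So the magic comment alone would presumably not help with strings that enter the program as raw bytes; those still need to be tagged or transcoded where they are read.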

Best regards

Original issue reported on code.google.com by loliloli...@gmail.com on 16 Oct 2010 at 2:14

GoogleCodeExporter commented 8 years ago
For the sake of simplicity, run this foo.rb instead.
(Can't "edit post" here, :/)

Original comment by loliloli...@gmail.com on 16 Oct 2010 at 2:25

Attachments:

GoogleCodeExporter commented 8 years ago
I've attached a _temporary_ fix, in case anyone needs it.
BUT, I highly doubt this force_encoding fix should have been employed at all...
Again, I'd like to ask: What's the problem with "# encoding: utf-8" exactly?

Original comment by loliloli...@gmail.com on 16 Oct 2010 at 2:41

Attachments:

GoogleCodeExporter commented 8 years ago
This probably has to do with the fact that the data in freedb is not always 
UTF-8. While refactoring rubyripper, I discovered a problem where the 
Freedb source is encoded in ISO-8859-1. Rubyripper wrongly treats this as 
UTF-8 and the result is the mess you're seeing.

The ISO-8859-1 case is already fixed, but perhaps there is another encoding 
here. You can help by providing the discid for this disc. The result from 
cd-discid or discid for this disc will do great ;) I can write a test to ensure 
the encoding will be fine from now on.
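
The mislabelling described above can be sketched as follows. This is not rubyripper's actual code; `freedb_to_utf8` is a hypothetical helper that assumes a record is either valid UTF-8 or ISO-8859-1:

```ruby
# Hypothetical helper, not rubyripper's implementation.
# Keeps the bytes as UTF-8 when they validate, otherwise assumes
# ISO-8859-1 and transcodes to UTF-8.
def freedb_to_utf8(raw)
  candidate = raw.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

latin1_bytes = "T\xFCrlich".b     # "Türlich" as ISO-8859-1 bytes
utf8_bytes   = "T\xC3\xBCrlich".b # the same word as UTF-8 bytes

p freedb_to_utf8(latin1_bytes) == freedb_to_utf8(utf8_bytes)
```

The ISO-8859-1 branch never fails, because every byte is a valid ISO-8859-1 character. A record in some third encoding would therefore come out garbled rather than raise an error.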

Original comment by boukewou...@gmail.com on 30 Nov 2010 at 8:17

GoogleCodeExporter commented 8 years ago
Sorry for the delay.

The discid is A70C310C.
I see from the web search the encoding is the same garbage. :/

Original comment by loliloli...@gmail.com on 19 Dec 2010 at 7:12

GoogleCodeExporter commented 8 years ago
I see the issue now. The freedb server is serving the file as 
"charset=iso-8859-1" while it's actually GB18030, e.g.

  $ curl http://www.freedb.org/freedb/data/a70c310c | iconv -f GB18030 -t UTF8 | less
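
The same conversion can be done inside Ruby. A minimal sketch, assuming the record body is already in memory as raw bytes (the helper name `gb18030_to_utf8` is hypothetical, and the sample bytes are generated locally rather than fetched from freedb):

```ruby
# Hypothetical helper: reinterpret raw bytes as GB18030 and transcode them
# to UTF-8, mirroring `iconv -f GB18030 -t UTF8`.
def gb18030_to_utf8(raw)
  raw.dup.force_encoding(Encoding::GB18030).encode(Encoding::UTF_8)
end

# Build sample GB18030 bytes locally; .b drops the encoding tag, which is
# how the bytes look when they arrive without a trustworthy charset.
sample = "\u5468\u6770\u4F26".encode(Encoding::GB18030).b

p gb18030_to_utf8(sample)
```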

Original comment by loliloli...@gmail.com on 19 Dec 2010 at 7:16

GoogleCodeExporter commented 8 years ago
Technically this is a freedb error, since only UTF-8 records are allowed in the 
spec. Older files may still be encoded as ISO-8859-1. Anything else is a buggy 
record. But I guess users just want to have a fix ;)

Original comment by boukewou...@gmail.com on 19 Dec 2010 at 8:02

GoogleCodeExporter commented 8 years ago
I had the same bug, and the fix utf-8-fix.diff corrects it for French audio 
CDs.

Original comment by mayeu....@gmail.com on 25 Dec 2010 at 11:42

GoogleCodeExporter commented 8 years ago
I have the same problem with rubyripper 0.6.2, flac disabled, a German CD with 
umlauts (ü) and the freedb gateway from musicbrainz. First I got:

Should all tracks be ripped ? (y/n)  [y] 
Tracks to rip are 1 2 3 4
/usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:1471:in `expand_path': incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:1471:in `giveDir'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:1447:in `block in setDirectory'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:1445:in `each'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:1445:in `setDirectory'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:1385:in `initialize'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:2528:in `new'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:2528:in `settingsOk'
    from /usr/bin/rrip_cli:410:in `prepareRip'
    from /usr/bin/rrip_cli:331:in `showFreedbOptions'
    from /usr/bin/rrip_cli:305:in `showFreedb'
    from /usr/bin/rrip_cli:265:in `handleFreedb'
    from /usr/bin/rrip_cli:244:in `get_cd_info'
    from /usr/bin/rrip_cli:47:in `initialize'
    from /usr/bin/rrip_cli:486:in `new'
    from /usr/bin/rrip_cli:486:in `<main>'

Then I restarted rrip_cli and manually entered all titles with umlauts, which 
resulted in:

01 - Türlich, türlich (sicher, Dicker)
02 - Ich heb mich ab
03 - Nur der Zorn zählt
04 - Session mit Don Dougie (1998)

ADVANCED TOC ANALYSIS (with cdrdao)
...please be patient, this may take a while

/usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:1726:in `join': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:1726:in `getFile'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:1016:in `writeFileLine'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:1005:in `block in createCuesheet'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:981:in `each'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:981:in `createCuesheet'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:958:in `block in allCodecs'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:954:in `each'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:954:in `allCodecs'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:950:in `initialize'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:511:in `new'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:511:in `updateSettings'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:2565:in `waitForToc'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:2537:in `startRip'
    from /usr/bin/rrip_cli:412:in `prepareRip'
    from /usr/bin/rrip_cli:331:in `showFreedbOptions'
    from /usr/bin/rrip_cli:305:in `showFreedb'
    from /usr/bin/rrip_cli:397:in `editTrackInfo'
    from /usr/bin/rrip_cli:333:in `showFreedbOptions'
    from /usr/bin/rrip_cli:305:in `showFreedb'
    from /usr/bin/rrip_cli:363:in `editDiscInfo'
    from /usr/bin/rrip_cli:332:in `showFreedbOptions'
    from /usr/bin/rrip_cli:305:in `showFreedb'
    from /usr/bin/rrip_cli:265:in `handleFreedb'
    from /usr/bin/rrip_cli:244:in `get_cd_info'
    from /usr/bin/rrip_cli:169:in `edit_settings'
    from /usr/bin/rrip_cli:45:in `initialize'
    from /usr/bin/rrip_cli:486:in `new'
    from /usr/bin/rrip_cli:486:in `<main>'

After restarting rrip again, it runs until just after the TOC analysis:

ADVANCED TOC ANALYSIS (with cdrdao)
...please be patient, this may take a while

/usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:1726:in `join': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:1726:in `getFile'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:1016:in `writeFileLine'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:1005:in `block in createCuesheet'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:981:in `each'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:981:in `createCuesheet'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:958:in `block in allCodecs'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:954:in `each'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:954:in `allCodecs'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:950:in `initialize'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:511:in `new'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:511:in `updateSettings'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:2565:in `waitForToc'
    from /usr/lib64/ruby/site_ruby/1.8/rr_lib.rb:2537:in `startRip'
    from /usr/bin/rrip_cli:427:in `dir_exists'
    from /usr/bin/rrip_cli:447:in `update'
    from /usr/bin/rrip_cli:414:in `prepareRip'
    from /usr/bin/rrip_cli:331:in `showFreedbOptions'
    from /usr/bin/rrip_cli:305:in `showFreedb'
    from /usr/bin/rrip_cli:265:in `handleFreedb'
    from /usr/bin/rrip_cli:244:in `get_cd_info'
    from /usr/bin/rrip_cli:47:in `initialize'
    from /usr/bin/rrip_cli:486:in `new'
    from /usr/bin/rrip_cli:486:in `<main>'
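
Both traces fail at the point where path components are combined (`expand_path` and `join`). One possible workaround is to give every component the same tag before joining. A sketch under that assumption; `as_utf8` and `join_utf8` are hypothetical helpers, not functions from rr_lib.rb:

```ruby
# Hypothetical helper: tag a string as UTF-8 when its bytes are valid UTF-8,
# and return it unchanged otherwise.
def as_utf8(str)
  candidate = str.dup.force_encoding(Encoding::UTF_8)
  candidate.valid_encoding? ? candidate : str
end

# Hypothetical wrapper around File.join that normalizes each component first.
def join_utf8(*parts)
  File.join(*parts.map { |part| as_utf8(part) })
end

artist = "T\u00FCrlich"  # UTF-8, e.g. typed by the user
album  = "Z\u00E4hlt".b  # same kind of text, but tagged ASCII-8BIT

path = join_utf8("/tmp", artist, album)
p [path, path.encoding]
```

This only re-tags bytes that already are valid UTF-8; it does not transcode data that is really in another encoding.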

Original comment by fschm...@gmail.com on 10 Sep 2012 at 10:29

GoogleCodeExporter commented 8 years ago
I've verified that this issue still exists as of today (2014-02-17).  I've 
fixed the issue for all non-EOL rubies (1.9, 2.0 and 2.1 tested).  The change 
is pretty simple, just two lines.  The fix can be seen on my clone.  I've 
ripped several CDs with diacritical marks (e.g. 
http://www.freedb.org/freedb/rock/8f0e690d) with expected results.

Original comment by t...@wuest.me on 17 Feb 2014 at 11:33

GoogleCodeExporter commented 8 years ago
A similar crash occurs ripping a home-made disk for which NO FREEDB ENTRY 
EXISTS.  So, at least in my case, encoding of data from freedb may not have 
anything to do with the problem.

I typed track names into the gui, using some non-ascii Unicode characters 
entered with Linux compose-key sequences (e.g. ç ő á ó é).

(rrip_gui 0.6.2 - the latest in the GetDb repo for Ubuntu 15.04)

/usr/lib/ruby/1.8/rr_lib.rb:1659:in `gsub!': incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string) (Encoding::CompatibilityError)
    from /usr/lib/ruby/1.8/rr_lib.rb:1659:in `allFilter'
    from /usr/lib/ruby/1.8/rr_lib.rb:1636:in `tagFilter'
    from /usr/lib/ruby/1.8/rr_lib.rb:1597:in `block in setMetadata'
    from /usr/lib/ruby/1.8/rr_lib.rb:1596:in `times'
    from /usr/lib/ruby/1.8/rr_lib.rb:1596:in `setMetadata'
    from /usr/lib/ruby/1.8/rr_lib.rb:1479:in `attemptDirCreation'
    from /usr/lib/ruby/1.8/rr_lib.rb:1685:in `overwriteDir'
    from /usr/lib/ruby/1.8/rr_lib.rb:2661:in `overwriteDir'
    from /usr/bin/rrip_cli:428:in `dir_exists'
    from /usr/bin/rrip_cli:447:in `update'
    from /usr/bin/rrip_cli:414:in `prepareRip'
    from /usr/bin/rrip_cli:331:in `showFreedbOptions'
    from /usr/bin/rrip_cli:305:in `showFreedb'
    from /usr/bin/rrip_cli:265:in `handleFreedb'
    from /usr/bin/rrip_cli:244:in `get_cd_info'
    from /usr/bin/rrip_cli:47:in `initialize'
    from /usr/bin/rrip_cli:486:in `new'
    from /usr/bin/rrip_cli:486:in `<main>'
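
The regexp variant of the error can be reproduced in isolation. A minimal sketch, assuming the typed text reaches the filter tagged as ASCII-8BIT; `strip_cedilla` is a hypothetical stand-in for the kind of substitution `allFilter` performs, not its actual code:

```ruby
# Hypothetical helper: apply a UTF-8 regexp to a string that may be tagged
# ASCII-8BIT, re-tagging it first when its bytes are valid UTF-8.
def strip_cedilla(str)
  text = str.dup
  text.force_encoding(Encoding::UTF_8) if text.encoding == Encoding::ASCII_8BIT
  raise ArgumentError, "not valid UTF-8" unless text.valid_encoding?

  text.gsub(/\u00E7/, "c")
end

typed = "gar\u00E7on".b # non-ASCII input, tagged ASCII-8BIT

begin
  typed.gsub(/\u00E7/, "c") # UTF-8 regexp against a binary string
rescue Encoding::CompatibilityError => e
  puts "raw string fails: #{e.class}"
end

puts strip_cedilla(typed)
```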

Original comment by jim.av...@gmail.com on 14 Jul 2015 at 12:59

GoogleCodeExporter commented 8 years ago
P.S. I think this is a regression.  I've typed non-ASCII characters into 
rrip_gui track names many times in the past.  This is the first time using this 
version, though.

Original comment by jim.av...@gmail.com on 14 Jul 2015 at 1:01