brianmario / charlock_holmes

Character encoding detection, brought to you by ICU
MIT License

"jobs".detect_encoding! => ArgumentError: unknown encoding name - IBM420_ltr #38

Open tjoneseng opened 11 years ago

tjoneseng commented 11 years ago

I don't know if this is a charlock_holmes issue or an ICU issue. For some reason the magic string "jobs" causes an explosion. I have no idea why it would insist that is IBM420_ltr (which I've never even heard of before) but the singular version is not:

"job".detect_encoding => {:type=>:text, :encoding=>"UTF-8", :confidence=>10}

I tried giving it a hint but no-go:

"jobs".detect_encoding("UTF-8") => {:type=>:text, :encoding=>"IBM420_ltr", :confidence=>60, :language=>"ar"}
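
As a possible workaround for the ArgumentError, here is a minimal sketch that checks whether Ruby knows the detected name before using it. It assumes the error comes from Ruby not recognizing "IBM420_ltr" as an encoding name; ruby_encoding_for is a hypothetical helper, not part of charlock_holmes, and the UTF-8 fallback is an arbitrary choice.

```ruby
# Hypothetical helper (not part of charlock_holmes): map a detection hash
# to a Ruby Encoding, falling back when Ruby does not know the name.
def ruby_encoding_for(detection, fallback: Encoding::UTF_8)
  name = detection && detection[:encoding]
  return fallback if name.nil?

  Encoding.find(name)
rescue ArgumentError
  # Encoding.find raises ArgumentError for names Ruby has no encoding for,
  # such as "IBM420_ltr".
  fallback
end

# The hash below copies the detection result reported above.
detection = { type: :text, encoding: 'IBM420_ltr', confidence: 60, language: 'ar' }
text = 'jobs'.dup.force_encoding(ruby_encoding_for(detection))
```

This only avoids the exception; it does not make the detection itself any more accurate.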

brianmario commented 10 years ago

This is definitely interesting... Have you tried this on different ICU versions to see if it's any different?

augus-zz commented 9 years ago

When I run CharlockHolmes::Converter.convert content, "IBM420_ltr", 'UTF-8' I get an error: ArgumentError (U_FILE_ACCESS_ERROR)

stanhu commented 8 years ago

If you have built libicu with the --with-data-packaging=files configuration option, you may need to set the ICU_DATA environment variable. See http://userguide.icu-project.org/icudata for more details.

We ran into the same issue at GitLab: https://gitlab.com/gitlab-org/gitlab-ce/issues/17415#note_13867854
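
To check what ICU_DATA currently points at, something like the following may help. It is only a sketch: it assumes ICU data is stored under names starting with "icudt" (a .dat file or a directory of files), which is my understanding of ICU's naming and is not stated in this thread. icu_data_candidates is a hypothetical helper.

```ruby
# Hypothetical helper: list entries in a directory that look like ICU data.
# Assumption: ICU data files or directories have names beginning with "icudt".
def icu_data_candidates(dir = ENV['ICU_DATA'])
  return [] if dir.nil? || dir.empty? || !File.directory?(dir)

  Dir.glob(File.join(dir, 'icudt*')).sort
end

# Prints an empty array when ICU_DATA is unset or contains no such entries.
puts icu_data_candidates.inspect
```

An empty result suggests the variable is unset or points at a directory without standalone data files.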

phillipp commented 6 years ago

@stanhu

Seems like I'm the only idiot that doesn't know where I find the f**** data directory. What the hell do I set the env variable to? Searching for this directory for hours now...

stanhu commented 6 years ago

@phillipp What platform are you using? On Ubuntu 16.04, I see it in /usr/lib/x86_64-linux-gnu:

$ dpkg -L libicu55
/.
/usr
/usr/share
/usr/share/lintian
/usr/share/lintian/overrides
/usr/share/lintian/overrides/libicu55
/usr/share/doc
/usr/share/doc/libicu55
/usr/share/doc/libicu55/copyright
/usr/share/doc/libicu55/changelog.Debian.gz
/usr/share/doc/libicu55/NEWS.Debian.gz
/usr/lib
/usr/lib/x86_64-linux-gnu
/usr/lib/x86_64-linux-gnu/libicui18n.so.55.1
/usr/lib/x86_64-linux-gnu/libicutest.so.55.1
/usr/lib/x86_64-linux-gnu/libiculx.so.55.1
/usr/lib/x86_64-linux-gnu/libicutu.so.55.1
/usr/lib/x86_64-linux-gnu/libicudata.so.55.1
/usr/lib/x86_64-linux-gnu/libicule.so.55.1
/usr/lib/x86_64-linux-gnu/libicuuc.so.55.1
/usr/lib/x86_64-linux-gnu/libicuio.so.55.1
/usr/lib/x86_64-linux-gnu/libicudata.so.55
/usr/lib/x86_64-linux-gnu/libicui18n.so.55
/usr/lib/x86_64-linux-gnu/libicuuc.so.55
/usr/lib/x86_64-linux-gnu/libicutest.so.55
/usr/lib/x86_64-linux-gnu/libicule.so.55
/usr/lib/x86_64-linux-gnu/libiculx.so.55
/usr/lib/x86_64-linux-gnu/libicuio.so.55
/usr/lib/x86_64-linux-gnu/libicutu.so.55

On MacOS:

$ brew list icu4c
/usr/local/Cellar/icu4c/59.1/bin/derb
/usr/local/Cellar/icu4c/59.1/bin/genbrk
/usr/local/Cellar/icu4c/59.1/bin/gencfu
/usr/local/Cellar/icu4c/59.1/bin/gencnval
/usr/local/Cellar/icu4c/59.1/bin/gendict
/usr/local/Cellar/icu4c/59.1/bin/genrb
/usr/local/Cellar/icu4c/59.1/bin/icu-config
/usr/local/Cellar/icu4c/59.1/bin/icuinfo
/usr/local/Cellar/icu4c/59.1/bin/makeconv
/usr/local/Cellar/icu4c/59.1/bin/pkgdata
/usr/local/Cellar/icu4c/59.1/bin/uconv
/usr/local/Cellar/icu4c/59.1/include/unicode/ (175 files)
/usr/local/Cellar/icu4c/59.1/lib/libicudata.59.1.dylib
/usr/local/Cellar/icu4c/59.1/lib/libicui18n.59.1.dylib
/usr/local/Cellar/icu4c/59.1/lib/libicuio.59.1.dylib
/usr/local/Cellar/icu4c/59.1/lib/libicutest.59.1.dylib
/usr/local/Cellar/icu4c/59.1/lib/libicutu.59.1.dylib
/usr/local/Cellar/icu4c/59.1/lib/libicuuc.59.1.dylib
/usr/local/Cellar/icu4c/59.1/lib/icu/ (4 files)
/usr/local/Cellar/icu4c/59.1/lib/pkgconfig/ (3 files)
/usr/local/Cellar/icu4c/59.1/lib/ (18 other files)
/usr/local/Cellar/icu4c/59.1/sbin/escapesrc
/usr/local/Cellar/icu4c/59.1/sbin/genccode
/usr/local/Cellar/icu4c/59.1/sbin/gencmn
/usr/local/Cellar/icu4c/59.1/sbin/gennorm2
/usr/local/Cellar/icu4c/59.1/sbin/gensprep
/usr/local/Cellar/icu4c/59.1/sbin/icupkg
/usr/local/Cellar/icu4c/59.1/share/icu/ (4 files)
/usr/local/Cellar/icu4c/59.1/share/man/ (14 files)

phillipp commented 6 years ago

@stanhu Ouch, I thought it would be some kind of data file, not a lib, and looked in /usr/share. Thanks for the help!

kmayer commented 6 years ago

I just got bitten by this, too. I believe that ICU can't handle short strings well.

CharlockHolmes::EncodingDetector.detect("Esha")
=> {:type=>:text,
 :encoding=>"IBM424_ltr",
 :ruby_encoding=>"binary",
 :confidence=>60,
 :language=>"he"}

My solution here is to grab a larger section of the work, analyze and convert all at once and hope for the best. It has, so far, helped.

Here's some sample code from a CSV file uploader...

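  def initialize(http_uploaded_file)
    # Read the raw bytes once, in binary mode, so that detection and
    # conversion both operate on exactly the same data.
    http_uploaded_file.to_io.binmode
    content = http_uploaded_file.read
    detection = CharlockHolmes::EncodingDetector.detect(content)
    # Convert the whole upload to UTF-8 using the detected encoding.
    @text = CharlockHolmes::Converter.convert(content, detection[:encoding], 'UTF-8')
  end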
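
For short strings like "Esha", another option is to skip detection whenever the bytes are already valid UTF-8. The sketch below is an assumption on my part rather than anything charlock_holmes provides; utf8_or_detect is a hypothetical helper, and the detector call is left to the caller's block.

```ruby
# Hypothetical helper: trust the bytes as UTF-8 when they are valid UTF-8,
# and only hand them to a detector (the block) when they are not.
def utf8_or_detect(bytes)
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  yield bytes
end

# Possible usage with the gem (not executed here):
#   utf8_or_detect(raw) do |b|
#     detection = CharlockHolmes::EncodingDetector.detect(b)
#     CharlockHolmes::Converter.convert(b, detection[:encoding], 'UTF-8')
#   end
```

The trade-off is that input in another encoding which happens to form valid UTF-8 byte sequences will be kept as UTF-8 without ever reaching the detector.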

sok44 commented 5 years ago

Good day.

I am on Ubuntu 18.04. When I call CharlockHolmes::Converter.convert content, "IBM420_ltr", 'UTF-8' from the gem, I get the error U_FILE_ACCESS_ERROR. I use Vagrant. ICU 60.2 was already installed; I don't know how it was built (with --with-data-packaging=files?). After reading the comments, I decided to set ICU_DATA.

vagrant@rails-dev-box:/vagrant/UploadFiles$ locate "icu"

/usr/lib/x86_64-linux-gnu/libicudata.so.60
/usr/lib/x86_64-linux-gnu/libicudata.so.60.2
/usr/lib/x86_64-linux-gnu/libicui18n.so.60
/usr/lib/x86_64-linux-gnu/libicui18n.so.60.2
/usr/lib/x86_64-linux-gnu/libicuio.so.60
/usr/lib/x86_64-linux-gnu/libicuio.so.60.2
/usr/lib/x86_64-linux-gnu/libicutest.so.60
/usr/lib/x86_64-linux-gnu/libicutest.so.60.2
/usr/lib/x86_64-linux-gnu/libicutu.so.60
/usr/lib/x86_64-linux-gnu/libicutu.so.60.2
/usr/lib/x86_64-linux-gnu/libicuuc.so.60
/usr/lib/x86_64-linux-gnu/libicuuc.so.60.2
/usr/share/doc/libicu60
/usr/share/doc/libicu60/changelog.Debian.gz
/usr/share/doc/libicu60/copyright
/usr/share/lintian/overrides/libicu60
/usr/src/linux-headers-4.15.0-36/include/dt-bindings/interrupt-controller/mvebu-icu.h
/var/lib/dpkg/info/libicu60:amd64.list
/var/lib/dpkg/info/libicu60:amd64.md5sums
/var/lib/dpkg/info/libicu60:amd64.shlibs
/var/lib/dpkg/info/libicu60:amd64.triggers

Based on the comment above, I set ICU_DATA=/usr/lib/x86_64-linux-gnu/. It did not help. Setting it to /usr/share/icu/ or /usr/share/icu/60.2/ did not help either. I added export ICU_DATA=/usr/lib/x86_64-linux-gnu/ to the /etc/environment file, and the env command shows that ICU_DATA is set.

I also tried it on Ubuntu 16.04. There, following the instructions, I ran apt-get install libicu-dev before installing the gem. The gem installed, but the error remained. I also tried running sudo bundle config build.charlock_holmes with --with-icu-lib=/usr/lib/x86_64-linux-gnu/ or --with-icu-dir=/usr/lib/x86_64-linux-gnu/.

The error occurs on a 35-character string.

I am a novice, and I do not understand: which files should the ICU_DATA path point to? And which files should --with-icu-dir point to? Or how do I properly reinstall ICU?