Create tests to verify Chardet is working correctly for rare encodings

ethervoid commented 9 years ago

We need to update the chardet package to pin-point to the last commit to fix error messages like:

2015-07-28 16:39:25 UTC: File-based import load
2015-07-28 16:39:28 UTC: Detected encoding IBM855
2015-07-28 16:39:28 UTC: Using database connection with {:adapter=>"postgres", :encoding=>"unicode", :host=>"127.0.0.1", :database=>"cartodb_staging_user_8fd80c17-146e-4d4a-83a9-23b90d447358_db", :username=>"postgres", :port=>6432, :user=>"cartodb_staging_user_8fd80c17-146e-4d4a-83
2015-07-28 16:39:31 UTC: ogr2ogr call:            OSM_USE_CUSTOM_INDEXING=NO PG_USE_COPY=YES PGCLIENTENCODING=IBM855  /usr/bin/ogr2ogr -f PostgreSQL   PG:"host=127.0.0.1 port=6432 user=cartodb_staging_user_8fd80c17-146e-4d4a-83a9-23b90d447358 dbname=cartodb_staging_user_8fd80c17-14
2015-07-28 16:39:31 UTC: ogr2ogr output:          ERROR 1: no connection to the server

ERROR 1: PostgreSQL driver doesn't currently support database creation.
Please create database with the `createdb' command.```

ethervoid commented 9 years ago

We have to make a bunch of regression tests to validate this change don't break anything

gfiorav commented 9 years ago

I will implement some tests to verify Chardet works in delicate cases. The idea is to ease Chardet updates: if future versions of Chardet break this tests and we'll know something is wrong.

gfiorav commented 9 years ago

@rafatower @ethervoid

Found a list of Chardet supported encodings. There are over a dozen. Should we create tests for all of them?

The list:

Big5, GB2312/GB18030, EUC-TW, HZ-GB-2312, and ISO-2022-CN (Traditional and Simplified Chinese) EUC-JP, SHIFT_JIS, and ISO-2022-JP (Japanese) EUC-KR and ISO-2022-KR (Korean) KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, and windows-1251 (Russian) ISO-8859-2 and windows-1250 (Hungarian) ISO-8859-5 and windows-1251 (Bulgarian) windows-1252 ISO-8859-7 and windows-1253 (Greek) ISO-8859-8 and windows-1255 (Visual and Logical Hebrew) TIS-620 (Thai) UTF-32 BE, LE, 3412-ordered, or 2143-ordered (with a BOM) UTF-16 BE or LE (with a BOM) UTF-8 (with or without a BOM) ASCII

rafatower commented 9 years ago

How long would it take to you to develop all of them? what's the benefit of having all of them? and the cost of not having some of them?

How long would it take to develop just one test for a supported encoding? and adding then another one? I suspect that the cost of the first is far greater than the cost of adding the 2nd, 3rd and so on... what do you think?

@ethervoid I don't see the relationship between this error and the description of the issue ERROR 1: no connection to the server

gfiorav commented 9 years ago

@rafatower

I think that the time spent on this will be worth it.

The only time that worries me is the time spent doing these tests on CI. I know that there has been an effort to bring that time down do it would be a pity to add "useless" tests.

Maybe we could get some info from support to know which of these encodings are actually used and which ones not?

cc @iriberri

rafatower commented 9 years ago

Tests will be almost an order of magnitude faster.

ethervoid commented 9 years ago

This test battery was requested by Javi Santana in order to upgrade chardet to a new version to avoid possible errors with working things.

The ogr2ogr error is misleading (2015-07-28 16:39:28 UTC: Detected encoding IBM855) because at the end the error is related to the detected encoding (maybe I should have put a better description for this problems.

rafatower commented 9 years ago

Another useful piece of information:

https://github.com/CartoDB/cartodb/blob/03fb5e41e84c8cfa8072d8b41f8f6041d8f8affe/services/importer/lib/importer/ogr2ogr.rb#L166-L168

http://www.postgresql.org/docs/9.4/static/multibyte.html

I'm also wondering if going through pgbouncer will respect that option in the environment. I'd say it doesn't.

gfiorav commented 9 years ago

@rafatower @ethervoid

In that case all good from my side! Will start developing tests for all encodings in supported list.

iriberri commented 9 years ago

No idea about the encodings, we don't check those things. We sometimes have encoding errors in Shapefile imports in which the importer recognises encodings like IBMXXXX, but there's not really an encoding that people uses commonly and fails or something like that.

ethervoid commented 9 years ago

What we want to have is regression tests to make the update of chardet python package secure. Chardet's job is to check what type of encoding the SHP file have.

gfiorav commented 9 years ago

@ethervoid I think we also use it in csv_normalizer so I guess it for CSV too :smile:

gfiorav commented 9 years ago

@rafatower @ethervoid

So my current plan is to find or forge csv and shp encoded in all of the listed encodings and check the result with the ground truth.

Any idea of how to obtain or forge these easily would be greatly appreciated

rafatower commented 9 years ago

https://github.com/chardet/chardet/tree/master/tests

gfiorav commented 9 years ago

@ethervoid:

@rafatower 's link points to chardet's battery of tests. We've had a small chat and reached the conclusion that tests should be as close to the source as posible. If we find any test is missing, we should contribute it to chadet's repo.

Could you provide some context for this?

Also, is it posible that when referencing chardet you instead were thinking about charlock_holmes? I can't find any gen called chardet, just the python library.

ethervoid commented 9 years ago

It was chardet for sure (the python package). The request was to make regression test to avoid break anything when we update the gem (I think that's the main concern for @javisantana)

gfiorav commented 9 years ago

@ethervoid

What gem? I can't find it in the Gemfile. Are we actually using it?

ethervoid commented 9 years ago

Not a gem, it's a python package

rafatower commented 9 years ago

Apparently it is only used for shp_normalizer.py called from shp_normalizer.rb to figure encoding and SRID (it does not actually normalize anything BTW).

Since this is covered in the chardet tests for python package, I don't think we should add specific tests in our ruby code for it.

Am I missing something @ethervoid @javisantana? I'd close this ticket.

rafatower commented 9 years ago

Also, the error mentioned in the issue description can come from this:

$ PGCLIENTENCODING=IBM855 psql -U postgres tests                                                                                                                                                            
psql: FATAL:  invalid value for parameter "client_encoding": "IBM855"

because this is an unsupported charset as stated in postgres docs and in the code itself

I bet this error is expressed as ERROR 1: no connection to the server when going through ogr2ogr2 and/or pgbouncer in a dedicated server.

The encoding issue can also come from other places figuring out the encoding, such as here

So let's first clearly define the problem we're trying to solve and change the issue description accordingly, please.

ethervoid commented 9 years ago

OK but this was solved for me using the last commit from chardet at the time because the IBMXX detection comes from chardet so if chardet works properly this should never happen

gfiorav commented 9 years ago

@ethervoid glad to hear it's solved!

From the conversation we had: In principle we should update chardet only when needed and trust their tests. We should test for a bit after updating chardet and should we find any bug, report to them and revert to the previous version.

I'm closing this ticket since there is nothing to be fixed. If needed we can reopen it or create another one in the future.

CartoDB / cartodb

Create tests to verify Chardet is working correctly for rare encodings #4753