Closed ethervoid closed 9 years ago
We have to make a bunch of regression tests to validate this change don't break anything
I will implement some tests to verify Chardet works in delicate cases. The idea is to ease Chardet updates: if future versions of Chardet break this tests and we'll know something is wrong.
@rafatower @ethervoid
Found a list of Chardet supported encodings. There are over a dozen. Should we create tests for all of them?
The list:
Big5, GB2312/GB18030, EUC-TW, HZ-GB-2312, and ISO-2022-CN (Traditional and Simplified Chinese) EUC-JP, SHIFT_JIS, and ISO-2022-JP (Japanese) EUC-KR and ISO-2022-KR (Korean) KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, and windows-1251 (Russian) ISO-8859-2 and windows-1250 (Hungarian) ISO-8859-5 and windows-1251 (Bulgarian) windows-1252 ISO-8859-7 and windows-1253 (Greek) ISO-8859-8 and windows-1255 (Visual and Logical Hebrew) TIS-620 (Thai) UTF-32 BE, LE, 3412-ordered, or 2143-ordered (with a BOM) UTF-16 BE or LE (with a BOM) UTF-8 (with or without a BOM) ASCII
How long would it take to you to develop all of them? what's the benefit of having all of them? and the cost of not having some of them?
How long would it take to develop just one test for a supported encoding? and adding then another one? I suspect that the cost of the first is far greater than the cost of adding the 2nd, 3rd and so on... what do you think?
@ethervoid I don't see the relationship between this error and the description of the issue ERROR 1: no connection to the server
@rafatower
I think that the time spent on this will be worth it.
The only time that worries me is the time spent doing these tests on CI. I know that there has been an effort to bring that time down do it would be a pity to add "useless" tests.
Maybe we could get some info from support to know which of these encodings are actually used and which ones not?
cc @iriberri
Tests will be almost an order of magnitude faster.
This test battery was requested by Javi Santana in order to upgrade chardet to a new version to avoid possible errors with working things.
The ogr2ogr error is misleading (2015-07-28 16:39:28 UTC: Detected encoding IBM855) because at the end the error is related to the detected encoding (maybe I should have put a better description for this problems.
Another useful piece of information:
http://www.postgresql.org/docs/9.4/static/multibyte.html
I'm also wondering if going through pgbouncer will respect that option in the environment. I'd say it doesn't.
@rafatower @ethervoid
In that case all good from my side! Will start developing tests for all encodings in supported list.
No idea about the encodings, we don't check those things. We sometimes have encoding errors in Shapefile imports in which the importer recognises encodings like IBMXXXX, but there's not really an encoding that people uses commonly and fails or something like that.
What we want to have is regression tests to make the update of chardet python package secure. Chardet's job is to check what type of encoding the SHP file have.
@ethervoid I think we also use it in csv_normalizer
so I guess it for CSV too :smile:
@rafatower @ethervoid
So my current plan is to find or forge csv and shp encoded in all of the listed encodings and check the result with the ground truth.
Any idea of how to obtain or forge these easily would be greatly appreciated
@ethervoid:
@rafatower 's link points to chardet's battery of tests. We've had a small chat and reached the conclusion that tests should be as close to the source as posible. If we find any test is missing, we should contribute it to chadet's repo.
Could you provide some context for this?
Also, is it posible that when referencing chardet you instead were thinking about charlock_holmes? I can't find any gen called chardet, just the python library.
It was chardet for sure (the python package). The request was to make regression test to avoid break anything when we update the gem (I think that's the main concern for @javisantana)
@ethervoid
What gem? I can't find it in the Gemfile. Are we actually using it?
Not a gem, it's a python package
Apparently it is only used for shp_normalizer.py called from shp_normalizer.rb to figure encoding and SRID (it does not actually normalize anything BTW).
Since this is covered in the chardet tests for python package, I don't think we should add specific tests in our ruby code for it.
Am I missing something @ethervoid @javisantana? I'd close this ticket.
Also, the error mentioned in the issue description can come from this:
$ PGCLIENTENCODING=IBM855 psql -U postgres tests
psql: FATAL: invalid value for parameter "client_encoding": "IBM855"
because this is an unsupported charset as stated in postgres docs and in the code itself
I bet this error is expressed as ERROR 1: no connection to the server
when going through ogr2ogr2 and/or pgbouncer in a dedicated server.
The encoding issue can also come from other places figuring out the encoding, such as here
So let's first clearly define the problem we're trying to solve and change the issue description accordingly, please.
OK but this was solved for me using the last commit from chardet at the time because the IBMXX detection comes from chardet so if chardet works properly this should never happen
@ethervoid glad to hear it's solved!
From the conversation we had: In principle we should update chardet only when needed and trust their tests. We should test for a bit after updating chardet and should we find any bug, report to them and revert to the previous version.
I'm closing this ticket since there is nothing to be fixed. If needed we can reopen it or create another one in the future.
We need to update the
chardet
package to pin-point to the last commit to fix error messages like: