LibreCat / Catmandu

Catmandu - a data processing toolkit
https://librecat.org

Crap detector #50

Open phochste opened 10 years ago

phochste commented 10 years ago

Would be nice to have some crap detector software to find bad characters or encoding problems.

E.g.:

- find patterns of double-encoded UTF-8 in the input data
- find valid but illegal control characters (e.g. these darlings kept me busy for some hours today: http://unicode-search.net/unicode-namesearch.pl?term=separator)
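
For the double-encoding case, a rough heuristic could look like this (just a sketch, not existing Catmandu code): decode once as UTF-8, and if the result still fits in Latin-1 and decodes as UTF-8 a second time, the input was probably encoded twice.

use strict;
use warnings;
use Encode qw(decode encode);

# Heuristic: "é" double encoded as UTF-8 shows up as "Ã©". After one
# UTF-8 decode the characters all fit in Latin-1, and re-reading those
# bytes as UTF-8 succeeds again.
sub looks_double_encoded {
    my ($bytes) = @_;
    my $once = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
    return 0 unless defined $once && $once =~ /[^\x00-\x7F]/;
    my $latin1 = eval { encode('ISO-8859-1', $once, Encode::FB_CROAK) };
    return 0 unless defined $latin1;
    my $twice = eval { decode('UTF-8', $latin1, Encode::FB_CROAK) };
    return defined $twice && $twice =~ /[^\x00-\x7F]/;
}

while (my $line = <>) {
    print "suspect: $line" if looks_double_encoded($line);
}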

Any ideas which modules could be of help?

pietsch commented 10 years ago

How about this (culled from https://unix.stackexchange.com/questions/6516/filtering-invalid-utf8)?

perl -l -ne '/
 ^( ([\x00-\x7F])              # 1-byte pattern
   |([\xC2-\xDF][\x80-\xBF])   # 2-byte pattern
   |((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
   |((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))       # 4-byte pattern
  )*$ /x or print'
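
The same test can also be left to Perl's own strict decoder instead of hand-written byte patterns (a variant sketch, not from the original answer): this prints every line that fails to decode as well-formed UTF-8.

perl -MEncode -ne 'my $s = $_; eval { decode("UTF-8", $s, Encode::FB_CROAK); 1 } or print' input.txt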

Regarding control characters, Wikipedia has some material:
https://en.wikipedia.org/wiki/C0_and_C1_control_codes
https://en.wikipedia.org/wiki/Unicode_control_characters

I am not sure what you mean by “valid but illegal”. Does it depend on context?

phochste commented 10 years ago

Yes, along these lines. By "valid but illegal" I meant characters such as 0x1E (Record Separator): it is a valid character, but you wouldn't expect it as a MARC field value (this gave us issues when indexing the data in Solr).
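
A quick scan for such strays could look like this (a sketch meant for extracted field values, not for raw ISO 2709 records, where \x1D-\x1F serve as structural delimiters): it reports any C0 control character other than tab, LF and CR, plus DEL, so a loose 0x1E shows up too.

perl -ne 'print "line $.: $_" if /[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/' values.txt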

I was thinking of some kind of Catmandu Cmd that could analyse a byte stream and provide statistics on the characters used: this many alphanumerics, this many control codes, this many illegal UTF-8 sequences.
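
A minimal sketch of the counting pass such a command could do (plain Perl here, not an actual Catmandu::Cmd):

use strict;
use warnings;
use Encode qw(decode);

# Tally character classes per decoded line; lines that are not valid
# UTF-8 are counted separately instead of being classified.
my %count = map { $_ => 0 } qw(alnum space punct control other bad_utf8);

while (my $line = <>) {
    my $text = eval { decode('UTF-8', $line, Encode::FB_CROAK) };
    if (!defined $text) {
        $count{bad_utf8}++;
        next;
    }
    for my $char (split //, $text) {
        if    ($char =~ /[[:alnum:]]/) { $count{alnum}++   }
        elsif ($char =~ /[[:space:]]/) { $count{space}++   }
        elsif ($char =~ /[[:punct:]]/) { $count{punct}++   }
        elsif ($char =~ /[[:cntrl:]]/) { $count{control}++ }
        else                           { $count{other}++   }
    }
}

printf "%-8s %d\n", $_, $count{$_} for sort keys %count;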

pietsch commented 10 years ago

Will Encode::Guess perhaps do the job?
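
It might; a minimal sketch of wiring it in (the latin1 suspect is just an illustrative choice):

use strict;
use warnings;
use Encode::Guess;

local $/;                 # slurp all input at once
my $data = <>;

# By default guess_encoding() only tries ASCII, UTF-8 and BOMed
# UTF-16/32; further suspects (latin1 here, as an example) must be
# listed explicitly.
my $enc = guess_encoding($data, qw/latin1/);
if (ref $enc) {
    print "looks like ", $enc->name, "\n";
} else {
    print "could not guess: $enc\n";   # $enc holds the error message
}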

On an openSUSE server, you have /usr/bin/guess_encoding. It's a GPLv2 Perl script, but oddly I cannot find its code online, so I put it in a gist.