phochste opened this issue 10 years ago
How about this (culled from https://unix.stackexchange.com/questions/6516/filtering-invalid-utf8)?

```
perl -l -ne '/
 ^( ([\x00-\x7F])              # 1-byte pattern
  |([\xC2-\xDF][\x80-\xBF])    # 2-byte pattern
  |((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))   # 3-byte pattern
  |((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))         # 4-byte pattern
 )*$ /x or print'
```
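If you want the same check inside a script (e.g. to report which lines are bad), here's a possible translation, assuming a file of raw bytes; the file name `records.mrc` is just an example:

```perl
#!/usr/bin/env perl
# Sketch: apply the same UTF-8 byte patterns as the one-liner above,
# reading raw bytes line by line and reporting offending line numbers.
use strict;
use warnings;

my $utf8_line = qr/
    ^( [\x00-\x7F]                                          # 1-byte pattern
     | [\xC2-\xDF][\x80-\xBF]                               # 2-byte pattern
     | (?: \xE0[\xA0-\xBF] | \xED[\x80-\x9F]
         | [\xE1-\xEC\xEE-\xEF][\x80-\xBF] ) [\x80-\xBF]    # 3-byte pattern
     | (?: \xF0[\x90-\xBF] | [\xF1-\xF3][\x80-\xBF]
         | \xF4[\x80-\x8F] ) [\x80-\xBF]{2}                 # 4-byte pattern
    )*$
/x;

open my $fh, '<:raw', 'records.mrc' or die $!;   # read bytes, not characters
while (my $line = <$fh>) {
    chomp $line;
    print "invalid UTF-8 on line $.\n" if $line !~ $utf8_line;
}
close $fh;
```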
Regarding control fields, Wikipedia has some material: https://en.wikipedia.org/wiki/C0_and_C1_control_codes and https://en.wikipedia.org/wiki/Unicode_control_characters
I am not sure what you mean by “valid but illegal”. Does it depend on context?
Yes, along these lines. By “valid but illegal” I meant characters that are valid UTF-8, e.g. 0x1E (Record Separator), but that you wouldn't expect as a MARC field value (this gave us issues when indexing the data in Solr).
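Something like this could flag them (an untested sketch; `$value` stands for an already decoded field value):

```perl
use strict;
use warnings;

# Flag C0/C1 control characters except tab (0x09), LF (0x0A) and CR (0x0D),
# plus DEL. $value is a placeholder for a decoded MARC field value.
my $value = "foo\x{1E}bar";   # example with an embedded Record Separator
while ($value =~ /([\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F])/g) {
    printf "control character U+%04X at offset %d\n", ord($1), pos($value) - 1;
}
```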
I was thinking of some kind of Catmandu Cmd that could analyze a byte stream and give you statistics on the characters used: this many alphanumerics, this many control codes, this many illegal UTF-8 sequences.
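A rough sketch of the counting core (not an actual Catmandu command; it just reads bytes from STDIN and tallies character classes):

```perl
#!/usr/bin/env perl
# Sketch of the proposed analysis: tally alphanumerics, control codes and
# lines that fail to decode as UTF-8, over a byte stream on STDIN.
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

my %count;
while (my $line = <STDIN>) {
    chomp $line;
    my $chars = eval { decode('UTF-8', $line, FB_CROAK | LEAVE_SRC) };
    unless (defined $chars) {
        $count{'invalid utf8 lines'}++;
        next;
    }
    for my $c (split //, $chars) {
        if    ($c =~ /[[:alnum:]]/)                            { $count{alnum}++ }
        elsif ($c =~ /[\x00-\x1F\x7F-\x9F]/ && $c !~ /[\t\r]/) { $count{control}++ }
        else                                                   { $count{other}++ }
    }
}
printf "%-20s %d\n", $_, $count{$_} for sort keys %count;
```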
Will Encode::Guess perhaps do the job?
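Something along these lines, perhaps (untested; the suspect list and sample bytes are made up, and note that overlapping suspects such as latin1 plus cp1252 together usually come back as "ambiguous"):

```perl
use strict;
use warnings;
use Encode::Guess qw(latin1);   # suspects to try besides ascii and utf8

my $bytes = "caf\xE9";          # made-up sample: 'café' encoded as Latin-1
my $guess = Encode::Guess->guess($bytes);
if (ref $guess) {
    # success: we got back an Encode::Encoding object
    printf "looks like %s: %s\n", $guess->name, $guess->decode($bytes);
}
else {
    print "no unambiguous guess: $guess\n";   # error string, not an object
}
```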
On an openSUSE server, you have /usr/bin/guess_encoding. It's a GPLv2 Perl script but weirdly I cannot find the code online, so I put it in a gist.
It would be nice to have some crap-detector software to find bad characters or encoding problems.
E.g.:

- find patterns of double-encoded UTF-8 in the input data (see the sketch below)
- find valid but illegal control fields (e.g. these darlings kept me busy for some hours today: http://unicode-search.net/unicode-namesearch.pl?term=separator)
Any ideas which modules could be of help?
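For the double-encoding case, a rough heuristic using only the core Encode module might be a starting point (the helper name and the sample bytes are mine, and genuinely Latin-1-looking text can still produce false positives):

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Heuristic: if a byte string decodes as UTF-8, and the resulting characters
# map back through Latin-1 to bytes that decode as UTF-8 once more, the data
# was probably encoded twice.
sub looks_double_encoded {
    my ($bytes) = @_;
    my $once = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return 0 unless defined $once;
    return 0 unless $once =~ /[^\x00-\x7F]/;   # pure ASCII proves nothing
    my $latin1 = eval { encode('ISO-8859-1', $once, FB_CROAK | LEAVE_SRC) };
    return 0 unless defined $latin1;
    my $twice = eval { decode('UTF-8', $latin1, FB_CROAK | LEAVE_SRC) };
    return defined $twice ? 1 : 0;
}

# 'é' encoded to UTF-8 twice becomes the bytes C3 83 C2 A9 ("Ã©" in UTF-8).
print looks_double_encoded("\xC3\x83\xC2\xA9") ? "suspect\n" : "ok\n";
```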