Open ross-spencer opened 6 years ago
Related artefactual/archivematica#1104
Some notes:
You could try decoding the string naively through every encoding in Python, but:
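    from encodings import aliases

    def naive_decode(unknown_string):
        # Try every codec alias Python knows; return the first that decodes.
        for alias in aliases.aliases:
            try:
                return unknown_string.decode(alias), alias
            except Exception:
                # This codec cannot decode the input; try the next one.
                pass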
It will decode to something, but not necessarily something intelligible:
Output: Ë8ËÈÁÊ
Tuple: (u'\xcb8\xcb\xc8\xc1\xca', '1140')
Ref for CP1140
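To illustrate why the naive loop always "succeeds", here is a minimal sketch (Python 3, hypothetical helper name `decodable_as`, made-up sample bytes) that collects every codec able to decode the same input without error. Single-byte code pages map nearly every byte value to some character, so many of them accept arbitrary input.

```python
from encodings import aliases


def decodable_as(raw_bytes):
    """Return {codec_name: decoded_text} for every codec that decodes raw_bytes."""
    results = {}
    # aliases.aliases maps alias -> canonical codec name; de-duplicate the values.
    for codec in sorted(set(aliases.aliases.values())):
        try:
            results[codec] = raw_bytes.decode(codec)
        except Exception:
            # Broad on purpose: codecs fail in different ways (UnicodeError,
            # LookupError for non-text or platform-specific codecs, ...).
            continue
    return results


sample = b"caf\xe9"  # illustrative input: "café" encoded as cp1252
results = decodable_as(sample)
print(len(results), "codecs decoded the sample without error")
print(repr(results["cp1252"]))
```

A successful decode therefore says very little about whether the chosen codec is the right one.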
You could do something similar with a subset: ['utf-8', 'ascii', 'cp1252']
but then wouldn't this just be weighted toward utf-8 first, and then toward English and other Latin-alphabet character sets?
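A rough sketch of the subset approach, to make the bias concrete (Python 3; `decode_with_candidates` and `CANDIDATE_ENCODINGS` are hypothetical names, not anything in Archivematica). Note that in this order 'ascii' can never win, because any valid ASCII input is also valid UTF-8.

```python
CANDIDATE_ENCODINGS = ["utf-8", "ascii", "cp1252"]


def decode_with_candidates(raw_bytes, candidates=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first candidate that decodes raw_bytes.

    Returns (None, None) if no candidate can decode the input.
    """
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None, None


# A Big5-encoded character is not valid UTF-8 or ASCII, but cp1252 accepts it.
big5_bytes = "中".encode("big5")
text, encoding = decode_with_candidates(big5_bytes)
print(encoding, repr(text))
```

The Big5 input is reported as cp1252 and comes back as unrelated Latin characters, which is the weighting problem described above: anything that is not UTF-8 tends to fall through to the permissive Latin code page.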
I've traced the issue to here: https://github.com/artefactual/archivematica/blob/4d14d18e319604602be4576df2a9ced60b98ed2e/src/archivematicaCommon/lib/databaseFunctions.py#L118-L122
via:
which is via:
I think this demonstrates that the failure happens on retrieval from the database, but I suspect it is a compound effect of how we're storing the string in the database.
More to follow...
An external library solved a similar problem by handling issues up-front (at entry to the DB and on display), but it also examines alternatives such as requiring UTF-8 only. Does this point to some form of pre-conditioning (with provenance)?
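One way to picture "pre-conditioning with provenance" is the sketch below. It is purely illustrative: the `ConditionedName` record and `precondition` function are hypothetical, and this is not how the external library or Archivematica works. The idea is to normalize to Unicode at entry while keeping the original bytes and the assumed encoding alongside the stored value.

```python
import base64
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditionedName:
    text: str             # normalized Unicode text, safe to store as UTF-8
    source_encoding: str  # encoding assumed when decoding (an assumption, not detected)
    original_b64: str     # original bytes, base64-encoded so any DB charset can hold them
    lossy: bool           # True if undecodable bytes were replaced with U+FFFD


def precondition(raw_bytes, assumed_encoding="utf-8"):
    """Decode at entry and keep enough provenance to recover the original bytes."""
    try:
        text = raw_bytes.decode(assumed_encoding)
        lossy = False
    except UnicodeDecodeError:
        text = raw_bytes.decode(assumed_encoding, errors="replace")
        lossy = True
    original_b64 = base64.b64encode(raw_bytes).decode("ascii")
    return ConditionedName(text, assumed_encoding, original_b64, lossy)


record = precondition(b"caf\xe9")  # cp1252 bytes, wrongly assumed to be UTF-8
print(record.lossy, base64.b64decode(record.original_b64) == b"caf\xe9")
```

Because the original bytes travel with the record, a wrong guess at entry can still be corrected later without data loss.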
It seems like there may be an opportunity to rectify this when we move to Python3, then.
In qa/1.x we are seeing the following failures at this stage in the transfer process for standard files:
win_1252:
Big5:
shiftjis: