Problem: Assign file UUIDs to objects fails with win_1252, big5, and shiftjis encodings

ross-spencer commented 6 years ago

In qa/1.x we are seeing the following failures at this stage in the transfer process for standard files:

win_1252:

archivematicaAssignFileUUID.py: INFO      2018-05-28 11:04:37,594  archivematica.mcp.client.assignFileUUID:main:142:  Generated UUID for this file: ce0d9432-3cfe-4580-8233-f871c1391da8.
archivematicaAssignFileUUID.py: WARNING   2018-05-28 11:04:37,598  py.warnings:_warning_check:155:  /usr/local/lib/python2.7/dist-packages/django/db/backends/mysql/base.py:124: Warning: (1300L, u"Invalid utf8 character string: 'F87374'")
  return self.cursor.execute(query, args)

Traceback (most recent call last):
  File "/src/MCPClient/lib/clientScripts/archivematicaAssignFileUUID.py", line 174, in <module>
    sys.exit(main(**vars(args)))
  File "/src/MCPClient/lib/clientScripts/archivematicaAssignFileUUID.py", line 143, in main
    addFileToTransfer(file_path_relative_to_sip, file_uuid, transfer_uuid, event_uuid, date, use=use, sourceType=event_type)
  File "/src/archivematicaCommon/lib/fileOperations.py", line 72, in addFileToTransfer
    eventOutcomeDetailNote="")
  File "/src/archivematicaCommon/lib/databaseFunctions.py", line 172, in insertIntoEvents
    agents = getAMAgentsForFile(fileUUID)
  File "/src/archivematicaCommon/lib/databaseFunctions.py", line 119, in getAMAgentsForFile
    f = File.objects.get(uuid=fileUUID)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/manager.py", line 127, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 328, in get
    num = len(clone)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 144, in __len__
    self._fetch_all()
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 965, in _fetch_all
    self._result_cache = list(self.iterator())
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 254, in iterator
    for row in compiler.results_iter(results):
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/sql/compiler.py", line 800, in results_iter
    row = self.apply_converters(row, converters)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/sql/compiler.py", line 784, in apply_converters
    value = converter(value, expression, self.connection, self.query.context)
  File "/usr/local/lib/python2.7/dist-packages/django/db/backends/mysql/operations.py", line 205, in convert_textfield_value
    value = force_text(value)
  File "/usr/local/lib/python2.7/dist-packages/django/utils/encoding.py", line 102, in force_text
    raise DjangoUnicodeDecodeError(s, *e.args)
django.utils.encoding.DjangoUnicodeDecodeError: 'utf8' codec can't decode byte 0xf8 in position 41: invalid start byte. You passed in '%transferDirectory%objects/windows_1252/s\xf8ster' (<type 'str'>)

Big5:

archivematicaAssignFileUUID.py: INFO      2018-05-28 11:04:33,329  archivematica.mcp.client.assignFileUUID:main:142:  Generated UUID for this file: 4dc138cc-70de-4ce1-9d59-2d0e29307487.
archivematicaAssignFileUUID.py: WARNING   2018-05-28 11:04:33,433  py.warnings:_warning_check:155:  /usr/local/lib/python2.7/dist-packages/django/db/backends/mysql/base.py:124: Warning: (1300L, u"Invalid utf8 character string: 'BC73A6'")
  return self.cursor.execute(query, args)

Traceback (most recent call last):
  File "/src/MCPClient/lib/clientScripts/archivematicaAssignFileUUID.py", line 174, in <module>
    sys.exit(main(**vars(args)))
  File "/src/MCPClient/lib/clientScripts/archivematicaAssignFileUUID.py", line 143, in main
    addFileToTransfer(file_path_relative_to_sip, file_uuid, transfer_uuid, event_uuid, date, use=use, sourceType=event_type)
  File "/src/archivematicaCommon/lib/fileOperations.py", line 72, in addFileToTransfer
    eventOutcomeDetailNote="")
  File "/src/archivematicaCommon/lib/databaseFunctions.py", line 172, in insertIntoEvents
    agents = getAMAgentsForFile(fileUUID)
  File "/src/archivematicaCommon/lib/databaseFunctions.py", line 119, in getAMAgentsForFile
    f = File.objects.get(uuid=fileUUID)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/manager.py", line 127, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 328, in get
    num = len(clone)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 144, in __len__
    self._fetch_all()
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 965, in _fetch_all
    self._result_cache = list(self.iterator())
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 254, in iterator
    for row in compiler.results_iter(results):
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/sql/compiler.py", line 800, in results_iter
    row = self.apply_converters(row, converters)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/sql/compiler.py", line 784, in apply_converters
    value = converter(value, expression, self.connection, self.query.context)
  File "/usr/local/lib/python2.7/dist-packages/django/db/backends/mysql/operations.py", line 205, in convert_textfield_value
    value = force_text(value)
  File "/usr/local/lib/python2.7/dist-packages/django/utils/encoding.py", line 102, in force_text
    raise DjangoUnicodeDecodeError(s, *e.args)
django.utils.encoding.DjangoUnicodeDecodeError: 'utf8' codec can't decode byte 0xbc in position 32: invalid start byte. You passed in '%transferDirectory%objects/big5/\xbcs\xa6{' (<type 'str'>)

shiftjis:

archivematicaAssignFileUUID.py: INFO      2018-05-28 11:04:33,282  archivematica.mcp.client.assignFileUUID:main:142:  Generated UUID for this file: 443e3342-095c-4921-9dc2-5ab9c386c6d3.
archivematicaAssignFileUUID.py: WARNING   2018-05-28 11:04:33,315  py.warnings:_warning_check:155:  /usr/local/lib/python2.7/dist-packages/django/db/backends/mysql/base.py:124: Warning: (1300L, u"Invalid utf8 character string: '82DB82'")
  return self.cursor.execute(query, args)

Traceback (most recent call last):
  File "/src/MCPClient/lib/clientScripts/archivematicaAssignFileUUID.py", line 174, in <module>
    sys.exit(main(**vars(args)))
  File "/src/MCPClient/lib/clientScripts/archivematicaAssignFileUUID.py", line 143, in main
    addFileToTransfer(file_path_relative_to_sip, file_uuid, transfer_uuid, event_uuid, date, use=use, sourceType=event_type)
  File "/src/archivematicaCommon/lib/fileOperations.py", line 72, in addFileToTransfer
    eventOutcomeDetailNote="")
  File "/src/archivematicaCommon/lib/databaseFunctions.py", line 172, in insertIntoEvents
    agents = getAMAgentsForFile(fileUUID)
  File "/src/archivematicaCommon/lib/databaseFunctions.py", line 119, in getAMAgentsForFile
    f = File.objects.get(uuid=fileUUID)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/manager.py", line 127, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 328, in get
    num = len(clone)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 144, in __len__
    self._fetch_all()
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 965, in _fetch_all
    self._result_cache = list(self.iterator())
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/query.py", line 254, in iterator
    for row in compiler.results_iter(results):
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/sql/compiler.py", line 800, in results_iter
    row = self.apply_converters(row, converters)
  File "/usr/local/lib/python2.7/dist-packages/django/db/models/sql/compiler.py", line 784, in apply_converters
    value = converter(value, expression, self.connection, self.query.context)
  File "/usr/local/lib/python2.7/dist-packages/django/db/backends/mysql/operations.py", line 205, in convert_textfield_value
    value = force_text(value)
  File "/usr/local/lib/python2.7/dist-packages/django/utils/encoding.py", line 102, in force_text
    raise DjangoUnicodeDecodeError(s, *e.args)
django.utils.encoding.DjangoUnicodeDecodeError: 'utf8' codec can't decode byte 0x82 in position 37: invalid start byte. You passed in '%transferDirectory%objects/shift_jis/\x82\xdb\x82\xc1\x82\xd5\x82\xe9\x83\x81\x83C\x83\x8b' (<type 'str'>)
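
The three failures can be reproduced outside the database with the byte strings quoted in the tracebacks. The following is a minimal sketch, not Archivematica code: `SAMPLES` and `utf8_failure_offset` are hypothetical names, and the codec attached to each sample is an assumption based on the directory names in the paths.

```python
# Byte strings copied from the three tracebacks, keyed by the codec
# each one is assumed to be in (inferred from the directory name).
SAMPLES = {
    "cp1252": b"%transferDirectory%objects/windows_1252/s\xf8ster",
    "big5": b"%transferDirectory%objects/big5/\xbcs\xa6{",
    "shift_jis": (
        b"%transferDirectory%objects/shift_jis/"
        b"\x82\xdb\x82\xc1\x82\xd5\x82\xe9\x83\x81\x83C\x83\x8b"
    ),
}


def utf8_failure_offset(raw):
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


for codec, raw in sorted(SAMPLES.items()):
    # Decoding with the assumed codec succeeds; strict UTF-8 does not.
    raw.decode(codec)
    print(codec, utf8_failure_offset(raw))
```

The offsets reported this way should line up with the "position" values in the `DjangoUnicodeDecodeError` messages above, which suggests the path bytes reach `force_text` unchanged.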

ross-spencer commented 6 years ago

Related artefactual/archivematica#1104

ross-spencer commented 6 years ago

Some notes:

You could try naively decoding the string through every encoding Python knows about, but the result cannot be trusted:

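from encodings import aliases

def naive_decode(unknown_string):
    # Try every codec alias Python knows; return the first one that decodes without error.
    for alias in aliases.aliases:
        try:
            return unknown_string.decode(alias), alias
        except Exception:
            # Not decodable with this codec (or not a text codec); try the next one.
            continue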

It will decode to something, but not necessarily something intelligible:

Output: Ë8ËÈÁÊ
Tuple: (u'\xcb8\xcb\xc8\xc1\xca', '1140')

Ref for CP1140

You could do something similar with a subset, e.g. ['utf-8', 'ascii', 'cp1252'], but wouldn't that just be weighted toward UTF-8 first, and then toward English and other Latin-alphabet character sets?
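
A minimal sketch of that subset approach, to make the weighting concern concrete. `CANDIDATES` and `decode_with_candidates` are hypothetical names, not existing Archivematica functions, and the candidate order is only the one suggested above.

```python
CANDIDATES = ("utf-8", "ascii", "cp1252")


def decode_with_candidates(raw, candidates=CANDIDATES):
    """Return (text, codec) for the first candidate that decodes raw.

    Returns (None, None) when no candidate can decode the bytes.
    """
    for codec in candidates:
        try:
            return raw.decode(codec), codec
        except UnicodeDecodeError:
            continue
    return None, None


# The win_1252 name decodes as intended via the cp1252 fallback.
print(decode_with_candidates(b"s\xf8ster")[1])
# The Big5 name also "succeeds" as cp1252, but the text is mojibake.
print(decode_with_candidates(b"\xbcs\xa6{")[1])
```

Two things follow from this ordering. Every ASCII string is also valid UTF-8, so the 'ascii' entry is never reached. And cp1252 accepts most byte values, so it quietly claims non-Latin input such as the Big5 name, while the Shift JIS name from the traceback is rejected outright only because it happens to contain 0x81, which Python's cp1252 codec leaves undefined.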

ross-spencer commented 6 years ago

I've traced the issue to here: https://github.com/artefactual/archivematica/blob/4d14d18e319604602be4576df2a9ced60b98ed2e/src/archivematicaCommon/lib/databaseFunctions.py#L118-L122

via:

https://github.com/artefactual/archivematica/blob/051ded0f3e079f43594e0c90e862cc4db33ebd93/src/archivematicaCommon/lib/databaseFunctions.py#L146-L166

which is via:

https://github.com/artefactual/archivematica/blob/051ded0f3e079f43594e0c90e862cc4db33ebd93/src/archivematicaCommon/lib/fileOperations.py#L64-L72

I think this demonstrates that the failure happens on retrieval from the database, but I suspect it is a compound effect of how we store the string in the database in the first place.

More to follow...

ross-spencer commented 6 years ago

An external library, beets, solved a similar problem by handling paths up front (on entry to the DB and on display), and its write-up also examines alternatives such as requiring UTF-8 only. Does this point to some form of pre-conditioning (with provenance)?

Ref: http://beets.io/blog/paths.html
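
One way to picture "pre-conditioning with provenance" is sketched below. This is an illustration under stated assumptions, not the beets implementation and not a proposed Archivematica design: `precondition_path` and the keys of the provenance record are hypothetical, and replacing undecodable bytes with U+FFFD is just one possible policy.

```python
import binascii


def precondition_path(raw_path):
    """Return a UTF-8-safe string plus a record of the original bytes."""
    try:
        text = raw_path.decode("utf-8")
        changed = False
    except UnicodeDecodeError:
        # Undecodable bytes become U+FFFD so the value can be stored as UTF-8.
        text = raw_path.decode("utf-8", "replace")
        changed = True
    provenance = {
        # Hex of the untouched bytes, so the original name stays recoverable.
        "original_bytes_hex": binascii.hexlify(raw_path).decode("ascii"),
        "changed": changed,
    }
    return text, provenance


text, provenance = precondition_path(b"objects/windows_1252/s\xf8ster")
print(provenance["changed"], provenance["original_bytes_hex"])
```

The stored string is then always valid UTF-8, and the hex record keeps the exact on-disk name available for later recovery or reporting.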

ablwr commented 5 years ago

It seems like there may be an opportunity to rectify this when we move to Python 3, then.
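
For reference, a small sketch of the Python 3 mechanism that may help here: the `surrogateescape` error handler, which is also what `os.fsdecode` and `os.fsencode` use on POSIX systems. The helper names `to_text` and `to_bytes` are hypothetical, and this shows only the round trip, not a fix for the database layer.

```python
def to_text(raw):
    """Decode bytes as UTF-8, smuggling invalid bytes through as lone surrogates."""
    return raw.decode("utf-8", "surrogateescape")


def to_bytes(text):
    """Reverse to_text, restoring the original bytes exactly."""
    return text.encode("utf-8", "surrogateescape")


raw = b"s\xf8ster"
text = to_text(raw)
# The round trip is lossless even though the input is not valid UTF-8.
assert to_bytes(text) == raw
```

The escaped string still cannot be encoded as strict UTF-8, so a value like this would need to be pre-conditioned or stored as bytes before it reaches a UTF-8 database column.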

Problem: Assign file UUIDs to objects fails with win_1252, big5, and shiftjis encodings #1367