archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0

Problem: Change name microservice gets confused with similar names with diacritics (Café and Café) causing a trickle down effect until METS generation failure #1352

Open ross-spencer opened 3 years ago

ross-spencer commented 3 years ago

Expected behaviour

Current behaviour

These two names look the same:

But in hex they look as follows:

43 61 66 65 cc 81 => |Cafe...|
43 61 66 c3 a9    => |Caf...|

We can break them down into their Unicode code points and see that the two 'e' letters with diacritics are built from different characters:

* LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
* LATIN SMALL LETTER E WITH ACUTE
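The difference between the two byte sequences can be checked with Python's standard `unicodedata` module. This is a minimal sketch in Python 3, not code from Archivematica; the helper name `code_point_names` is made up for illustration.

```python
import unicodedata

# The two byte sequences from the hex dump, decoded as UTF-8.
decomposed = b"\x43\x61\x66\x65\xcc\x81".decode("utf-8")  # 'e' + combining accent
precomposed = b"\x43\x61\x66\xc3\xa9".decode("utf-8")     # single accented 'e'


def code_point_names(text):
    """Return the Unicode character name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]


print(code_point_names(decomposed))
print(code_point_names(precomposed))

# The strings render identically but do not compare equal ...
print(decomposed == precomposed)
# ... unless both are brought to the same normalization form first.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

The first string has five code points and the second has four, which is why a plain string comparison treats them as different names.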

This ready-made transfer will cause the transfer process to fail once extracted and run: cafe_fail.zip


Steps to reproduce

Run the two files in the zip above as a transfer. It will fail during METS generation in the ingest tab as two results are returned from the database for the same file: MultipleObjectsReturned: get() returned more than one File -- it returned 2!

If you look through the other microservice jobs you will also see various database failures, e.g. in file format identification one of the file objects cannot be found: DoesNotExist: File matching query does not exist.

Your environment (version of Archivematica, operating system, other relevant details)

Archivematica 1.12. Docker, and at a client site.

Additional context

The change name microservice seems to be creating this issue as we know that diacritics are not preserved on the file system and instead are maintained in the database and metadata.

By the time these files are processed and written to the database they end up with the same current location, even though their other properties differ (nb, the pipe alignment in the cells below hasn't been introduced by me).

mysql> select * from Files where currentLocation like "%Ca%";
+--------------------------------------+-----------------------------------+-----------------------------------+------------+-------------+------------------------------------------------------------------+----------+-------+----------------------------+-------------+---------+--------------------------------------+--------------+----------------------------+
| fileUUID                             | originalLocation                  | currentLocation                   | fileGrpUse | fileGrpUUID | checksum                                                         | fileSize | label | enteredSystem              | removedTime | sipUUID | transferUUID                         | checksumType | modificationTime           |
+--------------------------------------+-----------------------------------+-----------------------------------+------------+-------------+------------------------------------------------------------------+----------+-------+----------------------------+-------------+---------+--------------------------------------+--------------+----------------------------+
| 9177f186-ea97-4e64-b86a-112871e3da71 | %transferDirectory%objects/Café | %transferDirectory%objects/Cafe_1 | original   |             | 4d9e3739df17dc6f3d6723a659e16c4cdcfe16467b1fc88c54412f41b553024a |        7 |       | 2021-01-27 22:54:48.741064 | NULL        | NULL    | 07b758e9-d772-435a-a476-4c162166bd9d | sha256       | 2021-01-27 22:54:48.741034 |
| da0a5258-b396-4acc-9734-9db619e1fc59 | %transferDirectory%objects/Café  | %transferDirectory%objects/Cafe_1 | original   |             | ab4ff0780be67e1eef32bd012331f8896311f5fbe326c1d65dc542b99987aca3 |        6 |       | 2021-01-27 22:54:48.728059 | NULL        | NULL    | 07b758e9-d772-435a-a476-4c162166bd9d | sha256       | 2021-01-27 22:54:48.728026 |
+--------------------------------------+-----------------------------------+-----------------------------------+------------+-------------+------------------------------------------------------------------+----------+-------+----------------------------+-------------+---------+--------------------------------------+--------------+----------------------------+
2 rows in set (0.00 sec)
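The rows above can be checked for this kind of collision by grouping file UUIDs by current location. This is only an illustrative sketch: the function name `find_location_collisions` is hypothetical, and it assumes that a later lookup by current location is what makes `get()` return two File objects, which the issue does not confirm.

```python
from collections import defaultdict

# (fileUUID, currentLocation) pairs copied from the query result above.
rows = [
    ("9177f186-ea97-4e64-b86a-112871e3da71", "%transferDirectory%objects/Cafe_1"),
    ("da0a5258-b396-4acc-9734-9db619e1fc59", "%transferDirectory%objects/Cafe_1"),
]


def find_location_collisions(rows):
    """Map each current location shared by more than one file to its UUIDs."""
    by_location = defaultdict(list)
    for file_uuid, current_location in rows:
        by_location[current_location].append(file_uuid)
    return {loc: uuids for loc, uuids in by_location.items() if len(uuids) > 1}


collisions = find_location_collisions(rows)
for location, uuids in collisions.items():
    print(location, "is shared by", len(uuids), "files")
```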

I have tracked this down as far as I have energy for this evening. The correct data all seems to exist up until here in the workflow, at which point the dictionary lookup fails for the information the microservice job is trying to find.

Here is some logging I put in to confirm this:

sanitizeobjectnames_v0.0: ERROR     2021-01-27 23:11:36,850  archivematica.mcp.client.sanitizeObjectNames.apply_file_updates:142  Dictionary: {u'%transferDirectory%objects/Caf\xe9': u'%transferDirectory%objects/Cafe_1'} %transferDirectory%objects/Café
sanitizeobjectnames_v0.0: ERROR     2021-01-27 23:11:36,850  archivematica.mcp.client.sanitizeObjectNames.apply_file_updates:143  Lookup result 1: %transferDirectory%objects/Cafe_1
sanitizeobjectnames_v0.0: ERROR     2021-01-27 23:11:36,851  archivematica.mcp.client.sanitizeObjectNames.apply_file_updates:142  Dictionary: {u'%transferDirectory%objects/Caf\xe9': u'%transferDirectory%objects/Cafe_1'} %transferDirectory%objects/Café
sanitizeobjectnames_v0.0: ERROR     2021-01-27 23:11:36,852  archivematica.mcp.client.sanitizeObjectNames.apply_file_updates:143  Lookup result 2: %transferDirectory%objects/Cafe_1
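The logged dictionary holds a single key in the precomposed form (`\xe9`), yet both lookups return a result. The Python 3 sketch below shows how one such entry behaves for the two spellings, with and without normalization. It is an illustration under assumptions, not the code in `sanitizeObjectNames`: whether and where any normalization happens in the microservice is not established in this issue, and the helper `lookup_normalized` is hypothetical.

```python
import unicodedata

PREFIX = "%transferDirectory%objects/"

# Single entry, keyed by the precomposed form, as in the logged dictionary.
table = {PREFIX + "Caf\u00e9": PREFIX + "Cafe_1"}

decomposed = PREFIX + "Cafe\u0301"   # 'e' + COMBINING ACUTE ACCENT
precomposed = PREFIX + "Caf\u00e9"   # LATIN SMALL LETTER E WITH ACUTE


def lookup_normalized(table, original):
    """Hypothetical lookup that NFC-normalizes the key before searching."""
    return table.get(unicodedata.normalize("NFC", original))


# A raw lookup only finds the precomposed spelling.
print(table.get(decomposed))
print(table.get(precomposed))

# A normalized lookup maps both spellings to the same sanitized name,
# so two distinct files would receive one current location.
print(lookup_normalized(table, decomposed))
print(lookup_normalized(table, precomposed))
```

Either behaviour is a problem for this transfer: the raw lookup leaves one file without an entry, and the normalized lookup gives both files the same new name.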

For Artefactual use:

Before you close this issue, you must check off the following:

sromkey commented 3 years ago

:exploding_head: