archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0
Problem: Archivematica hangs on 'characterize and extract metadata' when it encounters a &symbol #38

Open jarrodharvey opened 6 years ago

jarrodharvey commented 6 years ago

Expected behaviour: The Archivematica workflow should run to completion.

Current behaviour: Archivematica hangs at the 'characterize and extract metadata' step.

MCPServer.debug.log contains the following error message:

< field name="Correspondence To - iMIS Number">\r\n \r\n \r\n \r\n 542facba-391c-4236-9f41-509c46f20258\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n D:20160726004730\r\n \r\n \r\n 3;#Form|e2231b15-9433-4e50-9278-a5702bf4dd62\r\n \r\n \r\n Elizabeth Milford\r\n \r\n \r\n Adobe PDF Library 10.0\r\n \r\n \r\n \r\n \r\n\r\n\r\n\n', 'exitCode': 1, 'stdError': '\nTraceback (most recent call last):\n File "/usr/lib/archivematica/MCPClient/clientScripts/characterizeFile.py", line 101, in <module>\n sys.exit(main(file_path, file_uuid, sip_uuid))\n File "/usr/lib/archivematica/MCPClient/clientScripts/characterizeFile.py", line 81, in main\n insertIntoFPCommandOutput(file_uuid, stdout, rule.uuid)\n File "/usr/lib/archivematica/archivematicaCommon/databaseFunctions.py", line 211, in insertIntoFPCommandOutput\n rule_id=ruleUUID)\n File "/usr/share/archivematica/virtualenvs/archivematica-mcp-client/local/lib/python2.7/site-packages/django/db/models/manager.py", line 127, in manager_method\n return getattr(self.get_queryset(), name)(*args, **kwargs)\n File "/usr/share/archivematica/virtualenvs/archivematica-mcp-client/local/lib/python2.7/site-packages/django/db/models/query.py", line 348, in create\n obj.save(force_insert=True, using=self.db)\n File "/usr/share/archivematica/virtualenvs/archivematica-mcp-client/local/lib/python2.7/site-packages/django/db/models/base.py", line 734, in save\n force_update=force_update, update_fields=update_fields)\n File "/usr/share/archivematica/virtualenvs/archivematica-mcp-client/local/lib/python2.7/site-packages/django/db/models/base.py", line 762, in save_base\n updated = self._save_table(raw, cls, force_insert, force_update, using, update_fields)\n File "/usr/share/archivematica/virtualenvs/archivematica-mcp-client/local/lib/python2.7/site-packages/django/db/models/base.py", line 846, in _save_table\n result = self._do_insert(cls._base_manager, using, fields, update_pk, raw)\n File "/usr/share/archivematica/virtualenvs/archivematica-mcp-client/local/lib/python2.7/site-packages/django/db/models/base.py", line 885, in _do_insert\n using=using, raw=raw)\n File "/usr/share/archivematica/virtualenvs/archivematica-mcp-client/local/lib/python2.7/site-packages/django/db/models/manager.py", line 127, in manager_method\n return getattr(self.get_queryset(), name)(*args, **kwargs)\n File "/usr/share/archivematica/virtualenvs/archivematica-mcp-client/local/lib/python2.7/site-packages/django/db/models/query.py", line 920, in _insert\n return query.get_compiler(using=using).execute_sql(return_id)\n File "/usr/share/archivematica/virtualenvs/archivematica-mcp-client/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 974, in execute_sql\n cursor.execute(sql, params)\n File "/usr/share/archivematica/virtualenvs/archivematica-mcp-client/local/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute\n return self.cursor.execute(sql, params)\n File "/usr/share/archivematica/virtualenvs/archivematica-mcp-client/local/lib/python2.7/site-packages/django/db/utils.py", line 98, in __exit__\n six.reraise(dj_exc_type, dj_exc_value, traceback)\n File "/usr/share/archivematica/virtualenvs/archivematica-mcp-client/local/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute\n return self.cursor.execute(sql, params)\n File "/usr/share/archivematica/virtualenvs/archivematica-mcp-client/local/lib/python2.7/site-packages/django/db/backends/mysql/base.py", line 124, in execute\n return self.cursor.execute(query, args)\n File "/usr/share/archivematica/virtualenvs/archivematica-mcp-client/local/lib/python2.7/site-packages/MySQLdb/cursors.py", line 250, in execute\n self.errorhandler(self, exc, value)\n File "/usr/share/archivematica/virtualenvs/archivematica-mcp-client/local/lib/python2.7/site-packages/MySQLdb/connections.py", line 42, in defaulterrorhandler\n raise errorvalue\ndjango.db.utils.OperationalError: (1366, "Incorrect string value: \'\\xEF\\xBC\\x86 Re...\' for column \'content\' at row 1")\n'}

Steps to reproduce: Put a '&' symbol into a PDF file's embedded metadata. A live example from our environment was attached as a screenshot.

Removing the & symbol allows the workflow to run normally.

Your environment (version of Archivematica, OS version, etc): Archivematica version 1.7.1, Ubuntu Xenial.

jhsimpson commented 6 years ago

I attempted to reproduce this issue but could not in my one test. I tried both Archivematica 1.7.1 and a deployment from qa/1.x.

I edited the XMP metadata of a sample PDF file and added an ampersand to the Contributor metadata field (that field was previously blank in my test PDF). In the stdout shown in the task details for the characterize and extract job, I can see the ampersand displayed:

Producer    Adobe PDF Library 9.0
Title   Technology responsiveness for digital preservation: a model
Contributor Nance &amp; friends
Creator N.Y. McGovern
PageLayout  OneColumn
PageCount   306

Perhaps I need to insert the ampersand into a custom metadata field? I am not sure how to do that. @jarrodharvey are you able to share a sample file that reproduces this problem?

jarrodharvey commented 6 years ago

Thank you for helping test this, jhsimpson.

Were you testing using '&' or '＆'? The latter is a different version of the ampersand symbol (I am not sure of the proper terminology here), and it is the one that both of my 1.7.1 testing environments break on in exactly the same way. I should have specified.
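
For reference, the bytes quoted in the error message above (\xEF\xBC\x86) can be decoded to identify the character in question. The following is a minimal sketch in Python 3 using only the standard library; the variable names are illustrative and are not part of Archivematica.

```python
import unicodedata

# Bytes quoted in the MySQL error:
#   "Incorrect string value: '\xEF\xBC\x86 Re...' for column 'content'"
raw = b"\xef\xbc\x86"

# Decode the three bytes as UTF-8 to get a single character.
char = raw.decode("utf-8")

# Look up its code point and official Unicode name.
codepoint = "U+{:04X}".format(ord(char))
name = unicodedata.name(char)
print(codepoint, name)  # U+FF06 FULLWIDTH AMPERSAND

# NFKC compatibility normalization folds the fullwidth form
# to the ordinary ASCII ampersand.
folded = unicodedata.normalize("NFKC", char)
print(repr(folded))  # '&'
```

This only identifies the character as the fullwidth ampersand, which is distinct from the ASCII '&' used in the reproduction attempt above. It does not show why the insert into the `content` column fails; that likely depends on the character set configured for the database, which is not stated in this thread.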