archivematica / Issues

Issues repository for the Archivematica project

Problem: failure to serialise large METS files #1203

Open · mjaddis opened this issue 4 years ago

mjaddis commented 4 years ago

Expected behaviour

The METS file for an AIP should be created and serialised without errors, even for datasets with a large number of files or a lot of descriptive metadata in dmdSecs, i.e. when the METS is really big.

Current behaviour

An error is thrown when the METS is serialised to disk during the Ingest stage of the workflow. The METS appears to be created correctly in memory, and it also appears to be written OK to disk (the file is complete, not half-written etc.). However, a serialisation error is thrown and the workflow fails.

May  8 20:13:50 ip-172-31-11-6 python[1194]: DmdSecs: 211111
May  8 20:13:50 ip-172-31-11-6 python[1194]: AmdSecs: 200002
May  8 20:13:50 ip-172-31-11-6 python[1194]: TechMDs: 200002
May  8 20:13:50 ip-172-31-11-6 python[1194]: RightsMDs: 0
May  8 20:13:50 ip-172-31-11-6 python[1194]: DigiprovMDs: 1400011
May  8 20:13:50 ip-172-31-11-6 python[1194]: =============== END STDOUT ===============
May  8 20:13:50 ip-172-31-11-6 python[1194]: =============== STDERR ===============
May  8 20:13:50 ip-172-31-11-6 python[1194]: /var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/metadataReminder/docx100k_m1-32c2af37-a9b5-48a0-bf9d-c39320711837/metadata doesn't exist
May  8 20:13:50 ip-172-31-11-6 python[1194]: SerialisationError(u'unknown error -668892922',)
May  8 20:13:50 ip-172-31-11-6 python[1194]: Traceback (most recent call last):
May  8 20:13:50 ip-172-31-11-6 python[1194]:   File "/usr/lib/archivematica/MCPClient/clientScripts/create_mets_v2.py", line 1861, in call
May  8 20:13:50 ip-172-31-11-6 python[1194]:     write_mets(tree, XMLFile)
May  8 20:13:50 ip-172-31-11-6 python[1194]:   File "/usr/lib/archivematica/MCPClient/clientScripts/create_mets_v2.py", line 1481, in write_mets
May  8 20:13:50 ip-172-31-11-6 python[1194]:     tree.write(filename, pretty_print=True, xml_declaration=True, encoding="utf-8")
May  8 20:13:50 ip-172-31-11-6 python[1194]:   File "src/lxml/lxml.etree.pyx", line 2033, in lxml.etree._ElementTree.write (src/lxml/lxml.etree.c:67992)
May  8 20:13:50 ip-172-31-11-6 python[1194]:   File "src/lxml/serializer.pxi", line 526, in lxml.etree._tofilelike (src/lxml/lxml.etree.c:144437)
May  8 20:13:50 ip-172-31-11-6 python[1194]:   File "src/lxml/serializer.pxi", line 195, in lxml.etree._raiseSerialisationError (src/lxml/lxml.etree.c:140294)
May  8 20:13:50 ip-172-31-11-6 python[1194]: SerialisationError: unknown error -668892922
May  8 20:13:50 ip-172-31-11-6 python[1194]: =============== END STDERR ===============

The METS file created is approx 3.4GB in size.

If I don't include any metadata then the same failure occurs but with a different error code:

May 11 12:57:55 ip-172-31-11-6 python[2746]: DmdSecs: 1
May 11 12:57:55 ip-172-31-11-6 python[2746]: AmdSecs: 200001
May 11 12:57:55 ip-172-31-11-6 python[2746]: TechMDs: 200001
May 11 12:57:55 ip-172-31-11-6 python[2746]: RightsMDs: 0
May 11 12:57:55 ip-172-31-11-6 python[2746]: DigiprovMDs: 1400005
May 11 12:57:55 ip-172-31-11-6 python[2746]: =============== END STDOUT ===============
May 11 12:57:55 ip-172-31-11-6 python[2746]: =============== STDERR ===============
May 11 12:57:55 ip-172-31-11-6 python[2746]: /var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/metadataReminder/docx100k_nm1-57a6cef3-161d-4338-b048-ce509394797e/metadata doesn't exist
May 11 12:57:55 ip-172-31-11-6 python[2746]: SerialisationError(u'unknown error -1372478162',)
May 11 12:57:55 ip-172-31-11-6 python[2746]: Traceback (most recent call last):
May 11 12:57:55 ip-172-31-11-6 python[2746]:   File "/usr/lib/archivematica/MCPClient/clientScripts/create_mets_v2.py", line 1861, in call
May 11 12:57:55 ip-172-31-11-6 python[2746]:     write_mets(tree, XMLFile)
May 11 12:57:55 ip-172-31-11-6 python[2746]:   File "/usr/lib/archivematica/MCPClient/clientScripts/create_mets_v2.py", line 1481, in write_mets
May 11 12:57:55 ip-172-31-11-6 python[2746]:     tree.write(filename, pretty_print=True, xml_declaration=True, encoding="utf-8")
May 11 12:57:55 ip-172-31-11-6 python[2746]:   File "src/lxml/lxml.etree.pyx", line 2033, in lxml.etree._ElementTree.write (src/lxml/lxml.etree.c:67992)
May 11 12:57:55 ip-172-31-11-6 python[2746]:   File "src/lxml/serializer.pxi", line 526, in lxml.etree._tofilelike (src/lxml/lxml.etree.c:144437)
May 11 12:57:55 ip-172-31-11-6 python[2746]:   File "src/lxml/serializer.pxi", line 195, in lxml.etree._raiseSerialisationError (src/lxml/lxml.etree.c:140294)
May 11 12:57:55 ip-172-31-11-6 python[2746]: SerialisationError: unknown error -1372478162
May 11 12:57:55 ip-172-31-11-6 python[2746]: =============== END STDERR ===============
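A side observation, and purely speculation on my part: both "unknown error" codes above are negative, which is what a byte count would look like if it overflowed a signed 32-bit integer somewhere in libxml2's serialiser. Re-interpreting the codes as unsigned 32-bit values gives sizes in the same ballpark as the ~3.4GB METS file:

```python
# Speculative: treat the negative "unknown error" codes from the two
# failed runs as overflowed signed 32-bit byte counts, and recover the
# unsigned values they would have wrapped around from.
for code in (-668892922, -1372478162):
    as_unsigned = code + 2**32
    print("%d -> %.2f GB" % (code, as_unsigned / 1e9))
```

This prints roughly 3.63 GB and 2.92 GB, i.e. both "errors" decode to sizes just past the 2GB signed-int boundary, which would fit the theory that the serialisation buffer size is being stored in a 32-bit int.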

However, if I drop the number of files to 30,000 but still include a metadata.csv (only 60,000 rows!) then it works fine. There's something about the size/complexity of the METS that causes problems when there are a large number of files in the Transfer.

Steps to reproduce

I made a Transfer into Archivematica 1.11 on Ubuntu 18.04 that had 100,000 files organised into approx 100,000 directories. I included a metadata.csv that attaches metadata to both the files and the directories; it had 200,000 rows, each with 20 fields. The files were small MS Word docs, and I added FPR capabilities to convert these to PDF/A. I normalised for preservation and turned off thumbnail generation.
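For anyone who wants to reproduce this at a smaller scale first, a rough generator for the transfer tree and metadata.csv is sketched below. The directory layout and dc.* column names here are illustrative placeholders, not the exact Archivematica metadata.csv schema, so adjust to your own conventions:

```python
import csv
import os

def make_transfer(root, n_dirs, files_per_dir, n_fields):
    # Illustrative layout: one row per directory and one per file,
    # each with n_fields metadata columns (column names are made up).
    fieldnames = ["filename"] + ["dc.description.%d" % i for i in range(n_fields)]
    rows = []
    for d in range(n_dirs):
        rel_dir = "objects/dir%05d" % d
        os.makedirs(os.path.join(root, rel_dir), exist_ok=True)
        rows.append([rel_dir] + ["directory metadata"] * n_fields)
        for f in range(files_per_dir):
            rel_file = "%s/doc%05d.docx" % (rel_dir, f)
            with open(os.path.join(root, rel_file), "w") as fh:
                fh.write("placeholder content")
            rows.append([rel_file] + ["file metadata"] * n_fields)
    os.makedirs(os.path.join(root, "metadata"), exist_ok=True)
    with open(os.path.join(root, "metadata", "metadata.csv"), "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(fieldnames)
        writer.writerows(rows)
    return len(rows)
```

Calling `make_transfer(path, 100000, 1, 20)` approximates the failing case (200,000 rows), while smaller arguments reproduce the working 30,000-file scenario.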

Your environment

Archivematica 1.11 on Ubuntu 18.04, run on an EC2 instance with 32 cores and 128GB of memory.

I don't think this is a memory issue because the lxml code should throw a separate error if there is a memory allocation problem. There's also 128GB of memory on the machine and I didn't see any messages in syslog etc. saying memory was a problem. The 30,000 file dataset that did work also goes through on a machine with much less memory.

I'm wondering if it's the number of XML nodes that's the issue, or the use of pretty-printing.
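One workaround that might be worth trying (I haven't verified it against Archivematica itself) is to avoid handing libxml2 the entire tree in a single serialisation call: drop pretty_print=True, or write the root element's children out one at a time so no single buffer has to hold the whole multi-gigabyte document. A minimal sketch of the second idea, using the stdlib ElementTree just to keep it self-contained (the approach is the same for lxml, and attribute-value escaping is omitted for brevity):

```python
import xml.etree.ElementTree as ET

def write_mets_incrementally(root, path):
    # Serialise each top-level section (dmdSec, amdSec, structMap, ...)
    # independently, so each serialisation buffer only holds one
    # section rather than the whole document.
    with open(path, "wb") as f:
        f.write(b'<?xml version="1.0" encoding="utf-8"?>\n')
        attrs = "".join(' %s="%s"' % (k, v) for k, v in root.attrib.items())
        f.write(("<%s%s>" % (root.tag, attrs)).encode("utf-8"))
        for child in root:
            f.write(ET.tostring(child, encoding="unicode").encode("utf-8"))
        f.write(("</%s>" % root.tag).encode("utf-8"))
```

lxml also has etree.xmlfile for true streaming writes, which would be a more thorough fix if the limit really is in the serialisation buffer.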

I also recognise that the problem above is probably right at the extreme of what people are trying to do with Archivematica. The AIP, had it been finished, would have 200,000 files and a METS file with 60M lines of XML.

Overall, I'm posting this issue mainly to log what I found in case others hit the same thing at some point, rather than expecting a bug fix anytime soon.


sromkey commented 4 years ago

@mjaddis I'm curious what you think of the solution being a less verbose METS, versus successfully serialising the METS in the format we currently have? I'm thinking for example of some of the concepts explored here: https://wiki.archivematica.org/PREMIS/METS_for_scalability

mjaddis commented 4 years ago

@sromkey I think those solutions are complementary rather than alternatives. I'm all in favour of reducing METS verbosity, which benefits all Archivematica users irrespective of the size of the Transfers they might have, so it can only be a good thing. I remember discussing many of the things on the wiki page when we were working with you guys on the RDSS project.

In my case there's lots of opportunity for deduplication in the METS, so it would undoubtedly help. For example, there are 1.4 million digiprovMDs in my METS, so even simple things like saying 'all files passed virus checking' and 'all files are fmt/40' once, rather than repeating this 100,000 times, would save millions of lines of XML. But to some extent this only kicks the can down the road: it might get to 500,000 files in a Transfer rather than 50,000, but there would still be a limit, because all the METS is still in a single file and it eventually gets huge.

One of the things I've always thought would be nice is to split the METS apart into lots of smaller files, for example a METS file for each file in the Transfer, plus a top-level METS backbone that ties them all together. If there was a directory tree of METS files that mirrored the directory tree of the files I want preserved, it would hold a hierarchy of METS files matching my content files; each one would be a few MB in size and it would naturally scale. Of course that brings in many other issues, e.g. backwards compatibility with existing AIPs and a load of development work in Archivematica, so it's probably something that can't be done that easily.
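To make the backbone idea concrete: METS already has a mechanism for pointing at external METS documents, the <mets:mptr> element inside a structMap <div>. A rough sketch of building such a backbone, using stdlib ElementTree with the element layout simplified (the div TYPE values and paths are illustrative):

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)

def build_backbone(child_mets_paths):
    # Top-level "backbone" METS: the structMap holds one <div> per
    # item, each pointing at an external per-item METS via <mptr>
    # instead of inlining that item's amdSec/techMD/digiprovMD content.
    root = ET.Element("{%s}mets" % METS_NS)
    struct_map = ET.SubElement(root, "{%s}structMap" % METS_NS, {"TYPE": "physical"})
    top = ET.SubElement(struct_map, "{%s}div" % METS_NS, {"TYPE": "Directory"})
    for path in child_mets_paths:
        div = ET.SubElement(top, "{%s}div" % METS_NS, {"TYPE": "Item"})
        ET.SubElement(div, "{%s}mptr" % METS_NS, {
            "LOCTYPE": "URL",
            "{%s}href" % XLINK_NS: path,
        })
    return root
```

Each per-item METS would then carry its own amdSec and file-level PREMIS, so the backbone stays small no matter how many files the Transfer contains.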