Open ross-spencer opened 3 years ago
The client notes that the inability to load some of our larger XML is a change in behavior when interacting with Archivematica. This is important to reflect on.
This could be a change in Archivematica or its dependencies and configuration settings affecting how the files are delivered from the storage service.
Prompted by this being a change in behavior I also investigated Firefox more.
I have stepped back through a few versions of Firefox and believe the change in behavior occurred between Firefox 81.0.2 (October 13 2020) and 82 (October 20 2020), and that it still exists today. I have logged a defect report with Mozilla here: https://bugzilla.mozilla.org/show_bug.cgi?id=1695530
The sample XML causing a problem for me is: 2021-02-28-firefox-xml-test.zip
Firefox release notes: https://www.mozilla.org/en-US/firefox/releases/
Legacy releases: https://ftp.mozilla.org/pub/firefox/releases/
To test for this defect:
Firefox source code and possibly related area of the code-base (XML): https://searchfox.org/mozilla-central/source/dom/xml
Mozregression is recommended by the Mozilla peeps. It can take release numbers as parameters through to changeset IDs or commits. It will use a heuristic to download versions in-between to identify specific changes causing a regression. The utility and its docs can be found here: https://mozilla.github.io/mozregression/
From the client: a workaround is to download the METS from the server directory /var/archivematica/sharedDirectory/watchedDirectories/storeAIP/ using rsync. However:
Just being able to right-click download the METS file during the “Review AIP” stage without any attempt to render in the browser would be a big improvement.
Awesome work @ross-spencer !
Please describe the problem you'd like to be solved
Archivematica-style METS files do not take long to explode in size. If you try to create a package of, say, 4000 images that all run through FITS/ExifTool/Jhove, then by the time you get an AIP the METS output will likely be beyond 1 million lines and over 100MB in size.
Folks want to be able to access the METS easily and without their browser crashing when it is requested.
There are two areas where I'd like to see things improved:
Describe the solution you'd like to see implemented
1. AIP review page loads METS into a window by default.
The JavaScript for the AIP review mechanism is here. It opens METS in a new window by default. An alternative might be to download by default.
Other mechanisms might work better and other implementations likely exist. Right now, the widget nullifies the option to right-click download, for example, which makes it hard to control what to do for yourself.
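One way to get "download by default" is for the server to send a Content-Disposition: attachment header, which asks the browser to save the response rather than render it. The sketch below is not Archivematica code; it only illustrates the idea with Python's standard-library HTTP server. The names attachment_header, DownloadXMLHandler and serve are hypothetical, and the rule "any path ending in .xml" is an assumption made for the example.

```python
import posixpath
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import unquote, urlsplit


def attachment_header(request_path):
    """Return a Content-Disposition value for .xml paths, otherwise None."""
    # Drop any query string, decode percent-escapes, keep the last path segment.
    name = posixpath.basename(unquote(urlsplit(request_path).path))
    if name.lower().endswith(".xml"):
        return f'attachment; filename="{name}"'
    return None


class DownloadXMLHandler(SimpleHTTPRequestHandler):
    """Static file handler that marks XML responses as downloads."""

    def end_headers(self):
        value = attachment_header(self.path)
        if value:
            self.send_header("Content-Disposition", value)
        super().end_headers()


def serve(directory=".", port=8000):
    """Serve `directory` on localhost until interrupted (blocks)."""
    handler = partial(DownloadXMLHandler, directory=directory)
    ThreadingHTTPServer(("127.0.0.1", port), handler).serve_forever()
```

Calling serve() from a folder containing a large XML file and requesting that file should prompt a save dialog (or a direct download) instead of an attempt to render it. Whether the same header can be set where the Review AIP link is served has not been checked here.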
2. Archival storage download provides little information about the METS (file-size, number of lines etc.)
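The figures mentioned here (file size, number of lines) are cheap to compute without parsing or rendering the XML. A minimal sketch, assuming the METS file is readable from local disk; the function name summarize_file is hypothetical and nothing like it is known to exist in Archivematica:

```python
import os


def summarize_file(path, chunk_size=1024 * 1024):
    """Return (size_in_bytes, line_count) without loading the file into memory."""
    size = os.path.getsize(path)
    lines = 0
    last_byte = b""
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a file of 100s of MB never sits in memory at once.
        while chunk := handle.read(chunk_size):
            lines += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # Count a final line that has no trailing newline.
    if last_byte and last_byte != b"\n":
        lines += 1
    return size, lines
```

Showing these two numbers next to the download link would let a user decide whether to open the file in a browser or save it and use another tool.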
I am not sure what options we have that we can implement for the archival storage tab. The file should download automatically in Chrome.
In Firefox, users should check their preferences and make sure that how Firefox handles XML is configured to "Always ask" (and be careful not to select "Open in Firefox" or other options such as "Use <insert XML editor here>"). The browser's ability to deliver 100s of MB of file is far greater than its ability to render XML efficiently for you.
I also wondered if it benefits users to have access to information about the METS file itself, e.g.
Describe alternatives you've considered
There are some alternatives considered above. There may be plenty of others.
Additional context
Related to issues such as https://github.com/archivematica/Issues/issues/1203 where METS files do get too large. There are related conversations about minimizing or reducing that output.
METS is not so much the culprit here. You can create a large XML file very easily. Take the output of this script and pipe it into a file with an xml extension: you will have around 4 million lines, with a little, but not much, complexity, and ~100MB of data.
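The script referred to above is not reproduced in this text. As a stand-in, here is a minimal sketch that produces a file of similar shape: a nested loop that writes two lines per inner iteration. The element names, the loop sizes and the function name write_large_xml are all assumptions for illustration, not the original script.

```python
def write_large_xml(out, outer=20, inner=100_000):
    """Write a simple, repetitive XML document to the file-like object `out`.

    Each inner iteration emits two lines, so the document has
    3 + outer * (2 + 2 * inner) lines in total.
    """
    out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    out.write("<root>\n")
    for i in range(outer):
        out.write(f'  <group id="{i}">\n')
        for j in range(inner):
            # Two lines per inner iteration; the content is made-up filler.
            out.write(f"    <name>item-{i}-{j}</name>\n")
            out.write(f"    <value>{i * inner + j}</value>\n")
        out.write("  </group>\n")
    out.write("</root>\n")
```

With the default arguments this writes roughly 4 million lines, on the order of 100 MB. To get a file, open one in text mode and pass it in, or pass sys.stdout and redirect the output to a file with an xml extension.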
Try opening it in one of the two popular browsers:
My bet is that it won't load, or it will take longer than you're willing to wait. My machine crashed four times this morning working on this.
Run python3 -m http.server from the same folder. You'll see the same.
Put the file in /var/archivematica/sharedDirectory/watchedDirectories/storeAIP and try reading it from the Review AIP pane.
Reducing the inner loop of the script to 100000 iterations, you'll get ~200000 lines. This starts to take upwards of 4 minutes to render in either browser, and starts to take up a huge amount of memory: 2GiB.
I have struggled to find relevant modern browser-based issues talking about performance of XML rendering. This ticket in Firefox's Bugzilla treats files of a few MB as large and starts to describe the rate of slowdown: https://bugzilla.mozilla.org/show_bug.cgi?id=197956. Notably, it dates from a while back, and its perception of a large XML file is somewhat different from our own.
The issue seems to be largely related to the browser's ability to load the XML into a DOM and provide some basic ways of interacting with it, as well as limited syntax highlighting.
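The difference between building a full tree and streaming through a file can be illustrated outside the browser. The sketch below uses Python's xml.etree.ElementTree purely as an analogy; it says nothing about how Firefox or Chrome parse XML internally, and the function name count_elements_streaming is hypothetical.

```python
import xml.etree.ElementTree as ET


def count_elements_streaming(source):
    """Count elements in an XML file (path or binary file object) incrementally."""
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        count += 1
        # Discard this element's text, attributes and children once it is complete.
        # The emptied element itself stays attached to its parent.
        elem.clear()
    return count
```

A tool that works this way only needs to hold a small part of the document at a time, whereas a viewer that offers a collapsible, highlighted tree has to keep a node for every element.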
For Artefactual use:
Before you close this issue, you must check off the following: