archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0

Problem: Browsers are pretty terrible at rendering large XML files and we have some gnarly METS out there #1426

Open ross-spencer opened 3 years ago

ross-spencer commented 3 years ago

Please describe the problem you'd like to be solved

Archivematica-style METS files do not take long to explode in size. If you create a package of, say, 4000 images that all run through FITS/ExifTool/JHOVE, then by the time you get an AIP the METS output is likely to be beyond 1 million lines and over 100MB in size.

Folks want to be able to access the METS easily and without their browser crashing when it is requested.

There are two areas where I'd like to see things improved:

  1. AIP review page loads METS into a window by default.
  2. Archival storage download provides little information about the METS (file-size, number of lines etc.)

Describe the solution you'd like to see implemented

1. AIP review page loads METS into a window by default.


The JavaScript for the AIP review mechanism is here. It opens METS in a new window by default. An alternative might be to download by default.

diff --git a/src/dashboard/src/media/js/ingest/aip_browser.js b/src/dashboard/src/media/js/ingest/aip_browser.js
index 9b5f0b78e..f6069e044 100644
--- a/src/dashboard/src/media/js/ingest/aip_browser.js
+++ b/src/dashboard/src/media/js/ingest/aip_browser.js
@@ -17,6 +17,23 @@ You should have received a copy of the GNU General Public License
 along with Archivematica.  If not, see <http://www.gnu.org/licenses/>.
 */

+function download(filename, text) {
+  var element = document.createElement('a');
+  element.setAttribute('href', text);
+  element.setAttribute('download', filename);
+
+  element.style.display = 'none';
+  document.body.appendChild(element);
+
+  element.click();
+
+  document.body.removeChild(element);
+}
+
+function fileName(path) {
+  return path.split('/').pop();
+}
+
 function setupAIPBrowser(directory) {

   var explorer = new FileExplorer({
@@ -25,10 +42,8 @@ function setupAIPBrowser(directory) {
     entryTemplate: $('#template-dir-entry').html(),
     nameClickHandler: function(result) {
       if (result.type != 'directory') {
-        window.open(
-          '/filesystem/download_fs/?filepath=' + encodeURIComponent(result.path),
-          '_blank'
-        );
+        var fileLoc = '/filesystem/download_fs/?filepath=' + encodeURIComponent(result.path);
+        download(fileName(atob(result.path)), fileLoc);
       }
     },
     disableDragAndDrop: true

Other mechanisms might work better, and other implementations likely exist. Right now, for example, the widget disables the option to right-click and download, which makes it hard for users to decide for themselves what to do with the file.
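One such mechanism, offered as an assumption rather than something checked against the dashboard code: when a response carries a `Content-Disposition: attachment` header, browsers save the file instead of trying to render it. A minimal sketch of building that header follows. The helper name `attachment_headers` is hypothetical and not part of Archivematica.

```python
from urllib.parse import quote


def attachment_headers(filename):
    """Build headers asking the browser to save a file rather than render it.

    Hypothetical helper for illustration only; not part of Archivematica.
    """
    # ASCII fallback name for older clients, with quotes and backslashes removed.
    fallback = filename.encode("ascii", "replace").decode("ascii")
    fallback = fallback.replace('"', "").replace("\\", "")
    # The filename* parameter carries the percent-encoded UTF-8 name.
    encoded = quote(filename, safe="")
    disposition = 'attachment; filename="%s"; filename*=UTF-8\'\'%s' % (
        fallback,
        encoded,
    )
    return {
        "Content-Type": "application/octet-stream",
        "Content-Disposition": disposition,
    }
```

Whether the existing download endpoint can be changed to send these headers is an open question. The sketch only shows the shape of the header.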

2. Archival storage download provides little information about the METS (file-size, number of lines etc.)

I am not sure what options we have for the archival storage tab. The file should download automatically in Chrome.

In Firefox, users should check their preferences and make sure that the way Firefox handles XML is configured to "Always ask" (and be careful not to select "Open in Firefox" or other options such as "Use <insert XML editor here>"). The browser's ability to deliver hundreds of MB of file is far greater than its ability to render XML efficiently for you.


I also wondered if it benefits users to have access to information about the METS file itself, e.g. file size and number of lines.
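As a rough sketch of how cheaply that information can be gathered, the function below reports the size and line count of a file by reading it in chunks, so even a very large METS file is never held in memory at once. The name `describe_file` is hypothetical, and how the result would be surfaced in the archival storage tab is left open.

```python
import os


def describe_file(path, chunk_size=1024 * 1024):
    """Return (size_in_bytes, line_count) for the file at path.

    Lines are counted as newline bytes, so a final line without a trailing
    newline is not included in the count.
    """
    size = os.path.getsize(path)
    lines = 0
    with open(path, "rb") as handle:
        # Read fixed-size chunks instead of loading the whole file.
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            lines += chunk.count(b"\n")
    return size, lines
```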

Describe alternatives you've considered

Some alternatives are considered above. There may be plenty of others.

Additional context

Related to issues such as https://github.com/archivematica/Issues/issues/1203 where METS files do get too large. There are related conversations about minimizing or reducing that output.

METS is not so much the culprit here. You can create a large XML file very easily. Take the output of this script and pipe it into a file with an xml extension. You will have around 2 million lines (two per loop iteration), with a little, but not much, complexity, and ~100MB of data.

print("<?xml version='1.0' encoding='UTF-8'?>")
print("<book>")
for x in range(1000000):
    print("<summary>this is a summary line that is fairly long</summary>")
    print("<chapter><page>this is a nested line</page></chapter>")
print("</book>")

Then try opening it in one of the two popular browsers:

$ google-chrome <myfile>.xml
$ firefox <myfile>.xml

My bet is that it won't load, or it will take longer than you're willing to wait. My machine crashed four times this morning working on this.

If you reduce the loop in the script to 100000 iterations, you'll get ~200000 lines. This starts to take upwards of 4 minutes to render in either browser.


It also starts to take up a huge amount of memory: 2GiB.
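To make the two test sizes easier to reproduce, here is a sketch of the same script with the iteration count as a parameter, writing straight to a file and reporting what it wrote. The function names `book_lines` and `write_book` are hypothetical. The XML content is the same as in the script above.

```python
def book_lines(iterations):
    """Yield the lines of the synthetic XML document, two per iteration."""
    yield "<?xml version='1.0' encoding='UTF-8'?>"
    yield "<book>"
    for _ in range(iterations):
        yield "<summary>this is a summary line that is fairly long</summary>"
        yield "<chapter><page>this is a nested line</page></chapter>"
    yield "</book>"


def write_book(path, iterations):
    """Write the document to path and return (line_count, size_in_bytes)."""
    count = 0
    size = 0
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        for line in book_lines(iterations):
            out.write(line + "\n")
            count += 1
            # The content is plain ASCII, so characters and bytes are equal.
            size += len(line) + 1
    return count, size
```

Calling `write_book("small.xml", 100000)` writes 200003 lines (two per iteration plus the declaration and the opening and closing `book` tags).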


I have struggled to find relevant, recent browser issues discussing the performance of XML rendering. This ticket in Firefox's Bugzilla treats files of a few MB as large and starts to describe the rate of slowdown: https://bugzilla.mozilla.org/show_bug.cgi?id=197956. Notably, it is from a while back, and its idea of a large XML file is somewhat different from our own.

The issue seems to be largely related to the browser's ability to load the XML into a DOM and provide some basic ways of interacting with it, as well as limited syntax highlighting.
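For contrast with building a full DOM, a streaming parse can walk a very large XML file while discarding each subtree once it has been read. The sketch below counts element names with Python's standard library `xml.etree.ElementTree.iterparse`. It illustrates the general technique only and is not how any browser works. The function name `count_elements` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(path):
    """Count element names in an XML file without keeping the whole tree."""
    counts = Counter()
    depth = 0
    root = None
    for event, element in ET.iterparse(path, events=("start", "end")):
        if event == "start":
            if root is None:
                # The first start event belongs to the root element.
                root = element
            depth += 1
        else:
            depth -= 1
            counts[element.tag] += 1
            if depth == 1:
                # A direct child of the root is complete: drop it from the tree.
                root.clear()
    return counts
```

For namespaced documents such as METS, the keys come back in the `{namespace}name` form that ElementTree uses for tags.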


For Artefactual use:

Before you close this issue, you must check off the following:

ross-spencer commented 3 years ago

The client notes that the inability to load some of our larger XML files is a change in behavior when interacting with Archivematica. This is important to reflect on.

This could be a change in Archivematica or its dependencies and configuration settings affecting how the files are delivered from the storage service.

Prompted by this being a change in behavior I also investigated Firefox more.

I have stepped back through a few versions of Firefox and believe this may be a change in behavior introduced between Firefox 81.0.2 (October 13 2020) and 82 (October 20 2020), which still exists today. I have logged a defect report with Mozilla here: https://bugzilla.mozilla.org/show_bug.cgi?id=1695530

The sample XML causing a problem for me is: 2021-02-28-firefox-xml-test.zip

Firefox release notes: https://www.mozilla.org/en-US/firefox/releases/ Legacy releases: https://ftp.mozilla.org/pub/firefox/releases/

To test for this defect:

  1. Download a release from the legacy releases archive.
  2. Unpack.
  3. Disable internet (important as it prevents Firefox automatically updating).
  4. Load XML.
  5. Monitor the load time.

Firefox source code and possibly related area of the code-base (XML): https://searchfox.org/mozilla-central/source/dom/xml

ross-spencer commented 3 years ago

Mozregression is recommended by the Mozilla folks. It accepts anything from release numbers through to changeset IDs or commits as parameters, and it uses a heuristic to download the versions in between and identify the specific change causing a regression. The utility and its docs can be found here: https://mozilla.github.io/mozregression/
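As a rough illustration of the idea of narrowing down a regression between a known-good and a known-bad version (a generic bisection sketch, not mozregression's actual code or interface), assuming an ordered list of versions and a hypothetical `is_bad` check that you supply yourself:

```python
def first_bad(versions, is_bad):
    """Return the first version in an ordered list for which is_bad() is True.

    Assumes versions[0] is known good, versions[-1] is known bad, and that
    once the behavior turns bad it stays bad in every later version.
    """
    low, high = 0, len(versions) - 1  # low is known good, high is known bad
    while high - low > 1:
        middle = (low + high) // 2
        if is_bad(versions[middle]):
            high = middle
        else:
            low = middle
    return versions[high]
```

Each check halves the remaining range, which matters when every check means downloading a build and waiting for a large XML file to load.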

ross-spencer commented 3 years ago

According to the client, a workaround is to download the METS from the server directory /var/archivematica/sharedDirectory/watchedDirectories/storeAIP/ using rsync. However:

Just being able to right-click download the METS file during the “Review AIP” stage without any attempt to render in the browser would be a big improvement.

scollazo commented 3 years ago

Awesome work @ross-spencer !