LibraryOfCongress / bagger

The Bagger application packages data files according to the BagIt specification.
Other
120 stars 19 forks source link

Bagger does not correctly escape filenames with newline characters in the manifest #35

Closed nkrabben closed 6 years ago

nkrabben commented 8 years ago

If you create a bag with a file that has a newline character in its filename, the manifest will not represent that characters. For example, "data/Icon\r" is represented as "data/Icon" in the manifest. The bag cannot be validated because the file system does not contain a file at "data/Icon".

A further complication is that this bug can cause filename collisions in the manifest. "data/Icon", "data/Icon\n", and "data/Icon\r" are all represented in the manifest as "data/Icon"

Bagit-python had the same issue. It was fixed by using percent encoding for \r and \n. https://github.com/LibraryOfCongress/bagit-python/issues/12

johnscancella commented 8 years ago

Hi @nkrabben I am curious, what kind of files are you trying to bag that contain \r?

johnscancella commented 8 years ago

Could you also provide a sample file or a set of directions to reproduce the issue?

nkrabben commented 8 years ago

The files are Mac Icon files from a born-digital collection that we're processing.

Apparently, it's fairly common for their file names to end with \r although from the sample we're looking at, it's not consistent. Since it's a problem with the file name and not the content, I've been creating test files through bash. touch "Icon^M"

johnscancella commented 8 years ago

Looks like this is due to the underlying bagit-java library which should be percent encoding the filename and decoding it when verifying. See https://github.com/LibraryOfCongress/bagit-java/issues/51

johnscancella commented 8 years ago

@nkrabben take a look at https://github.com/LibraryOfCongress/bagger/releases/tag/v2.7.1 it should fix your problem.

nkrabben commented 8 years ago

Bagger 2.7.1 is crashing before it finishes loading. I get a few different errors. Loading from the command line I get Error: Could not find or load main class 2. Loading from Finder I get An illegal state occured.

I can still load 2.7.0

johnscancella commented 8 years ago

Sorry, I was impatient and built it without all the dependencies. I deleted the old version on the releases page and uploaded the new file. I also included the MD5 and SHA1 checksum so you can test and make sure you have the right version.

nkrabben commented 8 years ago

New error Could not find or load main class 3

johnscancella commented 8 years ago

Really? I just tried running it and it was fine. Did you delete the old one? Could you add the log files so I can see exactly what happened?

nkrabben commented 8 years ago

Looks like it was my error. Deleted everything and Bagger loaded fine.

While testing the filename handling, I noticed a new edge case. Bagging a folder that contains "Icon\r" works as expected, but adding "Icon\r" directly to the payload gives the following error: Error adding bag file: ~/Desktop/Icon due to: Can't read ~/Desktop/Icon

johnscancella commented 8 years ago

It looks like it is a bug in the JDK rt.jar supplied code. Specifically BasicFileChooserUI's inner class ApproveSelectionAction. The code is repeated here for easy viewing:

if (filename != null) {
    // Remove whitespaces from end of filename
        int i = filename.length() - 1;

        while (i >=0 && filename.charAt(i) <= ' ') {
            i--;
        }

        filename = filename.substring(0, i + 1);
}

This code as the comment suggests removes any whitespace at the end of a file name. In this case that is not the correct behavior. I will try and file a bug with the Java developers and post a link here.

johnscancella commented 8 years ago

Oracle has created a bug based on my report http://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8160540