LibraryOfCongress / bagit-python

Work with BagIt packages from Python.
http://libraryofcongress.github.io/bagit-python

Bag Validation Issues #12

Closed veryaustin closed 10 years ago

veryaustin commented 10 years ago

We are using BagIt on drives that contain a variety of file types, mainly broadcast wave files and accompanying digital audio workstation files (Pro Tools, Nuendo, Logic, Digital Performer). I have run into an issue where Pro Tools plug-in settings files titled "Icon" are either not written to the manifest and throw a validation error, or are reported as being in the manifest but not on the drive when running the bagit validate command. All files show up in the terminal via the ls -la command, and I have verified that all permissions are correct.

Below is the output from the bag creation and validation on a set of audio files and digital audio workstation files.

workstation-a:BNA_1017971 administrator$ bagit.py --contact-name 'Test Author' --processes 2 Cleaned\ Up\ Masters/
2014-02-06 16:12:52,878 - INFO - creating bag for directory Cleaned Up Masters/
2014-02-06 16:12:53,994 - INFO - creating data dir
2014-02-06 16:12:53,994 - INFO - moving .DS_Store to /Volumes/BNA_1017971/Cleaned Up Masters/tmpYIWE2W/.DS_Store
2014-02-06 16:12:53,994 - INFO - moving Song 1 to /Volumes/BNA_1017971/Cleaned Up Masters/tmpYIWE2W/Song 1
2014-02-06 16:12:53,995 - INFO - moving Song 2 to /Volumes/BNA_1017971/Cleaned Up Masters/tmpYIWE2W/Song 2
2014-02-06 16:12:53,995 - INFO - moving Song 3 to /Volumes/BNA_1017971/Cleaned Up Masters/tmpYIWE2W/Song 3
2014-02-06 16:12:53,995 - INFO - writing manifest-md5.txt
2014-02-06 16:12:53,995 - INFO - writing manifest with 2 processes
2014-02-06 16:18:56,256 - INFO - writing bagit.txt
2014-02-06 16:18:56,257 - INFO - writing bag-info.txt

workstation-a:BNA_1017971 administrator$ bagit.py --validate Cleaned\ Up\ Masters/
2014-02-06 16:27:52,810 - WARNING - data/Song 3/Plug-In Settings/ChannelStrip/Full Mix Settings/Icon exists in manifest but not found on filesystem
2014-02-06 16:27:52,810 - WARNING - data/Song 3/Plug-In Settings/ChannelStrip/Vocals/Icon exists in manifest but not found on filesystem
2014-02-06 16:27:52,810 - WARNING - data/Song 3/Plug-In Settings/ChannelStrip/De-Essers/Icon exists in manifest but not found on filesystem
2014-02-06 16:27:52,810 - WARNING - data/Song 3/Plug-In Settings/ChannelStrip/Drums/Icon exists in manifest but not found on filesystem
2014-02-06 16:27:52,810 - WARNING - data/Song 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settings/Purple MC77/Icon exists in manifest but not found on filesystem
2014-02-06 16:27:52,810 - WARNING - data/Song 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settings/bombfactory BF2A/Icon exists in manifest but not found on filesystem
2014-02-06 16:27:52,810 - WARNING - data/Song 3/Plug-In Settings/ChannelStrip/Guitars/Icon exists in manifest but not found on filesystem
2014-02-06 16:27:52,810 - WARNING - data/Song 3/Plug-In Settings/ChannelStrip/Compressors/Icon exists in manifest but not found on filesystem
 exists on filesystem but is not in manifestg 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settings/Purple MC77/Icon
 exists on filesystem but is not in manifestg 3/Plug-In Settings/ChannelStrip/De-Essers/Icon
 exists on filesystem but is not in manifestg 3/Plug-In Settings/ChannelStrip/Compressors/Icon
 exists on filesystem but is not in manifestg 3/Plug-In Settings/ChannelStrip/Guitars/Icon
 exists on filesystem but is not in manifestg 3/Plug-In Settings/ChannelStrip/Full Mix Settings/Icon
 exists on filesystem but is not in manifestg 3/Plug-In Settings/ChannelStrip/Drums/Icon
 exists on filesystem but is not in manifestg 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settings/bombfactory BF2A/Icon
 exists on filesystem but is not in manifestg 3/Plug-In Settings/ChannelStrip/Vocals/Icon
2014-02-06 16:44:17,422 - INFO - Cleaned Up Masters/ is invalid: invalid bag: data/Song 3/Plug-In Settings/ChannelStrip/Full Mix Settings/Icon exists in manifest but not found on filesystem ; data/Song 3/Plug-In Settings/ChannelStrip/Vocals/Icon exists in manifest but not found on filesystem ; data/Song 3/Plug-In Settings/ChannelStrip/De-Essers/Icon exists in manifest but not found on filesystem ; data/Song 3/Plug-In Settings/ChannelStrip/Drums/Icon exists in manifest but not found on filesystem ; data/Song 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settings/Purple MC77/Icon exists in manifest but not found on filesystem ; data/Song 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settings/bombfactory BF2A/Icon exists in manifest but not found on filesystem ; data/Song 3/Plug-In Settings/ChannelStrip/Guitars/Icon exists in manifest but not found on filesystem ; data/Song 3/Plug-In Settings/ChannelStrip/Compressors/Icon exists in manifest but not found on filesystem ; data/Song 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settin exists on filesystem but is not in manifest ; data/Song 3/Plug-In Settings/ChannelStrip/Vocals/IconSettings/bombfactory BF2A/Icon

I copied the "Icon" files out of each of the three songs and put them into "Test Files" directory and ran the bagit create and validate commands. Below is the output:

workstation-a:BNA_1017971 administrator$ bagit.py --contact-name 'Test Author' --processes 2 Test\ Files/
2014-02-06 16:11:42,977 - INFO - creating bag for directory Test Files/
2014-02-06 16:11:42,978 - INFO - creating data dir
2014-02-06 16:11:43,008 - INFO - moving .DS_Store to /Volumes/BNA_1017971/Test Files/tmpOG1qwh/.DS_Store
2014-02-06 16:11:43,009 - INFO - moving 1 to /Volumes/BNA_1017971/Test Files/tmpOG1qwh/1
2014-02-06 16:11:43,009 - INFO - moving 2 to /Volumes/BNA_1017971/Test Files/tmpOG1qwh/2
2014-02-06 16:11:43,009 - INFO - moving 3 to /Volumes/BNA_1017971/Test Files/tmpOG1qwh/3
2014-02-06 16:11:43,009 - INFO - writing manifest-md5.txt
2014-02-06 16:11:43,010 - INFO - writing manifest with 2 processes
2014-02-06 16:11:43,207 - INFO - writing bagit.txt
2014-02-06 16:11:43,207 - INFO - writing bag-info.txt

workstation-a:BNA_1017971 administrator$ bagit.py --validate Test\ Files/
2014-02-06 16:11:51,514 - WARNING - data/3/Icon exists in manifest but not found on filesystem
2014-02-06 16:11:51,514 - WARNING - data/1/Icon exists in manifest but not found on filesystem
 exists on filesystem but is not in manifestcon
 exists on filesystem but is not in manifestcon
 exists on filesystem but is not in manifest ; data/3/Iconnvalid bag: data/3/Icon exists in manifest but not found on filesystem ; data/1/Icon exists in manifest but not found on filesystem ; data/1/Icon

Unfortunately, I cannot include any of the specific sessions and audio files listed in the first example, but I can provide example "Icon" files for testing. They can be downloaded at the following link:

https://bmschace.box.com/bagittestfiles

Any help with this issue would be greatly appreciated.

Thanks!
Austin Lauritsen
Director of IT, BMS/Chace

edsu commented 10 years ago

I apologize for the delay in getting back to you. Let me see if I'm understanding your situation properly. Are you:

  1. bagging a directory
  2. opening some files in that bag with ProTools
  3. ProTools creates one or more Icon files in the bag payload (data directory)
  4. validating the bag

Section 3.4 in the spec states that:

Every payload file MUST be listed in at least one manifest. Payload files MAY be listed in more than one payload manifest.

So validation fails if there is an Icon file present in the bag that isn't listed in a manifest. Does that help? Perhaps you should configure your software not to create Icon files, or to create them at another location. Alternatively, you could run a command to delete them before validating. Lastly, I suppose you could move bagging to the stage after you have opened files with ProTools so the Icon files become part of the manifest. But if you subsequently open more files, which modifies or creates additional Icon files, you will still get a validation error.

But all of these options are out of scope for the bagit-python software. Validation is working as intended.
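For readers following along: the two warnings seen in this thread ("exists in manifest but not found on filesystem" and "exists on filesystem but is not in manifest") amount to a two-way set comparison between manifest entries and payload files. Below is a minimal Python sketch of that idea. It is not bagit-python's actual code; `payload_files` and `compare` are hypothetical names, and checksums are ignored entirely.

```python
import os


def payload_files(bag_dir):
    """Return bag-relative paths (with '/' separators) of every file under data/."""
    found = set()
    data_dir = os.path.join(bag_dir, "data")
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), bag_dir)
            found.add(rel.replace(os.sep, "/"))
    return found


def compare(manifest_paths, bag_dir):
    """Return (only_in_manifest, only_on_disk) as two sorted lists."""
    listed = set(manifest_paths)
    on_disk = payload_files(bag_dir)
    return sorted(listed - on_disk), sorted(on_disk - listed)
```

A bag passes this particular check only when both returned lists are empty. Any difference between the path string in the manifest and the name on disk, even a single invisible character, puts the same file in both lists.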

veryaustin commented 10 years ago

Thanks for the response. To clarify our workflow is as follows:

  1. All audio content on the drive is created by a third party. After the audio content is created, the drive is delivered to us (I will refer to this drive from now on as the "Original" drive). We cannot delete files on the drive, nor is the third party able to tell the software not to create "Icon" files. The "Icon" file is a data file that the DAW uses to store plug-in settings; it is not an image or .ico type file.
  2. We run the bagit bag creation command on the Original drive that is delivered to us.
  3. We copy the contents of the Original drive to a new drive. This includes all of the bagit files generated by the bagit creation command.
  4. Run the bagit validation command on the new drive to verify we have an exact copy of the Original drive. We are having issues with these "Icon" files causing the validation command to fail. Some of the Icon files are in the manifest but for some reason the validation command isn't seeing them on the disk (even though they are there and are viewable via command line) OR the "Icon" files are found on the disk but not listed in the manifest.

Initially I thought the reason the validation failed was because the copy from the original drive to the new drive may have not been an exact copy. I then ran the validation command on the Original drive and it too failed, returning the errors listed in the original post on this thread. As you can see in the validation errors, it says "Icon exists in manifest but not found on filesystem" OR "exists on filesystem but is not in manifestg 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settings/Purple MC77/Icon"

edsu commented 10 years ago

Ok, I think I understand better. Are you able to check out the github project and run the tests to make sure those at least work, as a baseline?

Also, I'm curious what version of bagit-python you are using. I can't find the output of line number 114 in the log output you pasted above.

veryaustin commented 10 years ago

I'm running these tests on a fresh install of Mac OS 10.8.5 with Python 2.7.6. I was able to check out the most recent code on github and do a build & install. Additionally, I was able to run test.py and can verify that the tests returned "OK".

If you would like to download and test some sample files that are causing errors, you can download these at https://bmschace.box.com/bagittestfiles.

I ran the bag create and validate command. Below is the output:

test-bench-a:BNA_1017971 administrator$ bagit.py --contact-name 'Test Author' --processes 2 Test\ Files/
2014-03-19 13:36:03,127 - INFO - creating bag for directory /Volumes/BNA_1017971/Test Files
2014-03-19 13:36:03,128 - INFO - creating data dir
2014-03-19 13:36:03,128 - INFO - moving 1 to /Volumes/BNA_1017971/Test Files/tmpYP9s_K/1
2014-03-19 13:36:03,128 - INFO - moving 2 to /Volumes/BNA_1017971/Test Files/tmpYP9s_K/2
2014-03-19 13:36:03,128 - INFO - moving 3 to /Volumes/BNA_1017971/Test Files/tmpYP9s_K/3
2014-03-19 13:36:03,129 - INFO - moving /Volumes/BNA_1017971/Test Files/tmpYP9s_K to data
2014-03-19 13:36:03,129 - INFO - writing manifest-md5.txt
2014-03-19 13:36:03,129 - INFO - writing manifest with 2 processes
2014-03-19 13:36:03,244 - INFO - writing bagit.txt
2014-03-19 13:36:03,245 - INFO - writing bag-info.txt
test-bench-a:BNA_1017971 administrator$ bagit.py --validate Test\ Files/
2014-03-19 13:36:14,015 - WARNING - data/3/Icon exists in manifest but not found on filesystem
2014-03-19 13:36:14,015 - WARNING - data/1/Icon exists in manifest but not found on filesystem
 exists on filesystem but is not in manifestcon
 exists on filesystem but is not in manifestcon
2014-03-19 13:36:14,016 - INFO - Test Files/ is invalid: invalid bag: data/3/Icon exists in manifest but not found on filesystem ; data/1/Icon exis exists on filesystem but is not in manifest ; data/3/Icon
test-bench-a:BNA_1017971 administrator$

edsu commented 10 years ago

Note this line in your latest output:

2014-03-19 13:36:03,129 - INFO - moving /Volumes/BNA_1017971/Test Files/tmpYP9s_K to data

I don't know why an equivalent line doesn't show up in your first paste above. Or is it there and I'm just missing it? Thanks for the test files, I'll give them a try!

veryaustin commented 10 years ago

The first post was run in Feb and looks to be from a different version, as now I get the "tagmanifest-md5.txt" file, which I didn't in the original post. The only thing similar to the line you are referring to in the original post is the following. Like you, I noticed it doesn't say "to data" at the end of the line:

2014-02-06 16:12:53,994 - INFO - moving .DS_Store to /Volumes/BNA_1017971/Cleaned Up Masters/tmpYIWE2W/.DS_Store
2014-02-06 16:12:53,994 - INFO - moving Song 1 to /Volumes/BNA_1017971/Cleaned Up Masters/tmpYIWE2W/Song 1
2014-02-06 16:12:53,995 - INFO - moving Song 2 to /Volumes/BNA_1017971/Cleaned Up Masters/tmpYIWE2W/Song 2
2014-02-06 16:12:53,995 - INFO - moving Song 3 to /Volumes/BNA_1017971/Cleaned Up Masters/tmpYIWE2W/Song 3

In the original post, the same is true for the "Test Files"

2014-02-06 16:11:43,008 - INFO - moving .DS_Store to /Volumes/BNA_1017971/Test Files/tmpOG1qwh/.DS_Store
2014-02-06 16:11:43,009 - INFO - moving 1 to /Volumes/BNA_1017971/Test Files/tmpOG1qwh/1
2014-02-06 16:11:43,009 - INFO - moving 2 to /Volumes/BNA_1017971/Test Files/tmpOG1qwh/2
2014-02-06 16:11:43,009 - INFO - moving 3 to /Volumes/BNA_1017971/Test Files/tmpOG1qwh/3

Either way, I'm on a fresh testing machine and have everything on a baseline with what is currently in the github repo and am still having the same issues. Again, let me know how I can help and I appreciate your assistance with this.

edsu commented 10 years ago

Ok, so here's what I see when I validate the bag in your Test Files.zip:

.tfx exists in manifest but not found on filesystem
2014-03-19 15:06:50,324 - WARNING - data/2/Icon.tfx exists on filesystem but is not in manifest
.tfx exists in manifest but not found on filesystem ; data/2/Icon.tfx exists on filesystem but is not in manifest

And looking at the manifest I see the problem! The paths seem to have embedded carriage returns in them: ascii 0x0d bytes. I only noticed because I opened the manifest up in my text editor:

[Screenshot, 2014-03-19 at 3:12:47 pm]

So, I will add a unit test to make sure these are getting properly encoded. The spec states that they should be URL encoded.

edsu commented 10 years ago

Now I'm confused again. I edited the manifest to remove the 3 embedded carriage returns and now the bag validates. The filenames you packaged up in that zip do not have embedded carriage returns in them. Can you verify that the files you have do have embedded carriage returns in them? Or perhaps you corrupted your manifest somehow?
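One way to answer that question without relying on how a terminal or file browser renders the names is to walk the directory and print each suspicious name with `repr()`, which makes control characters visible. This is only a sketch; `names_with_control_chars` is a hypothetical helper, not part of bagit-python.

```python
import os


def names_with_control_chars(top):
    """Return (path, repr(name)) pairs for entries whose name contains CR or LF."""
    hits = []
    for root, dirs, files in os.walk(top):
        for name in dirs + files:
            if "\r" in name or "\n" in name:
                # repr() shows the escape sequence instead of the raw byte
                hits.append((os.path.join(root, name), repr(name)))
    return hits
```

Calling `names_with_control_chars("Test Files")` and printing the second element of each pair would show something like `'Icon\r'` for an affected file, whereas `ls` output alone can hide the trailing character.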

veryaustin commented 10 years ago

The manifest goes with those files that are in the zip. To make sure nothing was corrupted, I removed all of the bagit generated files, re-ran the creation command on these files, and got the same errors. I opened the newly generated manifest-md5.txt in vim and got the same results as what you posted above. I'm getting the exact same results & errors you are with the same files.

[Screenshot, 2014-03-19 at 2:38:57 pm]

edsu commented 10 years ago

Awesome, so your Icon file names really do have carriage returns in them. Live and learn :smile:

This causes a problem for the manifest since lines in there can be terminated with carriage returns. Interestingly it looks like this is a gap in the BagIt specification.

For now I'll work on a fix to percent encode the carriage returns in the manifest filenames. I'll let you know when there is something for you to try.
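To see why a trailing carriage return is a problem for a line-oriented manifest, consider a naive parser that accepts CR, LF, or CRLF as line endings. The sketch below is an illustration only (it is not how bagit-python reads manifests, and the checksum is a placeholder); `parse_manifest` is a hypothetical name.

```python
def parse_manifest(text):
    """Parse 'checksum  path' lines; str.splitlines() treats CR, LF and CRLF as line breaks."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        checksum, path = line.split(None, 1)
        entries.append((checksum, path))
    return entries


# The file on disk is named "Icon" followed by a carriage return.
name_on_disk = "data/1/Icon\r"
placeholder_checksum = "0" * 32

# Written out unencoded, the CR in the name sits right before the newline,
# so it is indistinguishable from a CRLF line ending.
manifest_text = placeholder_checksum + "  " + name_on_disk + "\n"

parsed_path = parse_manifest(manifest_text)[0][1]
print(repr(parsed_path), repr(name_on_disk), parsed_path == name_on_disk)
```

The parsed path comes back without the carriage return, so it no longer matches the name on disk. That single mismatch is enough to produce both warnings at once: the manifest entry looks missing from the filesystem, and the real file looks absent from the manifest.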

edsu commented 10 years ago

A bit of a historical aside: BagIt was largely conceived of as a set of conventions built around what the tool md5deep does. Interestingly, md5deep seems to strip the carriage return before putting the filename into the manifest. This is ok as long as it doesn't result in a collision with another file. For example, if you have a directory named data that contains two files whose names differ only by a carriage return, then

md5deep -lr data generates a manifest like this:

401b30e3b8b5d629635a5c613cdb7919  data/foo
acbd18db4cc2f85cedef654fccc4a4d8  data/foo

So the question then would be, which checksum goes with which filename?

I think this is an argument for percent encoding the carriage returns, instead of stripping them. It could be argued that BagIt tools should refuse to bag a directory if the payload has filenames with carriage returns. IMHO this would be somewhat against the spirit of BagIt, which has always been to serve as a low barrier way of packaging up a directory (or folder) that contains files, without having to modify them in any way.
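The collision described above can be checked for ahead of time. The following sketch groups names by what they would look like after stripping CR and LF and reports any group with more than one member. It assumes stripping simply removes those two characters, which is a simplification and not a description of md5deep's actual behavior; `collisions_after_stripping` is a hypothetical name.

```python
from collections import defaultdict


def collisions_after_stripping(names):
    """Map each stripped name to the original names that collapse onto it (collisions only)."""
    groups = defaultdict(list)
    for name in names:
        stripped = name.replace("\r", "").replace("\n", "")
        groups[stripped].append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}


print(collisions_after_stripping(["data/foo", "data/foo\r", "data/bar"]))
```

With the two-file example from this comment, both names collapse onto `data/foo`, which is exactly the ambiguity about which checksum belongs to which file.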

veryaustin commented 10 years ago

Very interesting insight:) Thanks for looking into this. I look forward to seeing & testing the fix.

edsu commented 10 years ago

Let's leave this open until there's a fix. It should come shortly -- sorry for the delay.

edsu commented 10 years ago

@veryaustin I just uploaded the latest bagit-python (which includes this fix) to PyPI as v1.3.6. I'm sorry this issue took so long to figure out and address. I'm going to be updating the BagIt specification to mention that \r and \n need to be percent encoded in manifest file names. But for the meantime it would be good to try it out here in bagit-python to see if there are any hidden gotchas that the unit tests didn't tease out.
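For anyone curious what percent encoding \r and \n in manifest file names looks like in practice, here is a minimal round-trip sketch. It is an illustration under assumptions, not the code shipped in v1.3.6: the function names are hypothetical, and it only handles the two characters mentioned above (a name that already contains the literal text `%0D` or `%0A` would not survive the round trip).

```python
def encode_filename(name):
    """Replace CR and LF with their percent-encoded forms."""
    return name.replace("\r", "%0D").replace("\n", "%0A")


def decode_filename(encoded):
    """Reverse encode_filename for the two sequences it produces."""
    return encoded.replace("%0D", "\r").replace("%0A", "\n")


def manifest_line(checksum, path):
    """Format one manifest entry; the only newline is the line terminator."""
    return "{}  {}\n".format(checksum, encode_filename(path))


line = manifest_line("0" * 32, "data/1/Icon\r")  # placeholder checksum
print(repr(line))
```

Because the encoded line contains no raw carriage return, a reader can split the manifest on line endings first and decode each path afterwards, recovering the exact name on disk.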