Simple Archive toolchain

bondjimbond commented 5 years ago

I've got a DSpace repository to migrate, and aside from OAI-PMH, the most promising of their export options seems to be the Simple Archive format.

This provides a directory full of subdirectories, each subdirectory representing an object, containing the main object file (PDF or JPG or whatever), metadata, and various other datastreams (or "bitstreams" in DSpace terms).

At minimum, every directory will have a dublin_core.xml file and the main object file.

A sample export is attached.

Basically, the toolchain will need to loop through the subdirectories, and in each subdirectory, do the following:

- Identify and extract the main object file based on extension.
- Extract the dublin_core.xml file and transform it to MODS according to the user's mapping file.
- Spit everything into one batch-ready directory with files renamed as appropriate, and make sure that any duplicates are renamed so that they aren't overwriting each other (e.g. if two different objects are called "file.pdf" then make the next one "file_1.pdf" etc.)
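
The duplicate-renaming step could be sketched roughly as follows. This is only an illustration, not code from the toolchain: the function name `uniqueFilename()` and the `$used` lookup array are hypothetical, and it assumes bare filenames with no directory part.

```php
<?php
/**
 * Returns a filename that has not been handed out yet, appending _1, _2, ...
 * before the extension when the name is already taken.
 * $used (hypothetical helper state) is updated by reference.
 */
function uniqueFilename(string $filename, array &$used): string
{
    $info = pathinfo($filename);
    $base = $info['filename'];
    $ext = isset($info['extension']) ? '.' . $info['extension'] : '';

    $candidate = $filename;
    $counter = 0;
    // Keep incrementing the suffix until we find an unused name.
    while (isset($used[$candidate])) {
        $counter++;
        $candidate = $base . '_' . $counter . $ext;
    }
    $used[$candidate] = true;
    return $candidate;
}

$used = array();
echo uniqueFilename('file.pdf', $used), "\n"; // file.pdf
echo uniqueFilename('file.pdf', $used), "\n"; // file_1.pdf
echo uniqueFilename('file.pdf', $used), "\n"; // file_2.pdf
```

Tracking names in memory keeps the check cheap; a writer could equally test `file_exists()` against the output directory instead.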

partial_export_2019_Jun_04_3_5.zip

bondjimbond commented 4 years ago

Files for testing then... Adjust paths in .ini file as required. Should work with the Simple Archive export sample at the start of this issue. My output is attached as well.

output.zip

Filesystem_Subdirectories.php.fetcher.txt mods.xsl.txt mru_xsl.ini.txt SimpleArchive.php.filegetter.txt SimpleArchive.php.writer.txt Xslt.php.metadataparser.txt

bondjimbond commented 4 years ago

Ran this on my full export, and there were some problem files.

The attached folder can be added to the sample group. Its output ended up giving me 0-IGDA-Curriculum.pdf.txt instead of 0-IGDA-Curriculum.pdf. So something wrong with the filegetter, I think.

214.zip

mjordan commented 4 years ago

Look in the "contents" file for that item. If it doesn't list the PDF as the first item, that would explain it.

bondjimbond commented 4 years ago

Yep, that's the case. So how do we fix that?

My initial attempt used the (new) filetype setting and searched for items with the appropriate extension. But that isn't necessarily the best approach...

Perhaps a better way would be to simply extract everything listed in contents excluding license.txt and license.rdf?

Would Islandora Batch recognize a bundle of PDF, XML, and TXT as OBJ, MODS, and OCR?

mjordan commented 4 years ago

If we know that the rows in the 'contents' file are not in any specific order, my approach would be to parse each line into an associative array and pick out the specific key that we want. I can take a look this evening or tomorrow morning.
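
A minimal sketch of that idea, assuming each manifest line is tab-delimited with the filename first and `key:value` pairs (such as `bundle:ORIGINAL`) after it. The function names `parseManifestLine()` and `filesInBundle()` are hypothetical, and the sample lines are illustrative rather than copied from a real export.

```php
<?php
/**
 * Parses one line of a 'contents' manifest into an associative array.
 * Assumption: tab-delimited fields, filename first, then key:value pairs.
 */
function parseManifestLine(string $line): array
{
    $fields = preg_split('/\t+/', trim($line));
    $entry = array('filename' => array_shift($fields));
    foreach ($fields as $field) {
        // Split on the first colon only, so values may contain colons.
        $parts = explode(':', $field, 2);
        if (count($parts) === 2) {
            $entry[$parts[0]] = $parts[1];
        }
    }
    return $entry;
}

/** Returns the filenames of all manifest entries in the given bundle. */
function filesInBundle(array $lines, string $bundle): array
{
    $files = array();
    foreach ($lines as $line) {
        if (trim($line) === '') {
            continue;
        }
        $entry = parseManifestLine($line);
        if (($entry['bundle'] ?? null) === $bundle) {
            $files[] = $entry['filename'];
        }
    }
    return $files;
}

// Illustrative manifest in which the PDF is not the first row.
$lines = array(
    "license.txt\tbundle:LICENSE",
    "0-IGDA-Curriculum.pdf.txt\tbundle:TEXT\tdescription:Extracted text",
    "0-IGDA-Curriculum.pdf\tbundle:ORIGINAL",
);
print_r(filesInBundle($lines, 'ORIGINAL'));
```

Selecting by key rather than by row position means the order of the rows no longer matters.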

mjordan commented 4 years ago

WRT your question "Would Islandora Batch recognize a bundle of PDF, XML, and TXT as OBJ, MODS, and OCR?", I don't think so. https://github.com/mjordan/islandora_batch_with_derivs does though, and this toolchain's writer could have an option to write out the packages in the format that batch module requires.

bondjimbond commented 4 years ago

> If we know that the rows in the 'contents' file are not in any specific order, my approach would be to parse each line into an associative array and pick out the specific key that we want. I can take a look this evening or tomorrow morning.

For this option, would the filetype parameter come back into play, then? This would provide the associative array the input required to make a decision on which file to extract.

I also got a bunch of items that could only be conceived of as Compound Objects, which have more than one of a given filetype. But these will need a separate version of the toolchain, I think. Simple one first.

bondjimbond commented 4 years ago

@mjordan This tweak seems to solve the problem.

SimpleArchive.php.filegetter.txt

    public function getFilePath($record_key)
    {
        // Debugging output; to be removed during cleanup.
        var_dump($this->settings);
        $objectInfo = $this->fetcher->getItemInfo($record_key);
        $contents_manifest_path = $this->input_directory . DIRECTORY_SEPARATOR . $record_key . DIRECTORY_SEPARATOR . 'contents';
        // Read the manifest, dropping line endings and blank lines.
        $contents_manifest_content = file($contents_manifest_path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
        $contents = array();
        // The filename is the first tab-delimited field on each manifest line.
        foreach ($contents_manifest_content as $manifest_file) {
            $fields = preg_split('/\t/', $manifest_file);
            $contents[] = $fields[0];
        }
        $filetype = $this->filetype;

        // Search the 'contents' entries for a file matching the configured filetype.
        // If several files match, the last one listed wins.
        foreach ($contents as $file) {
            $file_split = explode(".", $file);
            if (end($file_split) == $filetype) {
                $payload_filename = $file;
            }
        }

mjordan commented 4 years ago

If that works on all known variations of the manifest file, excellent. How consistent can we assume that file's structure is?

bondjimbond commented 4 years ago

For single-file objects, I think it's pretty sound. The problem is when you get into multi-file objects... for example:

Games_for_learning_2005.pdf bundle:ORIGINAL
Games_for_learning_2005_slides.pdf  bundle:ORIGINAL
license.txt bundle:LICENSE
Games_for_learning_2005.pdf.txt bundle:TEXT description:Extracted text
Games_for_learning_2005_slides.pdf.txt  bundle:TEXT description:Extracted text

I'm thinking, start with a simple single-file toolchain. Once that's working well, create a more expansive copy that deals with compound objects.

My absolute ideal would be one toolchain that covers the whole Simple Archive export. It would look like this:

- Reviews the contents manifest and identifies whether the object is compound or single-file
  - Probably based on the number of files that are not .txt or license_rdf
- Processes single-file objects and puts them into new directories according to file extension
- Processes compound objects in a separate Compound directory, appropriately structured
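
As a rough sketch of the first step, the check might count manifest entries that are neither license files nor .txt files. The names `payloadFiles()` and `isCompound()` are hypothetical, the manifest is assumed to be tab-delimited with the filename first, and an object whose main file is itself a .txt would be misclassified by this heuristic.

```php
<?php
/**
 * Returns the manifest entries that look like payload files, i.e. everything
 * except license files and .txt files (assumed to be extracted text).
 */
function payloadFiles(array $manifestLines): array
{
    $payload = array();
    foreach ($manifestLines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        $fields = preg_split('/\t+/', $line);
        $filename = $fields[0];
        // Covers license.txt, license.rdf and license_rdf.
        $isLicense = strpos($filename, 'license') === 0;
        $isText = strtolower(pathinfo($filename, PATHINFO_EXTENSION)) === 'txt';
        if (!$isLicense && !$isText) {
            $payload[] = $filename;
        }
    }
    return $payload;
}

/** An object is treated as compound when it has more than one payload file. */
function isCompound(array $manifestLines): bool
{
    return count(payloadFiles($manifestLines)) > 1;
}

$manifest = array(
    "Games_for_learning_2005.pdf\tbundle:ORIGINAL",
    "Games_for_learning_2005_slides.pdf\tbundle:ORIGINAL",
    "license.txt\tbundle:LICENSE",
    "Games_for_learning_2005.pdf.txt\tbundle:TEXT\tdescription:Extracted text",
    "Games_for_learning_2005_slides.pdf.txt\tbundle:TEXT\tdescription:Extracted text",
);
var_dump(isCompound($manifest)); // bool(true)
```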

Let's get the single-file toolchain merged first, though.

mjordan commented 4 years ago

Agreed, let's start simple and expand later. From my reckoning we now have two new components that could be used in other toolchains:

- Filesystem_Subdirectories fetcher
- Xslt metadata parser

and two new components that are specific to the DSpace Simple Archive Format:

- SimpleArchive filegetter
- SimpleArchive writer

Let's do some cleanup (remove var_dump(), etc.) and code style checking and start some docs.

mjordan commented 4 years ago

@bondjimbond when you say "Processes single-file objects and puts them into new directories according to file extension" in your "Ideally...." list, are you suggesting that the Simple Archive toolchain be used with https://github.com/mjordan/islandora_batch_with_derivs?

mjordan commented 4 years ago

One way to approach the compound requirement is to make only the filegetter and the writer aware that some items might need to be split out into compounds. That way, the general components, the fetcher and metadata parser, won't even need to know about compounds in the input.

Specifically, the filegetter currently parses the manifest, which is where the distinction between compound and single is expressed in the incoming items. Also, the writer could be made to create a "compounds" directory and write the required files there.

bondjimbond commented 4 years ago

> when you say "Processes single-file objects and puts them into new directories according to file extension" in your "Ideally...." list, are you suggesting that the Simple Archive toolchain be used with https://github.com/mjordan/islandora_batch_with_derivs?

Not necessarily... Just to create Islandora Batch-ready packages.

e.g. in my export, I have PDFs, videos, MS Office files, and images. Batch only handles one CModel at a time, so I want to send each object to a subdirectory that reflects its CModel. I think you created a post-write hook for this?

However, it would be nice to have the option to include the .pdf.txt file (the OCR) for use with islandora_batch_with_derivs if the user so chooses. (Nice-to-have, not required.)

bondjimbond commented 4 years ago

> One way to approach the compound requirement is to make only the filegetter and the writer aware that some items might need to be split out into compounds. That way, the general components, the fetcher and metadata parser, won't even need to know about compounds in the input.
>
> Specifically, the filegetter currently parses the manifest, which is where the distinction between compound and single is expressed in the incoming items. Also, the writer could be made to create a "compounds" directory and write the required files there.

Yes! Just need to make sure that each file in the compound gets a copy of the transformed MODS.

mjordan commented 4 years ago

Oh yeah, sorry, I understand "according to file extension" now. https://github.com/MarcusBarnes/mik/blob/master/extras/scripts/postwritehooks/move_packages_by_extension.php is the post-write hook script that does that.

mjordan commented 4 years ago

Just thought of something: before we add the Filesystem_Subdirectories fetcher, let's see what it would take to make the existing Filesystem fetcher do the same thing if an .ini option was set. It might just be a few if/else blocks in the Filesystem fetcher code.

bondjimbond commented 4 years ago

It's probably doable. In Filesystem_Subdirectories there are just a few changes...

- Removed $this->record_key = 'ID'; at line 19
- Added empty array $records = array(); at line 43
- Substantive changes to lines 49-50 (is_dir instead of is_file, change to how $record->key is set)
- Lines 55-67: Commented out all of the $filtered_records stuff because it was causing problems; just returning $records now
- Lines 99-102: Completely changed how $record is set

All of these changes could theoretically be made dependent on some .ini options instead of having a whole new fetcher.
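
To illustrate what "dependent on an .ini option" might look like, here is a standalone sketch, not the actual Filesystem fetcher code. The function `listRecordKeys()` and the `$useSubdirectories` flag are hypothetical, and the assumption that file-based record keys are filenames without their extension is made only for the example.

```php
<?php
/**
 * Lists record keys from an input directory.
 * When $useSubdirectories is true (imagined as coming from an .ini setting),
 * each subdirectory is a record; otherwise each file is a record.
 */
function listRecordKeys(string $inputDirectory, bool $useSubdirectories): array
{
    $keys = array();
    foreach (scandir($inputDirectory) as $entry) {
        if ($entry === '.' || $entry === '..') {
            continue;
        }
        $path = $inputDirectory . DIRECTORY_SEPARATOR . $entry;
        if ($useSubdirectories) {
            // Subdirectory mode: the directory name is the record key.
            if (is_dir($path)) {
                $keys[] = $entry;
            }
        } elseif (is_file($path)) {
            // File mode: assumed key is the filename without its extension.
            $keys[] = pathinfo($entry, PATHINFO_FILENAME);
        }
    }
    return $keys;
}
```

A single branch like this is small, but every such branch is another path that would need testing against existing Filesystem toolchains.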

mjordan commented 4 years ago

Let me take a stab at a single version of the fetcher over the weekend.

bondjimbond commented 4 years ago

Any luck @mjordan?

mjordan commented 4 years ago

I've flip flopped on this. The new toolchain seems to be pretty much complete and I don't think it's worth combining the two fetchers, since we'd need to test that the integrated Filesystem fetcher still works as it currently does. Plus, I'm going on vacation for two weeks and that will just hold you up even more.

bondjimbond commented 4 years ago

Sounds good to me. So I will commit my changes and submit a PR.

MarcusBarnes / mik

Simple Archive toolchain #501