delving / narthex

Performs bulk dataset-processing for the Delving platform.
Apache License 2.0

SIP-Zip combined with OAI-PMH harvest #46

Closed geralddejong closed 9 years ago

geralddejong commented 9 years ago

Background

Originally, migrating datasets did not involve switching from the current target format to EDM; it involved more straightforward mapping modifications that kept the same target format. The mappings were based on SIP-Creator mappings in which different selections of RecordRoot and UniqueID were made, depending on where the data is found in the source record.

Difficulties arose with OAI-PMH harvested datasets due to a discrepancy in practice: sometimes the records in an OAI-PMH page contain a "record" node inside the "metadata" tag, and sometimes there is no "record" node, just the fields of the record directly within "metadata". In the latter case, Narthex intervened and inserted a wrapping node called "sip-record" to unify the two cases for further processing. This also required the user to choose beforehand which of the two cases applied to the harvested data, which was tricky to determine (if not known, it required inspecting the harvest in a browser first).
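To make the unification step concrete, here is a minimal Python sketch of wrapping bare fields in a "sip-record" node. It is not Narthex's actual code: the function name `wrap_bare_fields` is hypothetical, and the automatic detection (a single child named "record") is an assumption, since in practice the user had to declare which case applied.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def wrap_bare_fields(metadata: ET.Element, wrapper_tag: str = "sip-record") -> ET.Element:
    """Ensure <metadata> has exactly one record-like child.

    Assumption: a single child named 'record' means the page already has a
    record node. Anything else is treated as bare fields and is moved under
    a new wrapper element.
    """
    children = list(metadata)
    if len(children) == 1 and local_name(children[0].tag) == "record":
        return metadata  # already in the "record inside metadata" shape
    wrapper = ET.Element(wrapper_tag)
    for child in children:
        metadata.remove(child)
        wrapper.append(child)
    metadata.append(wrapper)
    return metadata
```

After this step both shapes expose a single container under "metadata", so later processing can treat them the same way.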

Now mappings are being generated by an external process from previously created mappings with a transformation of paths as well as some generation of code snippets.

SIP-Zip combined with Harvest

SIP-Zip files for testing and initial migrations have so far always contained data stored in source.xml.gz, which is "adopted" upon dropping the SIP-Zip, allowing the whole process to proceed. It was, and still is, also an option to initiate a dataset by harvesting (i.e. with no mapping yet), then go through the normal process of building the mapping, saving it, and proceeding from there. Combining SIP-Zip drop and harvest worked originally (in fact, the first harvest used to be triggered automatically when harvest info was available and there was no source), but that has been disabled since the move to mapping everything to EDM.

Generated EDM Mappings combined with Harvest

When the external process (IPython Notebook) is used to generate mapping files on the basis of existing SIP-Creator mappings, it has to make some assumptions based on the differences between SIP-Creator harvested data and the result of a Narthex harvest. So far the SIP-Zip files have contained source data (structured as "pockets", even including the "sip-record" encapsulation described above), but for this approach they should not contain source. The source data must be harvested only, and the SIP-Zip can contain the harvest information (so that the Narthex fields for that can be automatically filled).

Narthex harvests "pages" and stores them as-is in its Source Repository, meaning the pages returned from the OAI-PMH requests are not modified in any way. (The SIP-Creator, by contrast, created a file with records wrapped in the tag <sip-harvest>...</sip-harvest>.)

This means that the records can always be delimited in the following way:

recordRoot = "/OAI-PMH/ListRecords/record",
uniqueId = "/OAI-PMH/ListRecords/record/header/identifier",

This also means that the externally generated mappings must be built with paths which assume that "record" is the "recordRoot", implying that all paths in the mapping must navigate down through "/input/metadata/..." to refer to the data elements.
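Since the mappings are generated from a notebook, the path transformation could look something like the Python sketch below. It rests on assumptions: that the old SIP-Creator paths hang directly under "/input", and that every such path should be moved under "/input/metadata". The function name `rebase_path` is hypothetical, and real mappings may need per-dataset rules.

```python
def rebase_path(path: str, old_root: str = "/input", new_root: str = "/input/metadata") -> str:
    """Move a mapping path from the old record root to under 'metadata'.

    Assumption: old paths look like '/input/dc:title' and the new ones must
    look like '/input/metadata/dc:title'. Paths already under new_root are
    returned unchanged so the function can be applied twice safely.
    """
    if path == new_root or path.startswith(new_root + "/"):
        return path
    if path == old_root:
        return path  # the record root itself stays where it is
    if path.startswith(old_root + "/"):
        return new_root + path[len(old_root):]
    raise ValueError(f"path is not under {old_root}: {path}")
```

For example, `rebase_path("/input/dc:title")` yields `"/input/metadata/dc:title"`, while a path that does not start at the record root is rejected rather than silently passed through.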

Narthex can then make the following assumptions whenever OAI-PMH harvests are done:

val PMH = HarvestType(
  name = "pmh",
  recordRoot = "/OAI-PMH/ListRecords/record",
  uniqueId = "/OAI-PMH/ListRecords/record/header/identifier",
  recordContainer = None
)
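As an illustration of what these two paths select, here is a small Python sketch that splits one harvested page into (identifier, record) pairs. It is only a sketch, not the Narthex implementation: the function name `split_page` is hypothetical, and it assumes the page uses the standard OAI-PMH 2.0 namespace.

```python
import xml.etree.ElementTree as ET

# Standard OAI-PMH 2.0 namespace; assumed to be the default namespace of the page.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def split_page(page_xml: str):
    """Yield (unique id, record element) for each record in one harvested page.

    recordRoot = /OAI-PMH/ListRecords/record
    uniqueId   = /OAI-PMH/ListRecords/record/header/identifier
    The leading /OAI-PMH step is the root element itself, so the ElementTree
    paths below start at ListRecords.
    """
    root = ET.fromstring(page_xml)
    for record in root.findall("oai:ListRecords/oai:record", OAI_NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=OAI_NS)
        yield identifier, record
```

Because the page is stored unmodified, the same split can be repeated later with different paths without harvesting again.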

There has been discussion of allowing the recordRoot and uniqueId to be chosen again any time a harvest is done, but since the content of OAI-PMH harvests is always consistent in this respect (as is the AdLib harvest), the best approach would be to simply ensure that all of the generated mappings respect recordRoot = "/OAI-PMH/ListRecords/record" and create paths accordingly.

geralddejong commented 9 years ago

The remaining issue appears to be the migration of data using existing mappings, since the OAI-PMH header identifier and record root may not be the ones already in use. Solving this will involve a change in the way the first harvest is done. It would be best if it harvested one single page, analyzed it, and then allowed the recordRoot and uniqueId to be adjusted (from the defaults) before a full harvest is done.

This would already be much better than the current harvest workflow, which is somewhat blind to what format is being made available and to whether the server is working at all.
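The "analyze one page" step could be sketched as follows in Python. This is one possible approach and not an existing Narthex feature: it lists the element paths inside each record that occur exactly once per record and whose values differ across all records on the page, which makes them candidates for uniqueId. The function names are hypothetical, and namespace prefixes are dropped from the reported paths for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def leaf_values(elem: ET.Element, prefix: str = ""):
    """Yield (path, text) for every childless element with non-empty text."""
    path = f"{prefix}/{local_name(elem.tag)}"
    children = list(elem)
    if not children:
        text = (elem.text or "").strip()
        if text:
            yield path, text
    for child in children:
        yield from leaf_values(child, path)


def candidate_unique_ids(page_xml: str) -> list:
    """Return record-relative paths whose values are unique on this page."""
    root = ET.fromstring(page_xml)
    records = [
        rec
        for lst in root if local_name(lst.tag) == "ListRecords"
        for rec in lst if local_name(rec.tag) == "record"
    ]
    seen = defaultdict(list)  # path -> one list of values per record
    for record in records:
        per_record = defaultdict(list)
        for path, text in leaf_values(record):
            per_record[path].append(text)
        for path, texts in per_record.items():
            seen[path].append(texts)
    return sorted(
        path
        for path, groups in seen.items()
        if len(groups) == len(records)                      # present in every record
        and all(len(g) == 1 for g in groups)                # exactly once per record
        and len({g[0] for g in groups}) == len(records)     # distinct across records
    )
```

A single page is only a sample, so a path that looks unique here may still collide in a full harvest; the result is a suggestion for the user to confirm, not a decision.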

geralddejong commented 9 years ago

Here is an example case where the identifier delivered by OAI-PMH may be very different from the one that should be used for generating URIs:

    <record>
        <header>
            <identifier>00000844-73b9-11e4-a7d8-4f532f726acd</identifier>
        </header>
        <metadata>
            <oai_dc:dc ...>
                <dc:identifier>K18737</dc:identifier>
                ...
            </oai_dc:dc>
        </metadata>
    </record>
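A short Python sketch of how a configurable uniqueId path would pick one identifier or the other from such a record. The sample below reuses the two identifier values from the example; the namespace declarations on `oai_dc:dc` are an assumption (the standard oai_dc and Dublin Core URIs), because the original attributes are elided, and the helper `unique_id` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces: the standard oai_dc and Dublin Core element URIs.
NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

SAMPLE_RECORD = """
<record>
  <header>
    <identifier>00000844-73b9-11e4-a7d8-4f532f726acd</identifier>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:identifier>K18737</dc:identifier>
    </oai_dc:dc>
  </metadata>
</record>
"""


def unique_id(record: ET.Element, path: str) -> str:
    """Read the identifier at a record-relative path, failing loudly if absent."""
    value = record.findtext(path, namespaces=NS)
    if value is None or not value.strip():
        raise ValueError(f"no identifier found at {path}")
    return value.strip()


record = ET.fromstring(SAMPLE_RECORD)
header_id = unique_id(record, "header/identifier")                 # OAI-PMH default
local_id = unique_id(record, "metadata/oai_dc:dc/dc:identifier")   # adjusted choice
```

Here `header_id` is the opaque OAI-PMH identifier and `local_id` is the value that the existing mappings would more plausibly use for URIs.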