FLVC / offline-ingest

A rubydora application to do digitool migrations, and eventually affiliate-submitted ingests, into floridora

batch loading of newspapers #17

Closed lydiam closed 6 months ago

lydiam commented 9 years ago

DISC has requested that we develop a mechanism for batch loading newspapers. FSU has found that it takes approx. 11 hours to load one issue of "Flambeau" (loading of individual pages takes a very long time). In addition, UF has indicated a desire to migrate the Florida Digital Newspaper Library (approx 100,000 issues from UF and other institutions) from Sobek into Islandora.

In an 11/5/2015 meeting with Gail, Randy, Lydia, the following was proposed:

lydiam commented 9 years ago

This programming needs specs that need to be presented to ISG before work begins.

grf commented 9 years ago

See: http://wiki.fcla.edu/wiki/index.php/DL:Unattended_Batch_Ingest_Newspapers

lydiam commented 9 years ago

Randy will provide a sample package & Lydia will take the high-level specs to ISG.

lydiam commented 9 years ago

This is the version approved by ISG:

FL-Islandora Newspaper Batch Load Specifications: Phase I

This document specifies the unattended batch processing of newspaper documents for Islandora. The system will be developed in two phases, the first of which is addressed here. This is Version 1.0 of the specification; subsequent specification documents will be issued.

The first phase of this system will require that the top-level newspaper object be ingested interactively through the Islandora web site as usual, and that the digital ID of the ingested object be recorded for use in the unattended batch load of issues and individual documents. In the examples below, we use the ID "fsu:109142" for the "Florida Flambeau". Top-level newspaper objects are assigned the type "islandora:newspaperCModel".

Issues will be delivered in packages, each consisting as usual of one directory named using the institution's unique package naming scheme. We use the example "FSUPACKAGE01" below. Packages may be delivered using FLVC's FTP server and queued for processing. Other arrangements may be made for delivery to FLVC for staff processing.

A delivered package contains 3 + N files: 3 XML metadata files and an arbitrary number (N) of image files making up the serial data.

1) A MODS file, using the directory (package) name MUST be supplied. For the above example the filename would be "FSUPACKAGE01.xml".

The MODS file MUST include the date issued for the serial:

   <dateIssued encoding="w3cdtf">1915-01-23</dateIssued>

Any additional MODS metadata will be displayed in the "Issue Details" tab of the issue. Note that the "Newspaper Details" tab will always display the serial title metadata, so issue-level metadata can be as brief or extensive as preferred.

The MODS file MAY include a language entry; that language will be used when performing OCR text extraction on supplied images. If omitted, English will be used. Currently installed language support includes English (eng) and Italian (itl); more will be added as necessary. The complete list of available supported languages can be viewed at

   https://code.google.com/p/tesseract-ocr/downloads/list

In MODS, languages are declared as in the following stanza:

   <language>
      <languageTerm type="text" authority="iso639-2b">English</languageTerm>
      <languageTerm type="code" authority="iso639-2b">eng</languageTerm>
   </language>
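As a rough illustration of the fallback rule above, here is a minimal sketch that pulls the language code out of a MODS document and defaults to English. It is not taken from the offline-ingest code: the helper name `ocr_language` and the constant `DEFAULT_OCR_LANGUAGE` are hypothetical, and it assumes the code is carried in a `languageTerm` element with `type="code"`, as in the stanza above. It uses REXML, which ships with Ruby (as a bundled gem since Ruby 3.0).

```ruby
require 'rexml/document'

# Assumption: English is the fallback, as stated in the spec text above.
DEFAULT_OCR_LANGUAGE = 'eng'

# Hypothetical helper: returns the first non-empty <languageTerm type="code">
# value found anywhere in the MODS document, or the default when none exists.
def ocr_language(mods_xml)
  doc = REXML::Document.new(mods_xml)
  doc.root.each_recursive do |element|
    next unless element.name == 'languageTerm'
    next unless element.attributes['type'] == 'code'

    code = element.text.to_s.strip
    return code unless code.empty?
  end
  DEFAULT_OCR_LANGUAGE
end
```

Matching on the local element name keeps the sketch independent of whether the MODS file uses a namespace prefix.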

A PURL entry MAY be included; otherwise one will be generated automatically:

   <location displayLabel="purl">
       <url>http://purl.flvc.org/fsu/fd/FSU_Flambeau_01231915</url>
   </location>

2) The "manifest.xml" includes ancillary information required for processing. It is not directly saved by the islandora system.

     <?xml version="1.0" encoding="UTF-8"?>
     <manifest xmlns="info:flvc/manifest/v1">
        <contentModel>islandora:newspaperIssueCModel</contentModel>
        <collection>fsu:109142</collection>
        <submittingInstitution>FSU</submittingInstitution>
        <owningInstitution>FSU</owningInstitution>
        <owningUser>John Smith</owningUser>
     </manifest>

No new elements have been added to the "manifest.xml" file; as usual, an "owningUser" element giving the log-in name of a valid Islandora user must be included.

The "contentModel" element MUST be "islandora:newspaperIssueCModel" and the "collection" element MUST be the digital ID of the top-level newspaper object to which this serial belongs ("fsu:109142" in our example). The digital id MUST exist and be of type "islandora:newspaperCModel".
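The manifest rules above could be checked along the following lines. This is only a sketch under stated assumptions: `manifest_errors`, `REQUIRED_ELEMENTS`, and `REQUIRED_CONTENT_MODEL` are hypothetical names, and only the three elements the text explicitly calls out are treated as required. Confirming that the collection ID exists and is of type "islandora:newspaperCModel" needs a repository lookup, which is deliberately left out.

```ruby
require 'rexml/document'

# Assumption: these are the elements the spec text above calls out explicitly.
REQUIRED_ELEMENTS = %w[contentModel collection owningUser].freeze
REQUIRED_CONTENT_MODEL = 'islandora:newspaperIssueCModel'

# Hypothetical helper: returns a list of error strings (empty when valid).
def manifest_errors(manifest_xml)
  doc = REXML::Document.new(manifest_xml)

  # Collect the text of each child element, keyed by its local name.
  fields = {}
  doc.root.elements.each { |e| fields[e.name] = e.text.to_s.strip }

  errors = []
  REQUIRED_ELEMENTS.each do |name|
    errors << "manifest.xml is missing a #{name} element" if fields[name].to_s.empty?
  end

  model = fields['contentModel'].to_s
  if !model.empty? && model != REQUIRED_CONTENT_MODEL
    errors << "contentModel must be #{REQUIRED_CONTENT_MODEL} but is #{model}"
  end
  errors
end
```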

3) A collection of N image files. All images in a package must be in exactly one of the formats TIFF, JPEG, or JP2K.

4) A "mets.xml" file that describes the structure of the issue MUST be included. The "mets.xml" file will be used to sequence the supplied images into pages, as well as generate a table of contents for display, similar to that used by the Book package.

The relevant parts of the "mets.xml" file are the "fileSec" and "structMap" sections; as an example, if two images were supplied the "mets.xml" file could contain:

   <METS:fileSec>
      <METS:fileGrp USE="reference">
         <METS:file GROUPID="G1" ID="TIF1" MIMETYPE="image/tiff">
            <METS:FLocat LOCTYPE="OTHER" xlink:href="page_001.tif" />
         </METS:file>
         <METS:file GROUPID="G1" ID="TIF2" MIMETYPE="image/tiff">
            <METS:FLocat LOCTYPE="OTHER" xlink:href="page_002.tif" />
         </METS:file>
         ....
      </METS:fileGrp>
   </METS:fileSec>

   <METS:structMap LABEL="View" TYPE="TIFF">
      <METS:div DMDID="DMD1">
         <METS:div LABEL="Florida Flambeau, January 23, 1915" TYPE="chapter">
            <METS:div LABEL="Page 01" TYPE="page">
               <METS:fptr FILEID="TIF1"/>
            </METS:div>
            <METS:div LABEL="Page 02" TYPE="page">
               <METS:fptr FILEID="TIF2"/>
            </METS:div>
            ....
         </METS:div>
      </METS:div>
   </METS:structMap>

Currently, no use is made of "GROUPID" or "LOCTYPE". Sequence information is implicit in the order in which pages appear in the "structMap" section. Any supplied labels are used for displaying the table of contents and for page-object labeling; they are never used in other Islandora metadata.
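To make the sequencing rule concrete, the sketch below derives an ordered page list from a "mets.xml" document by walking the `fptr` elements in document order and resolving each `FILEID` through the `fileSec`. The helper name `page_sequence` is hypothetical and this is not the ingest program's own parser; it assumes the METS and xlink namespace prefixes are declared in the file and that each page `div` directly contains its `fptr`.

```ruby
require 'rexml/document'

# Hypothetical helper: returns [[label, filename], ...] in structMap order.
def page_sequence(mets_xml)
  doc = REXML::Document.new(mets_xml)
  hrefs = {}   # file ID => image filename, taken from the fileSec
  pages = []   # [label, file ID] pairs, in the order the fptr elements appear

  doc.root.each_recursive do |element|
    case element.name
    when 'file'
      flocat = element.elements.to_a.find { |child| child.name == 'FLocat' }
      hrefs[element.attributes['ID']] = flocat.attributes['xlink:href'] if flocat
    when 'fptr'
      # The label lives on the enclosing page-level div.
      pages << [element.parent.attributes['LABEL'], element.attributes['FILEID']]
    end
  end

  # Resolve IDs last, so the result does not depend on fileSec coming first.
  pages.map { |label, file_id| [label, hrefs[file_id]] }
end
```

A `nil` filename in the result would indicate an `fptr` that references a file ID missing from the `fileSec`, which a caller might report as a package error.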

5) As usual, after a package ingest completes, an entry will be added to the "admin.institution.digital.flvc.org" administration site. Any errors or warnings encountered during ingest will be listed there.
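Items 1 to 4 above imply a package layout that can be checked mechanically. The following sketch sorts a package's filenames into required metadata, images, and unexpected files, and flags mixed image formats. The names `classify_package` and `IMAGE_EXTENSIONS` are hypothetical, and recognizing formats by file extension is an assumption made to keep the example self-contained; the real program may inspect file contents instead.

```ruby
# Assumption: formats are recognized by extension for the sake of the example.
IMAGE_EXTENSIONS = {
  '.tif' => :tiff, '.tiff' => :tiff,
  '.jpg' => :jpeg, '.jpeg' => :jpeg,
  '.jp2' => :jp2k
}.freeze

# Hypothetical helper: package_name is the directory name, filenames its contents.
def classify_package(package_name, filenames)
  metadata = ["#{package_name}.xml", 'manifest.xml', 'mets.xml']
  result = { missing: metadata - filenames, images: [], unexpected: [], errors: [] }

  (filenames - metadata).sort.each do |name|
    if IMAGE_EXTENSIONS.key?(File.extname(name).downcase)
      result[:images] << name
    else
      result[:unexpected] << name
    end
  end

  # The spec allows exactly one image format per package.
  formats = result[:images].map { |name| IMAGE_EXTENSIONS[File.extname(name).downcase] }.uniq
  result[:errors] << "images use more than one format: #{formats.join(', ')}" if formats.size > 1
  result
end
```

For a package directory "FSUPACKAGE01", the `missing` list would name whichever of "FSUPACKAGE01.xml", "manifest.xml", and "mets.xml" is absent.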

lydiam commented 9 years ago

[lydiam@islandorat SF10327872_0060_001]$ package --test --server usf-test /ssa/d2i/USF_flaming_sword_converted/SF10327872_0060_001
Processing 1 package: /ssa/d2i/USF_flaming_sword_converted/SF10327872_0060_001
Invalid package in /ssa/d2i/USF_flaming_sword_converted/SF10327872_0060_001. 0.00 sec, 0.00 MB
NewspaperIssuePackage::SF10327872_0060_001 (no pid) => collection: usf:565, "The Flaming sword: Volume 60 Number 1"
Errors:
The package MODS file has a badly formatted w3cdtf-encoded dateIssued element: it should be of the form 'YYYY-MM-DD' but is '1946-01'

Warnings:
Multiple structMaps found in METS file, discarding the shortest (least number of referenced files).
The Newspaper Issue Package SF10327872_0060_001 has the following 2 unexpected files that will not be processed:
 - file1.pdf
 - rename.txt

lydiam commented 9 years ago

It looks like ISO 8601/w3cdtf allows 6 levels of encoding, and in my opinion we should accept both YYYY-MM-DD and YYYY-MM:

From http://www.w3.org/TR/NOTE-datetime-970915.html:

Formats

Different standards may need different levels of granularity in the date and time, so this profile defines six levels. Standards that reference this profile should specify one or more of these granularities. If a given standard allows more than one granularity, it should specify the meaning of the dates and times with reduced precision, for example, the result of comparing two dates with different precisions.

The formats are as follows. Exactly the components shown here must be present, with exactly this punctuation. Note that the "T" appears literally in the string, to indicate the beginning of the time element, as specified in ISO 8601.

Year: YYYY (eg 1997)
Year and month: YYYY-MM (eg 1997-07)
Complete date: YYYY-MM-DD (eg 1997-07-16)
Complete date plus hours and minutes: YYYY-MM-DDThh:mmTZD (eg 1997-07-16T19:20+01:00)
Complete date plus hours, minutes and seconds: YYYY-MM-DDThh:mm:ssTZD (eg 1997-07-16T19:20:30+01:00)
Complete date plus hours, minutes, seconds and a decimal fraction of a second: YYYY-MM-DDThh:mm:ss.sTZD (eg 1997-07-16T19:20:30.45+01:00)

where:

 YYYY = four-digit year
 MM   = two-digit month (01=January, etc.)
 DD   = two-digit day of month (01 through 31)
 hh   = two digits of hour (00 through 23) (am/pm NOT allowed)
 mm   = two digits of minute (00 through 59)
 ss   = two digits of second (00 through 59)
 s    = one or more digits representing a decimal fraction of a second
 TZD  = time zone designator (Z or +hh:mm or -hh:mm)

This profile does not specify how many digits may be used to represent the decimal fraction of a second. An adopting standard that permits fractions of a second must specify both the minimum number of digits (a number greater than or equal to one) and the maximum number of digits (the maximum may be stated to be "unlimited").
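To show what accepting more than one granularity might look like, here is a minimal sketch that recognizes the three date-only levels from the profile quoted above and reports which one a value uses. The names `W3CDTF_DATE` and `w3cdtf_granularity` are hypothetical, the three time-bearing levels are intentionally not handled, and this is not the check the package program actually performs.

```ruby
require 'date'

# Matches YYYY, YYYY-MM, or YYYY-MM-DD with exactly the punctuation shown.
W3CDTF_DATE = /\A(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?\z/

# Hypothetical helper: returns :year, :month, or :day for a well-formed,
# calendar-valid value, and nil for anything else.
def w3cdtf_granularity(value)
  match = W3CDTF_DATE.match(value)
  return nil unless match

  year, month, day = match.captures.map { |part| part && part.to_i }
  return :year if month.nil?
  return nil unless (1..12).cover?(month)
  return :month if day.nil?

  Date.valid_date?(year, month, day) ? :day : nil
end
```

A caller that insists on full dates, as the error message earlier in this thread does, would accept only `:day`; a more lenient one could also accept `:month`, which is the case for the rejected value '1946-01'.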

grf commented 9 years ago

On Mon, May 18, 2015 at 9:38 AM, Lydia Motyka notifications@github.com wrote:

It looks like ISO 8601/w3cdtf allows 6 levels of encoding, and in my opinion we should accept both YYYY-MM-DD and YYYY-MM:

From http://www.w3.org/TR/NOTE-datetime-970915.html:

I'll see your W3.ORG and raise you one LOC.GOV:

From http://loc.gov/standards/mods/v3/mods-userguide-generalapp.html

Date Attributes. Certain date attributes may be applied to some MODS elements, as indicated in the schema. They are defined below:

lydiam commented 9 years ago

You are correct. I read over the MODS document but hadn’t picked up on the subtleties. No change is required in that case, although at some point we may want to consider enhancing the code to also accept dateIssued elements with no encoding attribute.

I think that the code may be ready for all Islandora servers now, but I have one question about testing: you said that the newspaper code now varies a lot more from the bookCM code than originally anticipated. Is there any reason to think that the code to load newspaper issues impacts the books code? In other words: do we need to re-test loads of book CM materials and/or any other materials?

Lydia

grf commented 9 years ago

A retest would be good. Retesting books as well is necessary.

grf commented 9 years ago

Admin interface fixed (the content-model parameter was being dropped by the web service controller).

Ready to retest and close this thread.

lydiam commented 9 years ago

This afternoon I test-loaded some additional newspaper issues and a book, and all loaded correctly, and the admin interface Content Type filter correctly identified and displayed objects in each content model. I believe that the code is ready for production early next week.

lydiam commented 9 years ago

Enhancement request: for Book CM packages with no mets.xml file the package program gives the following error: The Book Package UF00004152_00001 doesn't contain a mets.xml file. Newspaper Issue CM packages without mets.xml files should give the same error.

Currently the package program, when encountering a newspaper issue package without a mets.xml file, does the following:

[lydiam@islandorat ~]$ package --server usf-test /ssa/d2i/USF_flaming_sword_converted/SF10327872_0062_001
Processing 1 package: /ssa/d2i/USF_flaming_sword_converted/SF10327872_0062_001
Invalid package in /ssa/d2i/USF_flaming_sword_converted/SF10327872_0062_001. 0.00 sec, 0.00 MB
NewspaperIssuePackage::SF10327872_0062_001 (no pid) => no collections, ""
Errors:
The IID for this package, SF10327872_0062_001, is alreading being used for islandora object usf:659. The IID must be unique.
Exception RuntimeError - file '/ssa/d2i/USF_flaming_sword_converted/SF10327872_0062_001/1.jpg/10.jpg/11.jpg/12.jpg/2.jpg/3.jpg/4.jpg/5.jpg/6.jpg/7.jpg/8.jpg/9.jpg' not found for Newspaper Issue Package SF10327872_0062_001, backtrace follows:
/usr/local/islandora/offline-ingest/lib/offin/utils.rb:881:in `mime_type'
/usr/local/islandora/offline-ingest/lib/offin/packages.rb:1012:in `check_page_types'
/usr/local/islandora/offline-ingest/lib/offin/packages.rb:1010:in `each'
/usr/local/islandora/offline-ingest/lib/offin/packages.rb:1010:in `check_page_types'
/usr/local/islandora/offline-ingest/lib/offin/packages.rb:1416:in `initialize'
/usr/local/islandora/offline-ingest/lib/offin/packages.rb:80:in `new'
/usr/local/islandora/offline-ingest/lib/offin/packages.rb:80:in `new_package'
/usr/local/bin/package:57
/usr/local/bin/package:51:in `each'
/usr/local/bin/package:51

Warnings:
The Newspaper Issue Package SF10327872_0062_001 has the following 15 unexpected files that will not be processed:
 - 1.jpg
 - 10.jpg
 - 11.jpg
 - 12.jpg
 - 2.jpg
 - 3.jpg
 - 4.jpg
 - 5.jpg
 - 6.jpg
 - 7.jpg
 - 8.jpg
 - 9.jpg
 - file1.pdf
 - notmets
 - rename.txt

grf commented 9 years ago

On Fri, May 22, 2015 at 4:53 PM, Lydia Motyka notifications@github.com wrote:

Enhancement request: for Book CM packages with no mets.xml file the package program gives the following error: The Book Package UF00004152_00001 doesn't contain a mets.xml file. Newspaper Issue CM packages without mets.xml files should give the same error.

I thought we'd agreed that METS files were optional for newspapers, that we'd sort by filename as best we could.

That is an error in the filename processing though.

I'll make it be an error.

-Randy
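For reference, sorting pages by filename "as best we could" runs into the ordering visible in the listing above, where a plain string sort places 10.jpg before 2.jpg. One common approach is a natural sort that compares digit runs numerically. The sketch below is only an illustration of that idea; `natural_key` and `natural_sort` are hypothetical names, not functions from offline-ingest.

```ruby
# Splits a filename into digit and non-digit runs so that digit runs compare
# as integers. Each chunk becomes a three-element array with a fixed shape,
# which keeps array comparison well defined between any two chunks.
def natural_key(filename)
  filename.scan(/\d+|\D+/).map do |chunk|
    chunk.match?(/\A\d+\z/) ? [0, chunk.to_i, ''] : [1, 0, chunk.downcase]
  end
end

# Hypothetical helper: returns the filenames in natural order.
def natural_sort(filenames)
  filenames.sort_by { |name| natural_key(name) }
end
```

Zero-padded names such as page_001.tif and page_010.tif already sort correctly as plain strings, so the natural key only matters for unpadded names like those in the package shown earlier.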

lydiam commented 9 years ago

All of the specs I've read indicate that METS is required. Making METS optional would be a nice enhancement. At the moment the program requires a METS file, as far as I can determine.

grf commented 9 years ago

On Fri, May 22, 2015 at 5:26 PM, Lydia Motyka notifications@github.com wrote:

All of the specs I've read indicate that METS is required. Making METS optional would be a nice enhancement. At the moment the program requires a METS file, as far as I can determine.

Well, that was a bug - it now operates correctly without METS.

That being said, I have inserted a check for METS; a package without a mets.xml file will now error out. We can always make METS optional again.

lydiam commented 9 years ago

I just discovered a problem: by mistake I changed the collection in a manifest to point at a newspaper parent, but the contentModel was still "bookCModel". The book loaded without error as a child of the newspaper title, but it does not display. I can delete the book and redo the manifest file, but I was on the verge of submitting a batch of newspaper issues and consequently might have done a lot of unnecessary loading and subsequent deleting. It would be useful for the program to give an error when the collection is a PID but the contentModel is not a newspaper issue.
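The check being requested could be expressed as a small compatibility table between the parent's content model and the package's content model. The sketch below assumes the parent's content model has already been looked up in the repository and is passed in; `ALLOWED_CHILD_MODELS` and `parent_child_error` are hypothetical names, and the table lists only the newspaper rule described in this thread.

```ruby
# Assumption: only the newspaper restriction is encoded here; parents that are
# not listed are treated as unrestricted.
ALLOWED_CHILD_MODELS = {
  'islandora:newspaperCModel' => ['islandora:newspaperIssueCModel']
}.freeze

# Hypothetical helper: returns an error string, or nil when the pairing is fine.
def parent_child_error(parent_model, child_model)
  allowed = ALLOWED_CHILD_MODELS[parent_model]
  return nil if allowed.nil? || allowed.include?(child_model)

  "a #{child_model} object cannot be loaded into a #{parent_model} parent; " \
    "expected one of: #{allowed.join(', ')}"
end
```

With a table like this, the book package described above would be rejected before any loading takes place, because "islandora:bookCModel" is not listed under "islandora:newspaperCModel".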

grf commented 9 years ago

On Tue, Jun 2, 2015 at 12:01 PM, Lydia Motyka notifications@github.com wrote:

It would be useful for the program to give an error when the collection is a PID but the contentModel is not a newspaper issue.

Well, I'm a bit confused: the collection id should be a newspaperCModel - and it appears to be checked - the issue PID is created and content model assigned as newspaperIssueCModel.

So I'll need to see an example bad manifest and package from you.

lydiam commented 9 years ago

I’ve deleted the Book/Issue because I need to move forward with the production load. Here’s the reference from the admin database: http://admin.uf.digital.flvc.org/packages/83859

The parent newspaper object is still there so you can confirm that it is a newspaper object.

Lydia
