FLVC / offline-ingest

A rubydora application to do digitool migrations, and eventually affiliate-submitted ingests, into floridora

TOC that might need nested DIVs; TOC not linking to pages #25

Closed lydiam closed 7 months ago

lydiam commented 9 years ago

The following are PIDs of some UF books where the TOC file doesn't link correctly to pages. It looks like the mets.xml structMap and TOC should have nested DIVs but do not.

Examples: uf:57583, uf:58136

grf commented 9 years ago

On Mon, Jun 8, 2015 at 11:32 AM, Lydia Motyka notifications@github.com wrote:

The following are PIDs of some UF books where the TOC file doesn't link correctly to pages. It looks like the mets.xml structMap and TOC should have nested DIVs but do not.

Examples: uf:57583, uf:58136

I am unable to repeat your problem.

Please do the smallest step-by-step list of how to reproduce your error: what you click on, what you see, and what you think you should.

Thanks.

lydiam commented 9 years ago

Specifics for uf:57583:

This is a complex problem and the GH issue was intended as a placeholder as the problem is investigated. My guess is that repeating ranges of pages in the structMap/TOC has something to do with the problem, but I'm not certain of the cause. Other examples will be added.

grf commented 9 years ago

On Mon, Jun 8, 2015 at 1:39 PM, Lydia Motyka notifications@github.com wrote:

  • Click on Table of Contents Tab
  • Open the Methods node and note that it lists pages 3-5
  • Open the following 2 nodes, "--Habitat Classification" and "--Rarefaction" and note that pages 3-5 are repeated.

Check, repeat all that (I could argue this is a feature, but let's not go there....)

  • Click on p.5 under "--Rarefaction" and note that you're taken to p.4.

Nope: I'm taken to a two-up view of pages 4 and 5.

clicking on pages > 5 from the TOC takes you to the wrong page. At the moment I'm guessing that the above DIVs are where the problem starts.

Nope: I went back to the TOC page, opened up Discussion, and clicked on page 16. I was presented with a two-up view of pages 16 and 17.

That part is looking like a browser cache bug to me.

lydiam commented 9 years ago

I'm using Firefox. Natalie has had problems as well - she reported these objects, but I don't know which browser she uses.

Using Firefox again on uf:58136, in the TOC under the "Checklist..." section at the beginning, when I click on p6 I'm presented with p1-2 (although the page marker shows p8), and thereafter any page I click on from the TOC retrieves the same results. The initial click on p6 was done from my laptop, which has never accessed this object before. I get the exact same behavior using Chrome, again from my laptop.

Why would there be a browser caching problem with some objects and not others, and how can a browser from the start "cache" the wrong page?

grf commented 9 years ago

On Mon, Jun 8, 2015 at 2:34 PM, Lydia Motyka notifications@github.com wrote:

Why would there be a browser caching problem with some objects and not others, and how can a browser from the start "cache" the wrong page?

TBD - just a guess. Let's take this offline.

lydiam commented 9 years ago

Another example: uf:60409. Clicking on p10 from the TOC using Chrome displays pages 4-5.

grf commented 9 years ago

On Mon, Jun 8, 2015 at 3:16 PM, Lydia Motyka notifications@github.com wrote:

Another example: uf:60409. Clicking on p10 from the TOC using Chrome displays pages 4-5.

This one I can repeat and it is clearly a bug. Thanks for turning up this example.

lydiam commented 9 years ago

A snippet of the JSON TOC for uf:58136. Should the "pagenum" for so many different pages be "4"? The METS structMap doesn't repeat the same FILEID in those divs.

"title": "Checklist of the Woody Cultivated Plants of Florida", "type": "chapter", "level": 1, "pagenum": 4 },
{ "title": "1", "type": "page", "level": 2, "pagenum": 4 },
{ "title": "2", "type": "page", "level": 2, "pagenum": 4 },
{ "title": "3", "type": "page", "level": 2, "pagenum": 4 },
{ "title": "4", "type": "page", "level": 2, "pagenum": 4 },
{ "title": "5", "type": "page", "level": 2, "pagenum": 4 },
{ "title": "6", "type": "page", "level": 2, "pagenum": 4 },
{ "title": "7", "type": "page", "level": 2, "pagenum": 4 },
{ "title": "8", "type": "page", "level": 2, "pagenum": 4 },

grf commented 9 years ago

On Mon, Jun 8, 2015 at 3:30 PM, Lydia Motyka notifications@github.com wrote:

A snippet of the JSON TOC for uf:58136. Should the "pagenum" for so many different pages be "4"? The METS structMap doesn't repeat the same FILEID in those divs.

Absolutely not, if they're of type page. So there's a problem interpreting the DIVs but let's wait on this until I'm done with current work. I'll revisit the structmap parsing code.
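A quick way to screen a TOC file for this symptom is to look for differently titled "page" entries that share one "pagenum". The following is only a minimal sketch: the function name is made up for illustration, and the sample entries are copied from the uf:58136 snippet quoted in this thread (first three pages only).

```python
import json
from collections import defaultdict

def repeated_page_targets(toc_entries):
    """Return {pagenum: [titles]} where distinct 'page' titles share a pagenum."""
    by_pagenum = defaultdict(list)
    for entry in toc_entries:
        if entry.get("type") != "page":
            continue  # chapters may legitimately share a pagenum with their first page
        titles = by_pagenum[entry["pagenum"]]
        if entry["title"] not in titles:
            titles.append(entry["title"])
    return {num: titles for num, titles in by_pagenum.items() if len(titles) > 1}

# Entries taken from the uf:58136 snippet (abbreviated).
sample = json.loads("""
[
  {"title": "Checklist of the Woody Cultivated Plants of Florida", "type": "chapter", "level": 1, "pagenum": 4},
  {"title": "1", "type": "page", "level": 2, "pagenum": 4},
  {"title": "2", "type": "page", "level": 2, "pagenum": 4},
  {"title": "3", "type": "page", "level": 2, "pagenum": 4}
]
""")

print(repeated_page_targets(sample))  # {4: ['1', '2', '3']}
```

A page that is listed twice under the same title (for example once under a chapter and once under a subchapter) is not flagged, since that case can be legitimate.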

lydiam commented 9 years ago

Another example of a TOC pointing incorrectly and repeatedly at the same "pagenum": object uf:60939.

A snippet from the TOC file that causes problems in the display:

{ "title": "6", "type": "page", "level": 2, "pagenum": 19 },
{ "title": "7", "type": "page", "level": 2, "pagenum": 19 },
{ "title": "8", "type": "page", "level": 2, "pagenum": 19 },
{ "title": "9", "type": "page", "level": 2, "pagenum": 19 },
{ "title": "10", "type": "page", "level": 2, "pagenum": 19 },
{ "title": "11", "type": "page", "level": 2, "pagenum": 19 },
{ "title": "12", "type": "page", "level": 2, "pagenum": 19 },

An earlier snippet from the file that seems to display more or less fine:

"title": "Introduction", "type": "chapter", "level": 1, "pagenum": 14 },
{ "title": "1", "type": "page", "level": 2, "pagenum": 14 },
{ "title": "2", "type": "page", "level": 2, "pagenum": 14 },
{ "title": "3", "type": "page", "level": 2, "pagenum": 14 },
{ "title": "--Location", "type": "chapter", "level": 1, "pagenum": 14 },
{ "title": "1", "type": "page", "level": 2, "pagenum": 14

wrandtkeflvc commented 9 years ago

A. The main table of contents linking issue is caused by page numbers repeating in a table of contents. It occurs in the following 3 situations:

1) Table of contents lists all pages in a chapter, then separately lists all pages in a subchapter, hence duplicating page numbers:

When a book has a chapter heading followed by all the pages in that chapter, then a subchapter heading followed by all the pages in that subchapter, a given page number occurs in the chapter and recurs in the subchapter with other page numbers in between, and all the pages in between link to that earlier page. For example, see https://uf.digital.flvc.org/islandora/object/uf%3A46102/toc (Digitool PID 894304). Click open the chapter "Systematic Account", then open the subchapters under it: "Key to Tribes of Ichneumoninae of Florida and Neighboring States", "Tribe Protichneumonini", "Genus Protichneumon Thomson", etc. If you mouse over the links for individual pages and look at the destination along the bottom of your browser, you will see a great many pages linking incorrectly to a view of an earlier page, because that earlier page number repeats later in the table of contents.

In the METS standard, the definition of the File Pointer attribute FILEID (IDREF/O) states that “A <fptr> element should only have a FILEID attribute value if it does not have a child <area>, <par> or <seq> element. If it has a child element, then the responsibility for pointing to the relevant content falls to this child element or its descendants.” That implies that subchapters should be nested; the standard does not anticipate listing them sequentially after the chapter and at the same level, which is how the problem records were set up. So this can be viewed as a metadata problem, and not something one could anticipate just from reading the METS standard.
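To make the nested layout concrete, here is a small sketch of a structMap in which the subchapter is a child of its chapter, walked recursively so that each entry's level comes from its depth. The fragment is hypothetical (its labels echo the "Methods" and "Rarefaction" nodes mentioned earlier in this thread), and the `walk` helper is an illustration only, not the loader's actual code.

```python
import xml.etree.ElementTree as ET

METS_NS = "{http://www.loc.gov/METS/}"

# Hypothetical fragment: the subchapter is nested inside its chapter, so
# every page (and every FILEID) is listed exactly once.
NESTED = """
<METS:structMap xmlns:METS="http://www.loc.gov/METS/">
  <METS:div LABEL="Methods" TYPE="section">
    <METS:div LABEL="3" TYPE="page"><METS:fptr FILEID="FID3"/></METS:div>
    <METS:div LABEL="Rarefaction" TYPE="section">
      <METS:div LABEL="4" TYPE="page"><METS:fptr FILEID="FID4"/></METS:div>
      <METS:div LABEL="5" TYPE="page"><METS:fptr FILEID="FID5"/></METS:div>
    </METS:div>
  </METS:div>
</METS:structMap>
"""

def walk(div, level=1):
    """Yield (level, type, label, fileid) for a div and all of its descendants."""
    fptr = div.find(METS_NS + "fptr")  # direct child only
    fileid = fptr.get("FILEID") if fptr is not None else None
    yield (level, div.get("TYPE"), div.get("LABEL"), fileid)
    for child in div.findall(METS_NS + "div"):
        yield from walk(child, level + 1)

root = ET.fromstring(NESTED)
entries = [entry for top in root.findall(METS_NS + "div") for entry in walk(top)]
for entry in entries:
    print(entry)
```

With this layout no page number has to repeat, so the ambiguity described above does not arise.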

2) Table of Contents lists all the pages, and also separately lists illustrations or tables, meaning page numbers repeat in the table of contents:

For example, https://uf.digital.flvc.org/islandora/object/uf%3A98003/toc (Digitool PID 903518) has table of contents headings for "Figures" and "Tables" near the beginning, which link all over the book wherever there is a figure or table. Then the table of contents lists Chapter 1, which starts with page 1. If you mouse over links in the "Figures" and "Tables" sections, you will see every link goes to page 1, because page 1 is repeated later in the chapter.

This is because the table of contents is logical. The METS structmap doesn't label it logical versus physical.

Bonus problem! (not a table of contents issue): If the appendix is at the beginning like this, then the pages are also out of order in the page turner, because all pages with figures or tables show up in the table of contents first. That happens with this one. But, it's a separate problem from the incorrect links in the table of contents. A fix to the out of order pages would be to put in guidelines to participating libraries requiring a physical structmap be provided. For an object that comes in with no physical structmap at all and the logical structmap isn't labeled as logical, there is no way other than manual review to screen this problem.

3) Typos in the table of contents:

Occasionally there is a typo in a table of contents where the wrong number appears. For example, in https://uf.digital.flvc.org/islandora/object/uf%3A49429 (Digitool PID 934818), in the chapter on "Miami Canal at Palmetto bypass, near Hialeah, Fla.", pages 275 and 276 are erroneously labeled 175 and 176.

This occurred in about 1% of the audited books, where a typo resulted in misdirected table of contents links but no additional problems (i.e. where a low-numbered page was listed later in the book due to a typo).

Bonus problem!: You might wonder, what about a typo where a high-numbered page occurs early in the book? When that happens, pages are out of order, and it's more severe than just a mislinked table of contents. Typos resulting in pages being out of order occurred in about 2% of audited books, but this is a bad-metadata problem. Flagging those programmatically (for example, by looking at all numeric sequences and flagging for manual review the ones where pages aren't in order) might be a separate issue to consider.
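The screening idea above could look roughly like the sketch below, which flags every place where a purely numeric page label is lower than the numeric label before it. The function name is invented for illustration; the labels 175 and 176 come from the uf:49429 example, while the neighboring labels 273, 274 and 277 are assumed for the sake of the demonstration.

```python
def out_of_order_labels(labels):
    """Return (index, label, previous_numeric_label) wherever a numeric label drops."""
    flagged = []
    previous = None
    for index, label in enumerate(labels):
        if not (label.isascii() and label.isdigit()):
            continue  # skip non-numeric labels such as "cover3" or "30a"
        value = int(label)
        if previous is not None and value < previous:
            flagged.append((index, label, str(previous)))
        previous = value
    return flagged

# 175 and 176 are the mislabeled pages; the surrounding labels are assumed.
labels = ["273", "274", "175", "176", "277"]
print(out_of_order_labels(labels))  # [(2, '175', '274')]
```

Books whose numbering legitimately restarts (for example after front matter) would also be flagged, so the output is a list for manual review rather than a list of confirmed errors.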

B. There can also be a table of contents linking problem when there are multiple structmaps.

The METS file has multiple structmaps:

For example, https://uf.digital.flvc.org/islandora/object/uf%3A98435 (Digitool PID 909039). Here, Digitool has a table of contents view and a page turner view. The Islandora object has taken page order from the page turner, and taken table of contents linking from the table of contents view, but not accounted for that in the linking. This is a logic problem: Links should be matched on object, but were instead put in as if the object order came from the table of contents.

Because structmaps aren't labeled physical versus logical, it might be good to require contributing libraries to label each structmap as physical or logical. As of today the parser has to guess. Here is a complete list of the objects this affects (complete within the approx. 1,400 UF PALMM books, but not complete across FLVC collections):

  • https://uf.digital.flvc.org/islandora/object/uf%3A98435 (Digitool PID 909039)
  • https://uf.digital.flvc.org/islandora/object/uf%3A59281 (Digitool PID 909351)
  • https://uf.digital.flvc.org/islandora/object/uf%3A59965 (Digitool PID 913947) (but this one has some underlying metadata problems with the METS, and not just this problem)
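If such a labeling requirement were adopted, incoming METS files could be screened with something like the sketch below. It relies on the optional TYPE attribute that METS defines on structMap; the accepted values "physical" and "logical", the function name, and the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

METS_NS = "{http://www.loc.gov/METS/}"
ACCEPTED = {"physical", "logical"}  # assumed policy, not an existing loader rule

def unlabeled_structmaps(mets_xml):
    """Return (position, TYPE) for each structMap not labeled physical or logical."""
    root = ET.fromstring(mets_xml)
    problems = []
    for position, smap in enumerate(root.iter(METS_NS + "structMap"), start=1):
        smap_type = (smap.get("TYPE") or "").lower()
        if smap_type not in ACCEPTED:
            problems.append((position, smap.get("TYPE")))
    return problems

# Hypothetical METS skeleton with three structMaps.
SAMPLE = """
<METS:mets xmlns:METS="http://www.loc.gov/METS/">
  <METS:structMap TYPE="JPEG"><METS:div/></METS:structMap>
  <METS:structMap TYPE="physical"><METS:div/></METS:structMap>
  <METS:structMap><METS:div/></METS:structMap>
</METS:mets>
"""

print(unlabeled_structmaps(SAMPLE))  # [(1, 'JPEG'), (3, None)]
```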

lydiam commented 9 years ago

The following is my best understanding of how the book loading (and possibly newspaper issue loader as well) program processes the METS for TOC files. (Randy, please correct any misunderstandings):

  1. That the books loader uses the METS structMap to both determine which page objects will be loaded and the order of the pages in the book in the Islandora book object. The same METS structMap is used to create the JSON TOC file (see http://wiki.fcla.edu/wiki/index.php/DL:Islandora_DT_2_IA_Book_Reader_Spec for Caitlin's notes on METS to JSON)
  2. The order of the pages in the Islandora book object is the order used by the IA Bookreader in the View/page-turner display.
  3. In cases where there is more than one structMap, the books loader uses the "longest" structMap. (This assumption is based on the load warning message "Multiple structMaps found in METS file, discarding the shortest (least number of referenced files).") This appears to be the only logic currently in place for selecting which structMap to use.

grf commented 9 years ago

On Mon, Aug 31, 2015 at 12:25 PM, Lydia Motyka notifications@github.com wrote:

  1. That the books loader uses the METS structMap to both determine which page objects will be loaded and the order of the pages in the book in the Islandora book object. The same METS structMap is used to create the JSON TOC file (see http://wiki.fcla.edu/wiki/index.php/DL:Islandora_DT_2_IA_Book_Reader_Spec for Caitlin's notes on METS to JSON)

Yes, only the structmap is used for both sequencing and logical structure.

  2. The order of the pages in the Islandora book object is the order used by the IA Bookreader in the View/page-turner display.

Yes, though Gail would have to speak to any subtleties.

  3. In cases where there is more than one structMap, the books loader uses the "longest" structMap. (This assumption is based on the load warning message "Multiple structMaps found in METS file, discarding the shortest (least number of referenced files).") This appears to be the only logic currently in place for selecting which structMap to use.

Well, there is a ranking, so for instance, with two otherwise identical structmaps, one of PDFs and the other of images, the latter structmap will be selected.
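The "discard the shortest" rule from the warning message can be pictured with the sketch below. It is only an illustration under assumptions: it counts distinct FILEIDs per structMap, it does not model the ranking mentioned above, and the function names and sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = "{http://www.loc.gov/METS/}"

def referenced_files(structmap):
    """Set of distinct FILEIDs referenced by fptr elements in one structMap."""
    return {f.get("FILEID") for f in structmap.iter(METS_NS + "fptr") if f.get("FILEID")}

def pick_structmap(mets_xml):
    """Return the structMap that references the most distinct files."""
    root = ET.fromstring(mets_xml)
    structmaps = list(root.iter(METS_NS + "structMap"))
    # max() keeps the first structMap on a tie; the loader's real tie-break
    # (the ranking by file kind) is not modeled here.
    return max(structmaps, key=lambda s: len(referenced_files(s)))

# Hypothetical METS skeleton: one PDF structMap and one page-image structMap.
SAMPLE = """
<METS:mets xmlns:METS="http://www.loc.gov/METS/">
  <METS:structMap TYPE="PDF">
    <METS:div><METS:fptr FILEID="FIDPDF1"/></METS:div>
  </METS:structMap>
  <METS:structMap TYPE="JPEG">
    <METS:div>
      <METS:div><METS:fptr FILEID="FID1"/></METS:div>
      <METS:div><METS:fptr FILEID="FID2"/></METS:div>
    </METS:div>
  </METS:structMap>
</METS:mets>
"""

print(pick_structmap(SAMPLE).get("TYPE"))  # JPEG
```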

lydiam commented 9 years ago

More details on case B. examples:

The pertinent portion of the structMap:

  <METS:div LABEL="Floridas Prohibited Aquatic Plants" TYPE="section">
    <METS:div LABEL="5" TYPE="page">
      <METS:fptr FILEID="FID6"/>
    </METS:div>
    <METS:div LABEL="6" TYPE="page">
      <METS:fptr FILEID="FID7"/>
    </METS:div>
    <METS:div LABEL="7" TYPE="page">
      <METS:fptr FILEID="FID8"/>
    </METS:div>
    <METS:div LABEL="8" TYPE="page">
      <METS:fptr FILEID="FID9"/>
    </METS:div>
    <METS:div LABEL="9" TYPE="page">
      <METS:fptr FILEID="FID10"/>
    </METS:div>
    <METS:div LABEL="10" TYPE="page">
      <METS:fptr FILEID="FID11"/>
    </METS:div>
  </METS:div>

  <METS:div LABEL="Glossary" TYPE="section">
    <METS:div LABEL="11" TYPE="page">
      <METS:fptr FILEID="FID12"/>
    </METS:div>

However, later in the structMap references to the FILEID FID16, FID17, etc. are repeated out of sequence.

First reference to these files in the METS TYPE=JPEG structMap:

  <METS:div LABEL="Categories of Aquatic Plants" TYPE="section">
    <METS:div LABEL="14" TYPE="page">
      <METS:fptr FILEID="FID15"/>
    </METS:div>
    <METS:div LABEL="15" TYPE="page">
      <METS:fptr FILEID="FID16"/>
    </METS:div>
    <METS:div LABEL="16" TYPE="page">
      <METS:fptr FILEID="FID17"/>
    </METS:div>
    <METS:div LABEL="17" TYPE="page">
      <METS:fptr FILEID="FID18"/>
    </METS:div>

Second reference to these files in the METS TYPE=JPEG structMap:

  <METS:div LABEL="Araceae" TYPE="section">
    <METS:div LABEL="15" TYPE="page">
      <METS:fptr FILEID="FID16"/>
    </METS:div>
    <METS:div LABEL="16" TYPE="page">
      <METS:fptr FILEID="FID17"/>
    </METS:div>
    <METS:div LABEL="20" TYPE="page">
      <METS:fptr FILEID="FID21"/>
    </METS:div>
    <METS:div LABEL="21" TYPE="page">
      <METS:fptr FILEID="FID22"/>

The corresponding sections of the JSON file:

{ "level": 1, "type": "chapter", "title": "Categories of Aquatic Plants", "pagenum": 10 },
{ "level": 2, "type": "page", "title": "14", "pagenum": 10 },
{ "level": 2, "type": "page", "title": "15", "pagenum": 11 },
{ "level": 2, "type": "page", "title": "16", "pagenum": 11 },
{ "level": 2, "type": "page", "title": "17", "pagenum": 11 },

and

{
  "level": 1,
  "type": "chapter",
  "title": "Araceae",
  "pagenum": 11
},
{
  "level": 2,
  "type": "page",
  "title": "15",
  "pagenum": 11
},
{
  "level": 2,
  "type": "page",
  "title": "16",
  "pagenum": 12
},
{
  "level": 2,
  "type": "page",
  "title": "20",
  "pagenum": 13
},
{
  "level": 2,
  "type": "page",
  "title": "21",
  "pagenum": 14
},
{
  "level": 2,
  "type": "page",
  "title": "22",
  "pagenum": 15
},

So in this case it appears that duplication of FILEID references in the structMap, out of sequence, affects the entire JSON file, even in sections where the corresponding structMap is entirely correct.

If there's a routine we can use to regenerate the JSON file from the mets.xml file it's probably worth trying, to determine if a second try gives a better result.
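One way to picture "matching links on the object" is to give each distinct FILEID a single sequence number the first time it appears, and then look every TOC page entry up by its FILEID instead of by a running counter. The sketch below assumes that approach; it is not the loader's actual algorithm, the function name is made up, and the page numbers it prints are relative to this abbreviated fragment rather than to the real book.

```python
import xml.etree.ElementTree as ET

METS_NS = "{http://www.loc.gov/METS/}"

# Abbreviated from the two structMap fragments quoted in this comment:
# FID16 and FID17 are referenced under both sections.
FRAGMENT = """
<METS:structMap xmlns:METS="http://www.loc.gov/METS/">
  <METS:div LABEL="Categories of Aquatic Plants" TYPE="section">
    <METS:div LABEL="14" TYPE="page"><METS:fptr FILEID="FID15"/></METS:div>
    <METS:div LABEL="15" TYPE="page"><METS:fptr FILEID="FID16"/></METS:div>
    <METS:div LABEL="16" TYPE="page"><METS:fptr FILEID="FID17"/></METS:div>
  </METS:div>
  <METS:div LABEL="Araceae" TYPE="section">
    <METS:div LABEL="15" TYPE="page"><METS:fptr FILEID="FID16"/></METS:div>
    <METS:div LABEL="16" TYPE="page"><METS:fptr FILEID="FID17"/></METS:div>
    <METS:div LABEL="20" TYPE="page"><METS:fptr FILEID="FID21"/></METS:div>
  </METS:div>
</METS:structMap>
"""

def toc_with_stable_pagenums(structmap):
    """Return (section, page label, pagenum) tuples keyed on FILEID."""
    # Pass 1: each distinct FILEID gets one sequence number, at first mention.
    sequence = {}
    for fptr in structmap.iter(METS_NS + "fptr"):
        sequence.setdefault(fptr.get("FILEID"), len(sequence) + 1)
    # Pass 2: every page entry looks its pagenum up by FILEID.
    toc = []
    for section in structmap.findall(METS_NS + "div"):
        for page in section.findall(METS_NS + "div"):
            fileid = page.find(METS_NS + "fptr").get("FILEID")
            toc.append((section.get("LABEL"), page.get("LABEL"), sequence[fileid]))
    return toc

toc = toc_with_stable_pagenums(ET.fromstring(FRAGMENT))
for row in toc:
    print(row)
```

In this sketch pages "15" and "16" resolve to the same pagenum under both sections, and the repeated references do not shift the numbering of the pages that follow.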

lydiam commented 9 years ago

The second example from section B above:

In https://uf.digital.flvc.org/islandora/object/uf%3A59281, FILEIDs are repeated in an "Illustrations" section at the end of the TYPE=JPEG structMap:

  <METS:div LABEL="Back cover" TYPE="section">
    <METS:div LABEL="cover3" TYPE="page">
      <METS:fptr FILEID="FID312"/>
    </METS:div>
    <METS:div LABEL="cover4" TYPE="page">
      <METS:fptr FILEID="FID313"/>
    </METS:div>
  </METS:div>
  <METS:div LABEL="Illustrations" TYPE="section">
    <METS:div LABEL="30a" TYPE="page">
      <METS:fptr FILEID="FID39"/>
    </METS:div>
    <METS:div LABEL="30b" TYPE="page">
      <METS:fptr FILEID="FID40"/>
    </METS:div>
    <METS:div LABEL="58a" TYPE="page">
      <METS:fptr FILEID="FID314"/>
    </METS:div>
    <METS:div LABEL="58b" TYPE="page">
      <METS:fptr FILEID="FID315"/>
    </METS:div>
    <METS:div LABEL="72a" TYPE="page">
      <METS:fptr FILEID="FID83"/>
    </METS:div>
    <METS:div LABEL="72b" TYPE="page">
      <METS:fptr FILEID="FID84"/>
    </METS:div>
    <METS:div LABEL="82a" TYPE="page">
      <METS:fptr FILEID="FID95"/>
    </METS:div>

The JSON file seems to start going wrong at the _first_ reference to the repeated pages:

{ "title": "30a", "type": "page", "level": 2, "pagenum": 39 },
{ "title": "30b", "type": "page", "level": 2, "pagenum": 39 },
{ "title": "31", "type": "page", "level": 2, "pagenum": 39 },
{ "title": "32", "type": "page", "level": 2, "pagenum": 40 },

The second references to those same pages seem to be fine in the JSON file:

{ "title": "Illustrations", "type": "chapter", "level": 1, "pagenum": 302 },
{ "title": "30a", "type": "page", "level": 2, "pagenum": 302 },
{ "title": "30b", "type": "page", "level": 2, "pagenum": 303 },
{ "title": "58a", "type": "page", "level": 2, "pagenum": 304 },
{ "title": "58b", "type": "page", "level": 2, "pagenum": 305 },
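Objects with this pattern could be found ahead of time by listing the FILEIDs that a structMap references more than once. The following is a rough sketch: the function name is invented, and the sample is a shortened, partly hypothetical version of the uf:59281 structMap (the "Body" section and FID41 are made up to give the repeated files a first reference).

```python
import xml.etree.ElementTree as ET
from collections import Counter

METS_NS = "{http://www.loc.gov/METS/}"

def repeated_fileids(structmap_xml):
    """Return the sorted FILEIDs that are referenced by more than one fptr."""
    root = ET.fromstring(structmap_xml)
    counts = Counter(f.get("FILEID") for f in root.iter(METS_NS + "fptr"))
    return sorted(fid for fid, n in counts.items() if n > 1)

# Shortened sample: FID39 and FID40 appear in the body and again under
# "Illustrations"; the "Body" section and FID41 are hypothetical.
SAMPLE = """
<METS:structMap xmlns:METS="http://www.loc.gov/METS/">
  <METS:div LABEL="Body" TYPE="section">
    <METS:div LABEL="30a" TYPE="page"><METS:fptr FILEID="FID39"/></METS:div>
    <METS:div LABEL="30b" TYPE="page"><METS:fptr FILEID="FID40"/></METS:div>
    <METS:div LABEL="31" TYPE="page"><METS:fptr FILEID="FID41"/></METS:div>
  </METS:div>
  <METS:div LABEL="Illustrations" TYPE="section">
    <METS:div LABEL="30a" TYPE="page"><METS:fptr FILEID="FID39"/></METS:div>
    <METS:div LABEL="30b" TYPE="page"><METS:fptr FILEID="FID40"/></METS:div>
    <METS:div LABEL="58a" TYPE="page"><METS:fptr FILEID="FID314"/></METS:div>
  </METS:div>
</METS:structMap>
"""

print(repeated_fileids(SAMPLE))  # ['FID39', 'FID40']
```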