HistoryAtState / hsg-shell

Source code for the history.state.gov website
https://history.state.gov

Improve parsing of frus-history index cross references #352

Open joewiz opened 4 years ago

joewiz commented 4 years ago

In the frus-history index (see the source TEI), range-based cross references are encoded in a distinctive way. hsg-shell's ODD doesn't parse them the way our pre-TEI Publisher site did, and this seems to be causing server errors.

Here is a sample encoded cross reference:

<item>
    <term>Aandahl, Fredrick</term>, <ref target="#range(b_446-start,b_446-end)"
        >196–199</ref>, <ref target="#b_447">203</ref>, <ref target="#b_448"
        >206</ref>
</item>

The syntax used in the first of these @target attributes is based on the TEI Guidelines' support for XPointer; I only use the range pointer scheme. Specifically, the cross reference points to the range between two <anchor> elements with @xml:id attributes in the body of the book:

  1. Line 11548
    <anchor xml:id="b_446-start" corresp="#b_446-end"/>
  2. Line 11711
    <anchor xml:id="b_446-end" corresp="#b_446-start"/>
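
As an illustration of the range pointer syntax described above, here is a minimal sketch of splitting such a @target value into its two anchor IDs. The function name `local:parse-range-target` is hypothetical and not part of hsg-shell; the sketch assumes the pointer always has exactly the form `#range(start,end)`.

```xquery
xquery version "3.1";

(: Hypothetical helper, not part of hsg-shell: splits a range pointer
   such as "#range(b_446-start,b_446-end)" into its two anchor IDs.
   Returns the empty sequence if the target is not a range pointer. :)
declare function local:parse-range-target($target as xs:string) as xs:string* {
    if (matches($target, '^#range\([^,()]+,[^,()]+\)$')) then
        (: keep only the text between the parentheses :)
        let $inner := replace($target, '^#range\((.*)\)$', '$1')
        return
            tokenize($inner, ',') ! normalize-space(.)
    else
        ()
};

(: returns ("b_446-start", "b_446-end") :)
local:parse-range-target('#range(b_446-start,b_446-end)')
```

A single point reference such as `#b_447` does not match the pattern, so the function returns the empty sequence and the caller can fall through to its other branches.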

My original handling for this, on our pre-TEI Publisher-based website, was to examine where the targets were located, and replace the book's original "196–199, 203, 206" with a web-relevant description of the target section, e.g., "Ch. 8 paras 34–39, Ch. 8 para 47, Ch. 8 para 52".
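
To make the "para N" part of that description concrete, here is a minimal sketch of how an anchor's paragraph number within its chapter div could be computed. The function name `local:para-number` and the sample div are assumptions for illustration only; unlike the original code quoted later in this issue, the sketch filters out only `tei:head` and locates the block by node identity (`is`) rather than with `index-of`, which compares atomized values.

```xquery
xquery version "3.1";

declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Hypothetical helper: returns the 1-based position of the block
   containing $anchor among the non-head children of its nearest div. :)
declare function local:para-number($anchor as element(tei:anchor)) as xs:integer? {
    let $div := $anchor/ancestor::tei:div[1]
    let $blocks := $div/*[not(self::tei:head)]
    (: nearest ancestor-or-self element that is a direct child of a div :)
    let $block := $anchor/ancestor-or-self::*[parent::tei:div][1]
    return
        (for $b at $pos in $blocks where $b is $block return $pos)[1]
};

(: Made-up sample data, not taken from frus-history :)
let $doc :=
    document {
        <div xmlns="http://www.tei-c.org/ns/1.0" xml:id="chapter-8">
            <head>Chapter 8: Sample heading</head>
            <p>First paragraph.</p>
            <p>Second paragraph with <anchor xml:id="b_1"/> an anchor.</p>
        </div>
    }
return
    (: returns 2 :)
    local:para-number($doc//tei:anchor[@xml:id = 'b_1'])
```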

The Internet Archive contains a snapshot of the old rendering of the page.

"Ch. 8 paras 34–39, Ch. 8 para 47, Ch. 8 para 52" were given the URLs:

However, the current hsg site fails to parse the links correctly, generating URLs like this:

Our website performs a 302 redirect from these URLs, respectively, to:

... which appears to be a graceful recovery, but @windauer reported finding errors in the logs:

2019-12-20 10:40:09,297 [qtp731870416-10326] ERROR (DeferredFunctionCall.java [isEmpty]:203) - Exception in deferred function: not-found publication frus-history-monograph document frus-history section b_806 not found [at line 99, column 13, source: /db/apps/hsg-shell/modules/pages.xqm]
In function:
    pages:load-fallback-page(xs:string, xs:string, xs:string?) [85:13:/db/apps/hsg-shell/modules/pages.xqm]
    pages:load-xml(xs:string, xs:string, xs:string?, xs:string, xs:boolean?) [49:67:/db/apps/hsg-shell/modules/pages.xqm]
    pages:load(node(), map(*), xs:string?, xs:string?, xs:string?, xs:string, xs:boolean) [-1:-1:/db/apps/hsg-shell/modules/pages.xqm]
    templates:process-output(element(), map(*), item()*, element()) 
   ....

This error occurs about 10 times in a row, followed by:

2019-12-20 10:40:09,300 [qtp731870416-10326] WARN  (HttpChannel.java [handleException]:591) - /exist/apps/hsg-shell/historicaldocuments/frus-history/b_806 
javax.servlet.ServletException: javax.servlet.ServletException: An error occurred while processing request to /exist/apps/hsg-shell/historicaldocuments/frus-history/b_806: Committed
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:162) ~[jetty-server-9.4.24.v20191120.jar:9.4.24.v20191120]
        ...
    ... 18 more

Here is the original code I wrote to transform the links:

(: handle xpointer-style range references, as found in the frus-history, e.g.,
    index entries like: 
        <term>Washington, George</term>, <ref target="#range(b_37-start,b_37-end)">9–10</ref>
    point to:
        <anchor xml:id="b_37-start" corresp="#b_37-end"/>
    and:
        <anchor xml:id="b_37-end" corresp="#b_37-start"/>
:)
else if (starts-with($target, '#range')) then
    let $range := substring-after($target, '(')
    let $range := substring-before($range, ')')
    let $range := tokenize($range, ',')
    let $range-start := $range[1]
    let $range-end := $range[2]
    let $target-start-node := root($node)/id($range-start)
    let $target-end-node := root($node)/id($range-end)
    (: use ancestor notes to ensure linkability :)
    let $target-start-node := if ($target-start-node/ancestor::tei:note) then $target-start-node/ancestor::tei:note else $target-start-node
    let $target-end-node := if ($target-end-node/ancestor::tei:note) then $target-end-node/ancestor::tei:note else $target-end-node
    let $target-start-node-ancestor-div := $target-start-node/ancestor::tei:div[1]
    let $target-end-node-ancestor-div := $target-end-node/ancestor::tei:div[1]
    let $same-ancestor-divs := $target-start-node-ancestor-div is $target-end-node-ancestor-div
    (: use the ancestor chapter div's heading, e.g., "Chapter 9: ...", but chop off at the colon :)
    let $target-nodes := ($target-start-node, $target-end-node)
    let $target-divs := ($target-start-node-ancestor-div, $target-end-node-ancestor-div)
    let $target-node-labels := 
        let $both-notes := $target-nodes[1]/self::tei:note and $target-nodes[2]/self::tei:note
        let $one-note := $target-nodes[1]/self::tei:note or $target-nodes[2]/self::tei:note
        for $target-node at $n in $target-nodes
        let $ancestor-div-label :=
            if ($same-ancestor-divs and $n = 2) then
                ()
            else 
                string-join(functx:remove-elements-deep($target-divs[$n]/tei:head[1], 'note'), '')
        let $ancestor-div-label :=
            if (contains($ancestor-div-label, ':')) then substring-before($ancestor-div-label, ':') else $ancestor-div-label
        let $node-label :=
            if ($target-node/self::tei:note) then 
                concat(if ($n = 1 and $both-notes) then 'footnotes ' else 'footnote ', $target-node/@n)
            else
                (: paragraph-like-block-number :)
                concat(if ($one-note) then 'para ' else if ($n = 1) then 'paras ' else '', index-of($target-start-node-ancestor-div/*[not(self::tei:head)][not(self::tei:byline)][not(self::tei:p[@rend='sectiontitlebold'])], $target-node/ancestor::element()[parent::tei:div][1]))
        return
            string-join(($ancestor-div-label, $node-label), ' ')
    let $label :=
        replace(string-join($target-node-labels, '–'), 'Chapter', 'Ch.')
    let $target-node-destination-hash := 
        if ($target-start-node/self::tei:note) then
            concat('#fnref', substring-after($target-start-node/@xml:id, 'fn'))
        else
            concat('#', $range-start)
    return
        (: check to make sure the targets exist :)
        if ($target-start-node and $target-end-node) then
            element a { 
                attribute href { concat($abs-site-uri, $volume, '/', $target-start-node-ancestor-div/@xml:id, $target-node-destination-hash, $persistent-view) },
                $label 
                }
        (: display the label in case of malformed links :)
        else
            $label
(: handle single point references, as found in the frus-history, e.g.,
    index entries like:
     <term>Woodford, Stewart</term>, <ref target="#b_803">98</ref>
    point to:
     <anchor xml:id="b_803"/>
:)
else if (starts-with($target, '#b')) then
    let $url := substring-after($target, '#')
    let $target-node := root($node)/id($url)
    let $target-node := if ($target-node/ancestor::tei:note) then $target-node/ancestor::tei:note else $target-node
    let $destination-div := $target-node/ancestor::tei:div[1]
    (: use the ancestor chapter div's heading, e.g., "Chapter 9: ...", but chop off at the colon :)
    let $head := string-join(functx:remove-elements-deep($destination-div/tei:head[1], 'note'), '')
    let $target-node-label :=
        if ($target-node/self::tei:note) then 
            concat('footnote ', $target-node/@n)
        else
            concat('para ', index-of($destination-div/*[not(self::tei:head)][not(self::tei:byline)][not(self::tei:p[@rend='sectiontitlebold'])], $target-node/ancestor::element()[parent::tei:div][1]))
    let $label := replace(concat(if (contains($head, ':')) then substring-before($head, ':') else $head, ' ', $target-node-label), 'Chapter', 'Ch.')
    let $target-node-destination-hash := 
        if ($target-node/self::tei:note) then
            concat('#fnref', substring-after($target-node/@xml:id, 'fn'))
        else
            $target
    return
        if ($target-node) then 
            element a { 
                attribute href { concat($abs-site-uri, $volume, '/', $destination-div/@xml:id, $target-node-destination-hash, $persistent-view) },
                $label 
                }
        (: display the label in case of malformed links :)
        else 
            $label
else
    element a { 
        attribute href { concat($abs-site-uri, $volume, '/', substring-after($target, '#'), $persistent-view) }, 
        $type,
        render:recurse($node, $options) 
        }

We should research the logs to find the source of the error messages above, and, if needed, adapt the original link parsing code to our current ODD-based method for transforming TEI into HTML.

joewiz commented 4 years ago

@windauer also commented in chat based on his observations of the logs:

The latter part gives us a pointer into the database: /exist/apps/hsg-shell/historicaldocuments/frus-history/b_806

When opening hsg-prod-backend1.hsg:8080/exist/apps/hsg-shell/historicaldocuments/frus-history/b_806 the URL is rewritten to http://hsg-prod-backend1.hsg:8080/exist/apps/hsg-shell/historicaldocuments/frus-history/chapter-5#b_806, the page opens perfectly fine in the browser, and I see all the mentioned errors in the logs. My current suspicion is that there is a mechanism in place that tries to resolve /frus-history/b_806 in different ways, and if one fails it tries another. As I said, from an end user's point of view everything works fine, but the error messages are annoying in the log file, and if you are fine with it, Joe, I'd like to create a ticket for this one to get rid of the errors after the holidays.

joewiz commented 4 years ago

The ebook edition was produced using the old code. Here's a screenshot:

Screen Shot 2019-12-21 at 12 54 35 AM