IUBLibTech / newton_chymistry

New version of 'The Chymistry of Isaac Newton', using XProc pipelines to generate a website based on TEI XML encodings of Newton's alchemical manuscripts, and Apache Solr as a search engine.
2 stars 0 forks source link

SiteImprove Accessibility Screening Causing Newton and other Library Apps to Crash #143

Closed mdalmau closed 5 months ago

mdalmau commented 6 months ago

@randalldfloyd : I can't find the email or maybe Slack (I searched both) in which you describe what's happening with the XProc structure and SiteImprove that causes Newton and library apps using shared Newton services to crash. When you get a moment, could you document that here, please? Thanks!

randalldfloyd commented 6 months ago

Overview

There's a bug handling the results pagination that causes the page number links to recursively contain the page number parameters of every page previously visited.

Why is this a problem?

Because the parameter list continues to grow as pagination links are followed, it makes them look like links to new unique pages even though they are the same ones likely already visited. To a real user this has no effect, but to a crawler they are considered pages that have not been crawled yet, so it tries to follow them all and gets into a never-ending loop.

For example, if you're on page 2, the links rendered for pages 3 and 4 include page=2, so they become:

https://newton.dlib.indiana.edu/search/?page=3&page=2
https://newton.dlib.indiana.edu/search/?page=4&page=2

Now when you go to page 3, the link rendered back to page 1 now includes all of the previously visited pages :

https://newton.dlib.indiana.edu/search/?page=1&page=3&page=2

And so on and so on...

Examples:

The Siteimprove crawler had to be stopped because it got stuck continuously crawling the pagination links:

/search/?page=4&page=3&page=2&page=4&page=2&page=3&page=2&page=1&page=2

The Semrush crawler had to be denied in a robots.txt because it was really deep into recursive page number links:

page=1&page=2&page=2&page=2&page=2&page=2&page=2&page=2&page=2&page=2&page=2&page=2
&page=2&page=2&page=2&page=2&page=2&page=2&page=2&page=2&page=2&page=2&page=2&page=2
&page=2&page=2&page=2&page=2&page=2&page=2&page=2&page=2&page=2&page=2&page=2&page=2
&page=3&page=3&page=3&page=3&page=3&page=3&page=3&page=3&page=3&page=3&page=3&page=3
&page=3&page=3&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4
&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4
&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4
&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4
&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4&page=4 
tubesoft commented 6 months ago

It needs more investigation, but I found the code dealing with pagination. solr-response-to-html.xsl line 95-157:

    <xsl:template name="render-pagination-links">
        <xsl:variable name="current-page" select="
            xs:integer(
                (
                    $request/c:param[@name='page']/@value,
                    1
                )[1]
            )
        "/>
        <xsl:variable name="last-page" select="
            xs:integer(
                1 + ($response/f:map/f:map[@key='response']/f:number[@key='numFound'] - 1) idiv $default-results-limit
            )
        "/>
        <xsl:variable name="search-field-url-parameters" select="
            string-join(
                (
                    for $parameter
                    in $request/c:param
                        [normalize-space(@value)]
                    return concat(
                        encode-for-uri($parameter/@name), '=', encode-for-uri($parameter/@value)
                    )
                ),
                '&amp;'
            )
        "/>
        <xsl:if test="$last-page &gt; 1">
            <!-- there are multiple pages of results -->
            <nav class="pagination">
                <span>Page: </span>
                <xsl:if test="$current-page &gt; 1">
                    <xsl:for-each select="1 to $current-page - 1">
                        <a class="page" href="{
                            string-join(
                                (
                                    concat($search-base-url, '?page=', .),
                                    $search-field-url-parameters
                                ),
                                '&amp;'
                            )
                        }"><xsl:value-of select="."/></a>
                        <xsl:text> </xsl:text>
                    </xsl:for-each>
                </xsl:if>
                <span class="page"><xsl:value-of select="$current-page"/></span>
                <xsl:if test="$current-page &lt; $last-page">
                    <xsl:for-each select="$current-page + 1 to $last-page">
                        <xsl:text> </xsl:text>
                        <a class="page" href="{
                            string-join(
                                (
                                    concat($search-base-url, '?page=', .),
                                    $search-field-url-parameters
                                ),
                                '&amp;'
                            )
                        }"><xsl:value-of select="."/></a>
                    </xsl:for-each>
                </xsl:if>
            </nav>
        </xsl:if>
    </xsl:template>
tubesoft commented 6 months ago

@mdalmau @randalldfloyd It is especially due to this part:

            <nav class="pagination">
                <span>Page: </span>
                <xsl:if test="$current-page &gt; 1">
                    <xsl:for-each select="1 to $current-page - 1">
                        <a class="page" href="{
                            string-join(
                                (
                                    concat($search-base-url, '?page=', .),
                                    $search-field-url-parameters
                                ),
                                '&amp;'
                            )
                        }"><xsl:value-of select="."/></a>
                        <xsl:text> </xsl:text>
                    </xsl:for-each>
                </xsl:if>
                <span class="page"><xsl:value-of select="$current-page"/></span>
                <xsl:if test="$current-page &lt; $last-page">
                    <xsl:for-each select="$current-page + 1 to $last-page">
                        <xsl:text> </xsl:text>
                        <a class="page" href="{
                            string-join(
                                (
                                    concat($search-base-url, '?page=', .),
                                    $search-field-url-parameters
                                ),
                                '&amp;'
                            )
                        }"><xsl:value-of select="."/></a>
                    </xsl:for-each>
                </xsl:if>
            </nav>

More specifically,

<a class="page" href="{
        string-join(
            (
                concat($search-base-url, '?page=', .),
                $search-field-url-parameters
            ),
            '&amp;'
        )
    }"><xsl:value-of select="."/></a>

This code is making a problematic URL. I changed like below in my local environment, and it seems to behave okay, though it is necessary to test more.

<xsl:for-each select="1 to $current-page - 1">
    <a class="page" href="{
        string-join(
            concat($search-base-url, '?page=', .),
            '&amp;'
        )
    }"><xsl:value-of select="."/></a>
tubesoft commented 5 months ago

@mdalmau @randalldfloyd Can I include the change above in the next pull request? Other issues of the SiteImprove requirements are fixed and almost ready to issue a pull request.

randalldfloyd commented 5 months ago

@tubesoft Yeah that seems reasonable. When merged, it will get deployed to the devel instance where it can be tested more with all of the documents in place.

mdalmau commented 5 months ago

@tubesoft OMG you may have fixed this! WOW!! Since this issue is now part of PR https://github.com/IUBLibTech/newton_chymistry/pull/146, I will mark this issue as "done" so I can better see the remaining accessibility issues! But we will really need to test and I don't think we can until we actually push to production, right @randalldfloyd ?