ContentMine / norma

Convert XML/SVG/PDF into normalised, sectioned, scholarly HTML
Apache License 2.0

Removing irrelevant cruft from publisher HTML #32

Open petermr opened 8 years ago

petermr commented 8 years ago

Much of the HTML from publishers includes large amounts of material that is not relevant to the scholarly narrative. This includes:

  1. information about the journal
  2. metrics
  3. advertising
  4. links to other resources
  5. over-complex annotation

Much of this can be managed by XSLT stylesheets which "snip off" the cruft. I don't think there is a simple, general way of tackling this; it has to be a per-publisher or per-journal solution. That means we need a way of locating and using stylesheets from the command line.

Ideally we need:

  1. a means for detecting which publisher/journal has created the document
  2. a means for removing unwanted sections
  3. restructuring, e.g. turning

    <h2>title</h2>
    <p>p1</p>
    <p>p2</p>

into

    <div class="controlled_vocab" title="title">
    <p>p1</p>
    <p>p2</p>
    </div>

I propose XSLT and XPath for the first two. It's possible that XSLT can also tackle the restructuring (3); for that we'd need XSLT 2.0 with Saxon.
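
To make the restructuring in point 3 concrete, here is a minimal sketch that uses plain DOM manipulation from the JDK rather than XSLT 2.0. It is not Norma code: the class name `SectionWrapper` and the method `wrapSections` are hypothetical, and it assumes the headings and paragraphs are direct children of the root element, as in the example above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Norma.
public class SectionWrapper {

    /** Replaces each h2 with a div that holds the siblings up to the next h2. */
    public static String wrapSections(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element parent = doc.getDocumentElement();

        // Snapshot the children first, because the loop below moves nodes around.
        List<Node> children = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            children.add(n);
        }

        Element section = null;
        for (Node child : children) {
            if (child.getNodeType() == Node.ELEMENT_NODE && "h2".equals(child.getNodeName())) {
                // A heading opens a new section; its text becomes the title attribute.
                section = doc.createElement("div");
                section.setAttribute("class", "controlled_vocab");
                section.setAttribute("title", child.getTextContent().trim());
                parent.replaceChild(section, child);
            } else if (section != null) {
                // Everything after a heading moves into the current section.
                section.appendChild(child);
            }
        }

        // Serialize the modified tree back to a string without an XML declaration.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String input = "<body><h2>title</h2><p>p1</p><p>p2</p></body>";
        System.out.println(wrapSections(input));
    }
}
```

Nodes that appear before the first heading are left where they are. Nested headings (h3 inside h2 sections) would need a recursive variant or XSLT 2.0 grouping.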

petermr commented 8 years ago

Made good progress yesterday for Taylor and Francis (which is full of cruft and repeated text). Here's a brief stylesheet:

    <xsl:output method="xhtml"/>

    <!-- Identity template: copies everything; comments are stripped by the empty template below -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" />
        </xsl:copy>
    </xsl:template>

    <xsl:template match="comment()" priority="1.0"/>

<!-- delete these sections and snippets -->
    <!--  header -->
    <xsl:template match="h:div[@id='hd']"/>
    <xsl:template match="h:div[@id='cookieBanner']"/>
    <xsl:template match="h:div[@id='primarySubjects']"/>
    <xsl:template match="h:div[@id='breadcrumb']"/>
    <xsl:template match="h:div[@class='gutter' and h:div[contains(@class,'accordianPanel')]]"/>
    <xsl:template match="h:div/h:b/h:a[.='Publishing models and article dates explained']"/>
    <xsl:template match="h:div[contains(@class,'script_only')]"/>
    <xsl:template match="h:div[contains(@class,'access')]"/> 
    <xsl:template match="h:div[contains(@class,'secondarySubjects')]"/> 
    <xsl:template match="h:ul[@class='recommend']"/> 
    <xsl:template match="h:div[@id='unit3']"/> 
    <xsl:template match="h:h3[.='Related articles']"/> 
    <xsl:template match="h:a[.='View all related articles']"/> 
    <xsl:template match="h:div[contains(@class,'social')]"/> 
    <xsl:template match="h:ul[contains(@class,'tabsNav')]"/> 
    <xsl:template match="h:div[@id='siteInfo']"/> 
    <xsl:template match="h:div[contains(@class,'credits')]"/> 
    <xsl:template match="h:a[starts-with(.,'[') and ends-with(.,']')]"/> 
    <xsl:template match="h:a[.='View all references']"/> 
    <xsl:template match="h:div[normalize-space(.)='']" priority="0.51"/> 

robintw commented 8 years ago

That's great - thanks! How do I go about applying this template to the HTML? Is there a method already built into one of the ContentMine tools (e.g. norma), or do I need to do this separately?

petermr commented 8 years ago

It's built into Norma, and I think the production version works. It needs two passes: the first creates XHTML, the second strips the cruft (and a third normalizes the XHTML to formal SHTML).
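
For anyone who wants to see what the cruft-stripping pass does outside Norma, here is a rough sketch using the XSLT processor built into the JDK. It is not how Norma itself invokes the stylesheet. The class name `CruftStripper` and the sample page are made up for illustration. The stylesheet is cut down to XSLT 1.0, so it uses `method='xml'` and leaves out the `ends-with()` rule, both of which would need an XSLT 2.0 processor such as Saxon.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of Norma.
public class CruftStripper {

    // Cut-down stylesheet: identity copy plus empty templates that delete
    // comments and two kinds of cruft div. Restricted to XSLT 1.0.
    static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + "    xmlns:h='http://www.w3.org/1999/xhtml'>"
        + "  <xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "  <xsl:template match='@*|node()'>"
        + "    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "  </xsl:template>"
        + "  <xsl:template match='comment()' priority='1.0'/>"
        + "  <xsl:template match=\"h:div[@id='cookieBanner']\"/>"
        + "  <xsl:template match=\"h:div[contains(@class,'social')]\"/>"
        + "</xsl:stylesheet>";

    /** Applies the stylesheet above to an XHTML string and returns the result. */
    public static String strip(String xhtml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Invented sample page; element names must be in the XHTML namespace
        // for the h: prefixed match patterns to apply.
        String page =
              "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
            + "<div id='cookieBanner'>We use cookies</div>"
            + "<!-- tracking -->"
            + "<div class='article'><p>Scholarly text</p></div>"
            + "<div class='social share'>Tweet this</div>"
            + "</body></html>";
        System.out.println(strip(page));
    }
}
```

The match patterns only fire on elements in the XHTML namespace, which is one reason the first pass (HTML to XHTML) has to run before the stylesheet.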

petermr commented 8 years ago

Here's my test:

        // Copy the Taylor and Francis test project into a scratch directory
        File targetDir = new File("target/tutorial/tf");
        CMineTestFixtures.cleanAndCopyDir(new File("src/test/resources/org/xmlcml/norma/pubstyle/tf/TandF_OA_Test"), targetDir);
        // Pass 1: convert the raw publisher HTML to XHTML using jsoup
        String args = "--project "+targetDir+" -i fulltext.html -o fulltext.xhtml --html jsoup";
        DefaultArgProcessor argProcessor = new NormaArgProcessor(args);
        argProcessor.runAndOutput();
        CProject project = new CProject(targetDir);
        CTree ctree0 = project.getCTreeList().get(0);
        File xhtml = ctree0.getExistingFulltextXHTML();
        Assert.assertTrue("xhtml: ", xhtml.exists());
        // Pass 2: strip the cruft with the tf2html stylesheet to produce scholarly HTML
        args = "--project "+targetDir+" -i fulltext.xhtml -o scholarly.html --transform tf2html";
        argProcessor = new NormaArgProcessor(args);
        argProcessor.runAndOutput();
        File shtml = ctree0.getExistingScholarlyHTML();
        Assert.assertTrue("shtml: ", shtml.exists());

That SHOULD translate into these commands:

    norma --project . -i fulltext.html -o fulltext.xhtml --html jsoup
    norma --project . -i fulltext.xhtml -o scholarly.html --transform tf2html

You then need the updated symbol file stylesheetByName.xsl:

<stylesheetList>
  <stylesheet name="bmc2html">/org/xmlcml/norma/pubstyle/bmc/xml2html.xsl</stylesheet>
  <stylesheet name="ieee2html">/org/xmlcml/norma/pubstyle/ieee/toHtml.xsl</stylesheet>
  <stylesheet name="ncbi-jats2html">/org/xmlcml/norma/pubstyle/nlm/ncbi/jats-html.xsl</stylesheet>
  <stylesheet name="nlm2html">/org/xmlcml/norma/pubstyle/nlm/toHtml.xsl</stylesheet>
  <stylesheet name="jats2shtml">/org/xmlcml/norma/pubstyle/nlm/jats/jats2shtml.xsl</stylesheet>
  <stylesheet name="nature2html">/org/xmlcml/norma/pubstyle/nature/toHtml.xsl</stylesheet>
  <stylesheet name="hind2xml">/org/xmlcml/norma/pubstyle/hindawi/groupMajorSections.xsl</stylesheet>
  <stylesheet name="tf2html">/org/xmlcml/norma/pubstyle/tf/toHtml.xsl</stylesheet>
  <!--  patents  -->
  <stylesheet name="uspto2html">/org/xmlcml/norma/patents/uspto/toHtml.xsl</stylesheet>
</stylesheetList>

and the stylesheet toHtml.xsl itself:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:h="http://www.w3.org/1999/xhtml">

    <xsl:output method="xhtml"/>

    <xsl:template match="/">
        <xsl:apply-templates />
    </xsl:template>

    <!-- Identity template: copies everything; comments are stripped by the empty template below -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" />
        </xsl:copy>
    </xsl:template>

    <xsl:template match="comment()" priority="1.0"/>

    <!--  header -->
    <xsl:template match="h:div[@id='hd']"/>
    <xsl:template match="h:div[@id='cookieBanner']"/>
    <xsl:template match="h:div[@id='primarySubjects']"/>
    <xsl:template match="h:div[@id='breadcrumb']"/>
    <xsl:template match="h:div[@class='gutter' and h:div[contains(@class,'accordianPanel')]]"/>

    <xsl:template match="h:div/h:b/h:a[.='Publishing models and article dates explained']"/>
    <xsl:template match="h:div[contains(@class,'script_only')]"/>
    <xsl:template match="h:div[contains(@class,'access')]"/> 
    <xsl:template match="h:div[contains(@class,'secondarySubjects')]"/> 

    <xsl:template match="h:ul[@class='recommend']"/> 
    <xsl:template match="h:div[@id='unit3']"/> 
    <xsl:template match="h:h3[.='Related articles']"/> 
    <xsl:template match="h:a[.='View all related articles']"/> 
    <xsl:template match="h:div[contains(@class,'social')]"/> 
    <xsl:template match="h:ul[contains(@class,'tabsNav')]"/> 
    <xsl:template match="h:div[@id='siteInfo']"/> 
    <xsl:template match="h:div[contains(@class,'credits')]"/> 
    <xsl:template match="h:a[starts-with(.,'[') and ends-with(.,']')]"/> 
    <xsl:template match="h:a[.='View all references']"/> 
    <xsl:template match="h:a[.='figureViewerArticleInfo']"/> 
    <xsl:template match="h:div[@class='hidden']"/> 
    <xsl:template match="h:span[contains(@class,'dropDownAlt')]"/> 
    <xsl:template match="h:a[contains(@onclick,'showFigures')]"/> 
    <xsl:template match="h:div[@class='figureDownloadOptions']"/> 
    <xsl:template match="h:div[normalize-space(.)='']" priority="0.51"/> 

</xsl:stylesheet>
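
The name-to-path mapping in the stylesheetList file above can be read with a few lines of standard DOM code. The following is only a sketch of the idea and not necessarily how Norma resolves `--transform` names. The class name `StylesheetLookup` is hypothetical, and the list is inlined as a string to keep the example self-contained.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Norma.
public class StylesheetLookup {

    /** Reads a stylesheetList document into a name -> resource path map. */
    public static Map<String, String> parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList entries = doc.getElementsByTagName("stylesheet");
        Map<String, String> byName = new LinkedHashMap<>();
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            // The name attribute is the symbol; the element text is the resource path.
            byName.put(entry.getAttribute("name"), entry.getTextContent().trim());
        }
        return byName;
    }

    public static void main(String[] args) throws Exception {
        // Two entries copied from the list above.
        String list =
              "<stylesheetList>"
            + "<stylesheet name=\"nlm2html\">/org/xmlcml/norma/pubstyle/nlm/toHtml.xsl</stylesheet>"
            + "<stylesheet name=\"tf2html\">/org/xmlcml/norma/pubstyle/tf/toHtml.xsl</stylesheet>"
            + "</stylesheetList>";
        Map<String, String> byName = parse(list);
        System.out.println(byName.get("tf2html"));
        System.out.println(byName.containsKey("unknown2html"));
    }
}
```

Adding support for another publisher then comes down to adding one `stylesheet` entry and the matching `.xsl` resource.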

petermr commented 8 years ago

See https://github.com/ContentMine/norma/blob/master/docs/TRANSFORM.md. Please check whether this works and comment. The same approach can be applied to other publishers.