Open petermr opened 8 years ago
Made good progress yesterday on Taylor and Francis (whose HTML is full of cruft and repeated text). Here's a brief stylesheet:
<xsl:output method="xhtml"/>
<!-- Identity template; comments are stripped by the empty comment() template below (PIs are still copied) -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()" />
</xsl:copy>
</xsl:template>
<xsl:template match="comment()" priority="1.0"/>
<!-- delete these sections and snippets -->
<!-- header -->
<xsl:template match="h:div[@id='hd']"/>
<xsl:template match="h:div[@id='cookieBanner']"/>
<xsl:template match="h:div[@id='primarySubjects']"/>
<xsl:template match="h:div[@id='breadcrumb']"/>
<xsl:template match="h:div[@class='gutter' and h:div[contains(@class,'accordianPanel')]]"/>
<xsl:template match="h:div/h:b/h:a[.='Publishing models and article dates explained']"/>
<xsl:template match="h:div[contains(@class,'script_only')]"/>
<xsl:template match="h:div[contains(@class,'access')]"/>
<xsl:template match="h:div[contains(@class,'secondarySubjects')]"/>
<xsl:template match="h:ul[@class='recommend']"/>
<xsl:template match="h:div[@id='unit3']"/>
<xsl:template match="h:h3[.='Related articles']"/>
<xsl:template match="h:a[.='View all related articles']"/>
<xsl:template match="h:div[contains(@class,'social')]"/>
<xsl:template match="h:ul[contains(@class,'tabsNav')]"/>
<xsl:template match="h:div[@id='siteInfo']"/>
<xsl:template match="h:div[contains(@class,'credits')]"/>
<xsl:template match="h:a[starts-with(.,'[') and ends-with(.,']')]"/>
<xsl:template match="h:a[.='View all references']"/>
<xsl:template match="h:div[normalize-space(.)='']" priority="0.51"/>
That's great - thanks! How do I go about applying this stylesheet to the HTML? Is there a method already built into one of the ContentMine tools (e.g. norma), or do I need to do this separately?
It's built into Norma. I think the production version works. It needs two passes: the first to create XHTML, the second to strip cruft (and a third to normalize the XHTML to formal SHTML).
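For anyone who wants to see what the cruft-stripping pass amounts to outside Norma, here is a minimal sketch using the JDK's built-in XSLT processor. It is not Norma's code: the class `CruftStripDemo`, its `transform` method, and the cut-down stylesheet are made up for illustration. The sketch uses `method="xml"` because the `xhtml` output method belongs to XSLT 2.0 and the JDK processor only implements 1.0.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper (not part of Norma): applies a cruft-stripping stylesheet to XHTML. */
final class CruftStripDemo {

    // Cut-down stylesheet in the style of the one above: an identity copy
    // plus empty templates that delete comments and one unwanted div.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:h='http://www.w3.org/1999/xhtml'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='comment()' priority='1.0'/>"
      + "<xsl:template match=\"h:div[@id='cookieBanner']\"/>"
      + "</xsl:stylesheet>";

    /** Runs the stylesheet text over the XHTML text and returns the result as a string. */
    static String transform(String xsl, String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transform failed", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
            + "<div id='cookieBanner'>We use cookies</div>"
            + "<!-- tracking --><p>Article text</p></body></html>";
        // The banner div and the comment are dropped; the paragraph survives.
        System.out.println(transform(XSL, xhtml));
    }
}
```

The input must already be well-formed, namespaced XHTML, which is why the HTML-to-XHTML pass has to come first.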
Here's my test:
// Work on a fresh copy of the test project so the source files stay untouched
File targetDir = new File("target/tutorial/tf");
CMineTestFixtures.cleanAndCopyDir(new File("src/test/resources/org/xmlcml/norma/pubstyle/tf/TandF_OA_Test"), targetDir);
// Pass 1: convert raw HTML to well-formed XHTML with jsoup
String args = "--project "+targetDir+" -i fulltext.html -o fulltext.xhtml --html jsoup";
DefaultArgProcessor argProcessor = new NormaArgProcessor(args);
argProcessor.runAndOutput();
CProject project = new CProject(targetDir);
CTree ctree0 = project.getCTreeList().get(0);
File xhtml = ctree0.getExistingFulltextXHTML();
Assert.assertTrue("xhtml: ", xhtml.exists());
// Pass 2: strip the cruft with the tf2html stylesheet to produce scholarly.html
args = "--project "+targetDir+" -i fulltext.xhtml -o scholarly.html --transform tf2html";
argProcessor = new NormaArgProcessor(args);
argProcessor.runAndOutput();
File shtml = ctree0.getExistingScholarlyHTML();
Assert.assertTrue("shtml: ", shtml.exists());
That SHOULD translate into these command lines:
norma --project . -i fulltext.html -o fulltext.xhtml --html jsoup
norma --project . -i fulltext.xhtml -o scholarly.html --transform tf2html
You then need the updated symbol file stylesheetByName.xsl:
<stylesheetList>
<stylesheet name="bmc2html">/org/xmlcml/norma/pubstyle/bmc/xml2html.xsl</stylesheet>
<stylesheet name="ieee2html">/org/xmlcml/norma/pubstyle/ieee/toHtml.xsl</stylesheet>
<stylesheet name="ncbi-jats2html">/org/xmlcml/norma/pubstyle/nlm/ncbi/jats-html.xsl</stylesheet>
<stylesheet name="nlm2html">/org/xmlcml/norma/pubstyle/nlm/toHtml.xsl</stylesheet>
<stylesheet name="jats2shtml">/org/xmlcml/norma/pubstyle/nlm/jats/jats2shtml.xsl</stylesheet>
<stylesheet name="nature2html">/org/xmlcml/norma/pubstyle/nature/toHtml.xsl</stylesheet>
<stylesheet name="hind2xml">/org/xmlcml/norma/pubstyle/hindawi/groupMajorSections.xsl</stylesheet>
<stylesheet name="tf2html">/org/xmlcml/norma/pubstyle/tf/toHtml.xsl</stylesheet>
<!-- patents -->
<stylesheet name="uspto2html">/org/xmlcml/norma/patents/uspto/toHtml.xsl</stylesheet>
</stylesheetList>
and the stylesheet toHtml.xsl itself:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:h="http://www.w3.org/1999/xhtml">
<xsl:output method="xhtml"/>
<xsl:template match="/">
<xsl:apply-templates />
</xsl:template>
<!-- Identity template; comments are stripped by the empty comment() template below (PIs are still copied) -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()" />
</xsl:copy>
</xsl:template>
<xsl:template match="comment()" priority="1.0"/>
<!-- header -->
<xsl:template match="h:div[@id='hd']"/>
<xsl:template match="h:div[@id='cookieBanner']"/>
<xsl:template match="h:div[@id='primarySubjects']"/>
<xsl:template match="h:div[@id='breadcrumb']"/>
<xsl:template match="h:div[@class='gutter' and h:div[contains(@class,'accordianPanel')]]"/>
<xsl:template match="h:div/h:b/h:a[.='Publishing models and article dates explained']"/>
<xsl:template match="h:div[contains(@class,'script_only')]"/>
<xsl:template match="h:div[contains(@class,'access')]"/>
<xsl:template match="h:div[contains(@class,'secondarySubjects')]"/>
<xsl:template match="h:ul[@class='recommend']"/>
<xsl:template match="h:div[@id='unit3']"/>
<xsl:template match="h:h3[.='Related articles']"/>
<xsl:template match="h:a[.='View all related articles']"/>
<xsl:template match="h:div[contains(@class,'social')]"/>
<xsl:template match="h:ul[contains(@class,'tabsNav')]"/>
<xsl:template match="h:div[@id='siteInfo']"/>
<xsl:template match="h:div[contains(@class,'credits')]"/>
<xsl:template match="h:a[starts-with(.,'[') and ends-with(.,']')]"/>
<xsl:template match="h:a[.='View all references']"/>
<xsl:template match="h:a[.='figureViewerArticleInfo']"/>
<xsl:template match="h:div[@class='hidden']"/>
<xsl:template match="h:span[contains(@class,'dropDownAlt')]"/>
<xsl:template match="h:a[contains(@onclick,'showFigures')]"/>
<xsl:template match="h:div[@class='figureDownloadOptions']"/>
<xsl:template match="h:div[normalize-space(.)='']" priority="0.51"/>
</xsl:stylesheet>
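One caveat on the `ends-with()` match above: `ends-with` is an XPath 2.0 function, so the stylesheet presumably has to run under a 2.0 processor such as Saxon even though it declares version 1.0. If it ever needs to run under a 1.0-only processor, one possible substitute is a `substring`/`string-length` test. The sketch below checks that expression with the JDK's XPath 1.0 engine; `BracketLinkTest` and `isBracketed` are hypothetical names, not anything in Norma.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

/** Hypothetical check (not part of Norma) of an XPath 1.0 stand-in for ends-with(). */
final class BracketLinkTest {

    // XPath 1.0 equivalent of: starts-with(., '[') and ends-with(., ']')
    // substring(., string-length(.)) yields the last character of the string value.
    static final String IS_BRACKETED =
        "starts-with(., '[') and substring(., string-length(.)) = ']'";

    /** Returns true if the root element's text starts with '[' and ends with ']'. */
    static boolean isBracketed(String elementXml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Apply the predicate to the document element of the parsed snippet.
            return (Boolean) xpath.evaluate(
                "boolean(/*[" + IS_BRACKETED + "])",
                new InputSource(new StringReader(elementXml)),
                XPathConstants.BOOLEAN);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isBracketed("<a>[12]</a>"));
        System.out.println(isBracketed("<a>View all references</a>"));
    }
}
```

In the stylesheet the same predicate would read `h:a[starts-with(.,'[') and substring(., string-length(.)) = ']']`.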
See https://github.com/ContentMine/norma/blob/master/docs/TRANSFORM.md. Please check whether this works and comment. The same can be done for other publishers.
Many/most HTML from publishers includes large amounts of material not relevant to the scholarly narrative. These include:
Much of this can be managed by XSLT stylesheets which "snip off" this cruft. I don't think there is a simple way of tackling this - it has to be a per-publisher or per-journal solution. That means we need a way of locating and using stylesheets from the command line.
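As an illustration of locating a stylesheet by name, here is a rough sketch that reads a `<stylesheetList>` like the symbol file shown earlier into a name-to-path map. This is only a guess at the idea, not how Norma actually resolves `--transform` values; `StylesheetLookup` and `parse` are hypothetical names.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical lookup (not Norma's implementation): symbolic name -> stylesheet resource path. */
final class StylesheetLookup {

    /** Parses a stylesheetList document and maps each name attribute to its path text. */
    static Map<String, String> parse(String xml) {
        try {
            NodeList nodes = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getElementsByTagName("stylesheet");
            Map<String, String> byName = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element el = (Element) nodes.item(i);
                byName.put(el.getAttribute("name"), el.getTextContent().trim());
            }
            return byName;
        } catch (Exception e) {
            throw new IllegalStateException("cannot read stylesheet list", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<stylesheetList>"
            + "<stylesheet name=\"nature2html\">/org/xmlcml/norma/pubstyle/nature/toHtml.xsl</stylesheet>"
            + "<stylesheet name=\"tf2html\">/org/xmlcml/norma/pubstyle/tf/toHtml.xsl</stylesheet>"
            + "</stylesheetList>";
        // A command-line value such as "tf2html" would be resolved through the map.
        System.out.println(parse(xml).get("tf2html"));
    }
}
```

With a table like this, adding a publisher means adding one stylesheet file and one line to the symbol file, without touching the command-line handling.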
Ideally we need:
into
I propose XSLT and XPath for the first two. It's possible that the restructuring can also tackle 3; we'd need XSLT2 with Saxon.