ContentMine / norma

Convert XML/SVG/PDF into normalised, sectioned, scholarly HTML
Apache License 2.0

Can norma convert from fulltext.html to scholarly.html? #23

Open robintw opened 8 years ago

robintw commented 8 years ago

I've been using norma quite effectively with an OA journal that provides nice fulltext.xml files for each of its articles.

However, I'm now trying to work with a non-OA journal which doesn't provide XML files, and so I've just got a fulltext.html file and a fulltext.pdf file.

I haven't managed to find a proper list/description of the transforms that can be used with norma (maybe I haven't looked in the right place), but is there any transform that will help me convert my HTML or PDF to scholarly.html files?

petermr commented 8 years ago

Thanks Robin for interest and perseverance. YES, we can usually convert from HTML to scholarly HTML. See: http://discuss.contentmine.org/t/overall-architecture/142/2 (slides 8/9). The "HTML" created by publishers is hugely variable. Some is excellent semantic XHTML (balanced tags); some is reasonably conformant HTML5. We can convert both of these happily with standard tools, and those are in the norma transforms. The problems are with:

(a) HTML + JS. Some "HTML" is mainly JS which loads pages dynamically, so we have to use a headless browser, often recursively. The "DOM" that results is not an XHTML DOM but may be tractable.

(b) really bad HTML. We have found smart quotes instead of plain quotes, meaningless namespaces, etc.

I have linked three Java tidiers into norma: JTidy, JSoup, and HtmlUnit. I am not sure all of these are supported. It also makes sense to use tidiers in other languages. Hopefully we can come up with per-publisher hacks.

Other problems are GIFs standing in for high code points, especially in maths, and horrors such as "/" overstriking "=" for not-equals. But they are gradually getting fewer.

Once you have transformed to HTML it may well need reprocessing. If end tags are missing we may get badly nested lists, for example. A single `<br>` is sometimes used as a paragraph separator.
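A pass that repairs this is straightforward. Here is a minimal illustrative sketch (not norma's actual code; the class name is made up) that rewrites `<br>`-separated prose into real paragraphs:

```java
// Illustrative sketch only - NOT norma's code. Rewrites "<br>"-separated
// prose into real <p> paragraphs so downstream sectioning sees elements,
// not bare line breaks.
public class BrToParagraphs {
    public static String fix(String body) {
        // split on <br>, <br/>, <BR>, etc. (case-insensitive)
        String[] parts = body.split("(?i)<br\\s*/?>");
        StringBuilder out = new StringBuilder();
        for (String p : parts) {
            String trimmed = p.trim();
            if (!trimmed.isEmpty()) {
                out.append("<p>").append(trimmed).append("</p>");
            }
        }
        return out.toString();
    }
}
```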

Then it is often unclear where sections start and end. For example:

<h2>big title</h2>
<p>stuff</p>
<p>more stuff</p>
<h3>little title</h3>
<p>yet more stuff</p>
<h2>second big title</h2>
<p>more stuff</p>

This displays to the human eye as sections (just!), but it has to be gathered into explicit machine-readable sections.
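For illustration, here is a minimal sketch (names hypothetical, not norma's actual code) of gathering such a flat run of headings and paragraphs into nested `<div>` sections:

```java
import java.util.*;

// Illustrative sketch only - NOT norma's code. Gathers a flat run of
// h2/h3/p elements into nested <div> sections: a new heading closes
// every open section at the same or deeper level.
public class SectionGatherer {
    // each element is {tagName, textContent}, e.g. {"h2", "big title"}
    public static String gather(List<String[]> elements) {
        StringBuilder out = new StringBuilder();
        Deque<Integer> openLevels = new ArrayDeque<>();
        for (String[] e : elements) {
            String tag = e[0], text = e[1];
            if (tag.equals("h2") || tag.equals("h3")) {
                int level = tag.charAt(1) - '0';
                // close the sections that this heading terminates
                while (!openLevels.isEmpty() && openLevels.peek() >= level) {
                    out.append("</div>");
                    openLevels.pop();
                }
                out.append("<div><").append(tag).append(">")
                   .append(text).append("</").append(tag).append(">");
                openLevels.push(level);
            } else {
                // paragraphs stay inside the current section
                out.append("<").append(tag).append(">")
                   .append(text).append("</").append(tag).append(">");
            }
        }
        while (!openLevels.isEmpty()) {
            out.append("</div>");
            openLevels.pop();
        }
        return out.toString();
    }
}
```

Run on the example above, this wraps "little title" and its paragraph inside the first "big title" section, and starts a sibling section at "second big title".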

I have written lots of code to do this, and so have others.

robintw commented 8 years ago

Thanks @petermr that's really useful to know.

I'm currently trying to convert the fulltext.html from http://www.tandfonline.com/doi/full/10.1080/01431161.2014.971469#abstract to scholarly.html. Do you have any suggestions as to which transforms I should start with?

I've tried running some of the 'cleaning' transforms that you mentioned (jsoup, jtidy etc), but I get the following output for all of them:

norma -q . -i fulltext.pdf -o scholarly.html --transform jsoup

0 [main] ERROR org.xmlcml.norma.NormaTransformer - no transforms given/parsed

Once I have clean HTML (which, as you say, I could get using other tools), how should I run norma to get scholarly.html?

petermr commented 8 years ago

norma -q . -i fulltext.pdf -o scholarly.html --transform jsoup 0 [main] ERROR

Are you starting with PDF or HTML? If HTML, you should use one of the three tidiers I mentioned. If PDF, the transform should be pdf2txt with TXT output.

org.xmlcml.norma.NormaTransformer - no transforms given/parsed: this is a spurious error; ignore it. I thought we had removed it.

List what you want to do and I'll see what the current code will do.

robintw commented 8 years ago

Thanks.

I'm starting with fulltext.html (from the URL I gave in my last message) and trying to get scholarly.html.

When running this: norma -q . -i fulltext.html -o scholarly.html --transform jsoup, I get the output:

0    [main] ERROR org.xmlcml.norma.NormaTransformer  - no transforms given/parsed
.

I thought the . was a sort of progress bar saying norma had processed something, but there is no scholarly.html in the output directory. There is also nothing in the logs in the target directory. Is there a way to turn on some sort of verbose/debug logging?

I've tried with --transform set to jsoup, jtidy and htmlunit and I get the same results for all of them: the error about no transforms given and no scholarly.html output.

robintw commented 8 years ago

Has there been any progress on this?

I'd love to be able to take fulltext.html (from http://www.tandfonline.com/doi/full/10.1080/01431161.2014.971469#abstract) and convert it to scholarly.html.

petermr commented 8 years ago

I suggest that you work with the development branch of norma. Tom is in charge of this, and you and I can branch and make pull requests.

The message no transforms given/parsed is misleading and I think has been withdrawn.

If you can commit a (failing) test with a CProject of (say) at most 2 CTrees, then we can jointly debug this. I think it's a general problem, not just T+F. I am keen we get norma cleaned up!

P.
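For anyone following along: a CProject is just a directory containing one CTree (sub-directory) per paper. A minimal sketch of the layout such a test fixture would package (directory names here are illustrative):

```shell
# Hypothetical minimal CProject layout (directory names illustrative):
# one CTree per paper, each holding that paper's fulltext.html
mkdir -p myproject/PMC0001 myproject/PMC0002
touch myproject/PMC0001/fulltext.html
touch myproject/PMC0002/fulltext.html
# norma is then pointed at the project directory, e.g.:
# norma --project myproject -i fulltext.html -o scholarly.html --html jsoup
```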


Peter Murray-Rust Reader in Molecular Informatics Unilever Centre, Dep. Of Chemistry University of Cambridge CB2 1EW, UK +44-1223-763069

tarrow commented 8 years ago

The error message still happens when it can't find a transform or a stylesheet. It just used to also happen when it was only passed a stylesheet and no transform (these internals are a little bit unclear to an outsider)

I'm not sure that @robintw has built from source, and in any case he (and we) probably can't distribute CTrees containing closed-access papers. If you could make a test that fails (using JUnit) on an open-access source, that would be fairly awesome, but if you're not a Java developer then don't sweat it.

I can always try and replicate the bug on the open literature so we can distribute a test.

petermr commented 8 years ago

I think we can find some Open Access T+F papers...



robintw commented 8 years ago

TandF_OA_Examples.tar.gz is a compressed directory containing two open-access papers scraped from the T&F journal International Journal of Remote Sensing. It should work as an example of the problems I'm having.

I have tried running norma -q . -i fulltext.html -o scholarly.html --transform jsoup in that directory (this may or may not be the right command to try and run), and I get the output:

0    [main] ERROR org.xmlcml.norma.NormaTransformer  - no transforms given/parsed
.

I can manage little bits of Java programming if necessary, but I don't have any experience of JUnit and would rather not get too far into that...!

petermr commented 8 years ago

@robintw The commands are badly documented. I tested:

        File targetDir = new File("target/tutorial/tf");
        CMineTestFixtures.cleanAndCopyDir(new File("src/test/resources/org/xmlcml/norma/pubstyle/tf/TandF_OA_Test"), targetDir);
        String args = "--project "+targetDir+" -i fulltext.html -o scholarly.html --html jsoup";
        DefaultArgProcessor argProcessor = new NormaArgProcessor(args); 
        argProcessor.runAndOutput(); 

and got an output: so try:

norma -q . -i fulltext.html -o scholarly.html --html jsoup

This is a mess. I think we should bring html2scholarly back to --transform and allow what you wrote.

petermr commented 8 years ago

Note that the result is nowhere near ScholarlyHTML - it's a mess, because it needs structuring. But at least it's well-formed.

robintw commented 8 years ago

Thanks a lot - it's definitely progress to be able to get it to run!

As you probably saw, it gives an error (.!1 [main] ERROR org.xmlcml.norma.NormaArgProcessor - Cannot parse HTML), but at least it produces the output.

I had a look at the output and found that it wasn't very close to scholarly HTML! What is the best way forward to try and improve the structuring of the output to get it as close to 'proper' scholarly HTML as possible? I'm happy to have a play with bits of code if you can point me in the right direction.

petermr commented 8 years ago

I think it needs an XSL stylesheet to snip off the useless publisher stuff. It's tedious rather than difficult. It's relatively easy if you know it's a T+F paper - harder to recognize that it is one in a mixed bunch. How keen are you on this? I can give you a start and suggest you add the details. It's not hard once I have started it. I also need to think of the command to do it... more tomorrow.
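As an illustrative starting point (the element and class names below are guesses, not the real T+F markup), such a stylesheet can be an identity transform plus templates that delete the cruft:

```xml
<?xml version="1.0"?>
<!-- Hypothetical sketch: copy everything, but drop scripts, styles
     and obvious publisher chrome. Class/id names are placeholders. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml">

  <!-- identity: copy everything by default -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>

  <!-- snip scripts and stylesheets -->
  <xsl:template match="h:script|h:style|h:link"/>

  <!-- snip typical chrome containers (placeholder names) -->
  <xsl:template match="*[@id='header' or @id='footer' or @class='advert']"/>
</xsl:stylesheet>
```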

petermr commented 8 years ago

TIDY HTML. I think we are going to have to have three or four passes for normalizing HTML.

  1. turn non-wellformed HTML into wellformed using standard tools (tidy, htmlunit, jsoup, etc.)
  2. remove publisher cruft - scripts, advertising, publisher self-promotion, metrics
  3. create a structured document (e.g. nested divs)
  4. tag sections and create conformant ScholarlyHTML
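In code, those passes might compose like this minimal sketch (class name hypothetical; each pass is just a string-to-string function run in order):

```java
import java.util.*;
import java.util.function.UnaryOperator;

// Hypothetical shape of a multi-pass HTML normalizer: each pass maps
// HTML text to HTML text, and the passes run in registration order
// (tidy, then strip cruft, then structure, then tag sections).
public class HtmlNormalizer {
    private final List<UnaryOperator<String>> passes = new ArrayList<>();

    public HtmlNormalizer add(UnaryOperator<String> pass) {
        passes.add(pass);
        return this; // fluent, so passes can be chained
    }

    public String run(String html) {
        for (UnaryOperator<String> pass : passes) {
            html = pass.apply(html);
        }
        return html;
    }
}
```

Usage would be something like `new HtmlNormalizer().add(tidyPass).add(stripCruftPass).run(rawHtml)`, with each real pass delegating to a tool such as JSoup or an XSL transform.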

I think we should have a separate issue for this... and will create it.