ContentMine / norma

Convert XML/SVG/PDF into normalised, sectioned, scholarly HTML
Apache License 2.0

Recoverable error warnings #22

Closed blahah closed 8 years ago

blahah commented 8 years ago

When running over a large number of files I see this a lot:

Recoverable error
  XTRE0540: Ambiguous rule match for /article/body[1]/sec[3]/sec[3]/title[1]
Matches both "*[local-name()='sec' and
  *[local-name()='title']]/*[local-name()='sec']/*[local-name()='title']" on line -1 of
and "*[local-name()='alternatives'] | *[local-name()='article'] | *[local-name()='back'] |
  *[local-name()='caption'] | *[local-name()='col'] | *[local-name()='colgroup'] |
  *[local-name()='etal'] | *[local-name()='fig'] | *[local-name()='fn'] |
  *[local-name()='graphic'] | *[local-name()='hr'] | *[local-name()='label'] |
  *[local-name()='name'] | *[local-name()='p'] | *[local-name()='sc'] |
  *[local-name()='sub'] | *[local-name()='sup'] | *[local-name()='table'] |
  *[local-name()='table-wrap'] | *[local-name()='table-wrap-foot'] | *[local-name()='tbody']
  | *[local-name()='td'] | *[local-name()='tfoot'] | *[local-name()='th'] |
  *[local-name()='title'] | *[local-name()='th'] | *[local-name()='thead'] |
  *[local-name()='tr']" on line -1 of

It's a very verbose message that means nothing to me as the user - I suggest we suppress it. We should only show messages that help the user understand what to do - and if things are mostly fine we should stay silent.

tarrow commented 8 years ago

This is definitely the case. I'll be putting a new release out tomorrow or the next day which suppresses these parsing errors.

On Mon, Mar 28, 2016 at 9:15 AM, Richard Smith-Unna < notifications@github.com> wrote:


petermr commented 8 years ago


It isn't me that outputs it; it's Xerces. This is a well-known problem: some Xerces routines write directly to stderr and are difficult to suppress.

The main cause is stylesheets which can be interpreted ambiguously. The cure is to write unambiguous stylesheets.
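As a rough illustration of what "unambiguous" could mean here: in the message above, both matching patterns get the same default priority (0.5), so the processor cannot choose between them. One common way to remove the tie is to give the more specific rule an explicit `priority`. The sketch below is not norma's actual stylesheet; the two-rule stylesheet, the class name `PriorityDemo`, and the sample XML are hypothetical, and it runs on whatever XSLT processor the JDK provides by default.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class PriorityDemo {
    // Hypothetical stylesheet: a specific rule for titles of nested sections
    // and a generic rule that also matches every title. Without priority='1'
    // a nested title would match both rules at the same default priority.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template priority='1' "
      + "match=\"*[local-name()='sec']/*[local-name()='sec']/*[local-name()='title']\">"
      + "<h3><xsl:value-of select='.'/></h3></xsl:template>"
      + "<xsl:template match=\"*[local-name()='title'] | *[local-name()='p']\">"
      + "<div><xsl:value-of select='.'/></div></xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to an XML string and returns the serialized result.
    static String transform(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<article><body><sec><title>A</title>"
                   + "<sec><title>B</title></sec></sec></body></article>";
        System.out.println(transform(xml));
    }
}
```

With the explicit priority, the nested title "B" should be handled by the specific rule and the top-level title "A" by the generic one, so no tie is left for the processor to report.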

However, I am moving to procedural code rather than stylesheets, as it is a lot faster and more flexible, and this sort of problem does not occur.

However, we are at the mercy of how well the JATS XML we receive conforms. The message appears when the creators have omitted the namespace or got it wrong. I have processed ca. 1000 files and think I have tamed most of the problem.


Peter Murray-Rust Reader in Molecular Informatics Unilever Centre, Dep. Of Chemistry University of Cambridge CB2 1EW, UK +44-1223-763069

blahah commented 8 years ago

Ah, annoying that Xerces behaves so badly!

Could we wrap it in something that captures stdout while Xerces runs, then sets it back to normal afterwards, like this:

http://stackoverflow.com/a/8708357/835428

blahah commented 8 years ago

(or stderr, whichever is the problem)
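The capture-and-restore idea could look roughly like the following. This is a minimal sketch, not code from norma: the class `StderrCapture` and the method `captureStderr` are hypothetical names, and it uses only the standard `System.setErr` call. Note that the swap is process-wide, so output from other threads is captured too, and any library that cached a reference to `System.err` before the swap will bypass it.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

class StderrCapture {
    // Runs the task with System.err redirected to an in-memory buffer,
    // restores the original stream afterwards, and returns what was written.
    static synchronized String captureStderr(Runnable task) {
        PrintStream original = System.err;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream capture = new PrintStream(buffer, true);
        System.setErr(capture);
        try {
            task.run();
        } finally {
            // Always restore stderr, even if the task throws.
            capture.flush();
            System.setErr(original);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        String noise = captureStderr(() -> System.err.print("Recoverable error ..."));
        System.out.println("suppressed " + noise.length() + " characters of stderr output");
    }
}
```

The captured text could then be discarded, or written to a log at debug level so that it stays available when something really goes wrong.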

petermr commented 8 years ago

It should be possible to switch on stderr capture before Xerces starts and switch it off afterwards. However, we don't call Xerces directly; we call the javax Transformer library, which wraps the parser and the transformer. In principle, those could differ from system to system.
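Because the calls go through `javax.xml.transform`, another avenue that might be worth trying is the standard `ErrorListener` hook, which `TransformerFactory` and `Transformer` both accept via `setErrorListener`. Whether the messages in this issue are actually routed through that hook, rather than written straight to stderr, is an assumption that would need checking against the processor in use. The class name `CollectingErrorListener` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.ErrorListener;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;

class CollectingErrorListener implements ErrorListener {
    // Warnings and recoverable errors are kept here instead of being printed.
    final List<String> messages = new ArrayList<>();

    @Override
    public void warning(TransformerException e) {
        messages.add(e.getMessageAndLocation());
    }

    @Override
    public void error(TransformerException e) {
        messages.add(e.getMessageAndLocation());
    }

    @Override
    public void fatalError(TransformerException e) throws TransformerException {
        // Fatal problems should still stop the transformation.
        throw e;
    }

    public static void main(String[] args) throws Exception {
        CollectingErrorListener listener = new CollectingErrorListener();
        TransformerFactory factory = TransformerFactory.newInstance();
        factory.setErrorListener(listener);        // stylesheet compilation
        Transformer transformer = factory.newTransformer();
        transformer.setErrorListener(listener);    // the transformation itself
        System.out.println("collected " + listener.messages.size() + " messages");
    }
}
```

Compared with swapping `System.err`, this keeps the suppression local to one transformer and leaves unrelated stderr output untouched.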

Another approach would be to discard Javax and use Saxon. That's been very well designed and is Open Source and more powerful.
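If Saxon were adopted, the processor can be pinned explicitly instead of relying on whatever the JAXP lookup finds on a given system. The sketch below assumes the Saxon jar is on the classpath and uses its JAXP factory class `net.sf.saxon.TransformerFactoryImpl`; the class `FactoryChooser` and the fallback behaviour are illustrative choices, not something decided in this thread.

```java
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.TransformerFactoryConfigurationError;

class FactoryChooser {
    // JAXP factory class shipped with Saxon.
    static final String SAXON_FACTORY = "net.sf.saxon.TransformerFactoryImpl";

    // Returns a Saxon factory when Saxon is available, otherwise the
    // platform default, so the same code runs on systems without Saxon.
    static TransformerFactory preferSaxon() {
        try {
            return TransformerFactory.newInstance(SAXON_FACTORY, null);
        } catch (TransformerFactoryConfigurationError e) {
            return TransformerFactory.newInstance();
        }
    }

    public static void main(String[] args) {
        System.out.println("using " + preferSaxon().getClass().getName());
    }
}
```

Naming the factory class directly addresses the point that the wrapped parser and transformer could otherwise differ from system to system.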

tarrow commented 8 years ago

This is fixed as of the latest release.