CERNDocumentServer / harvesting-kit

A kit containing various utilities and scripts related to content harvesting used in Invenio Software (http://invenio-software.org) instances such as INSPIRE (http://inspirehep.net) and SCOAP3 (http://scoap3.org)
GNU General Public License v2.0
7 stars 18 forks source link

HarvestingKit: fix escaping #156

Closed tsgit closed 8 years ago

tsgit commented 8 years ago
* ensure tags end with word boundary, otherwise substrings
  might be mistaken for tags

Signed-off-by: Thorsten Schwander thorsten.schwander@gmail.com

tsgit commented 8 years ago

This addresses a problem with aps harvest where the abstract contains x<min{...} where min is the "minimum" and not a mathml tag <mi>

In [51]: escape_for_xml(s.unescape(s.get_data()), tags_to_keep=s.mathml_elements)
Out[51]: u'Charged rotating Kerr-Newman black holes are known to be\nsuperradiantly unstable to perturbations of charged massive bosonic fields\nwhose proper frequencies lie in the bounded regime 0&lt;\u03c9<min{\u03c9c\u2261m\u03a9H+q\u03a6H,\u03bc}, where

vs.

In [52]: escape_for_xml(s.unescape(s.get_data()), tags_to_keep=None)
Out[52]: u'Charged rotating Kerr-Newman black holes are known to be\nsuperradiantly unstable to perturbations of charged massive bosonic fields\nwhose proper frequencies lie in the bounded regime 0&lt;\u03c9&lt;min{\u03c9c\u2261m\u03a9H+q\u03a6H,\u03bc}, where

you see how the former has <min{..}, while the latter correctly has &lt;min{...}

coveralls commented 8 years ago

Coverage Status

Coverage remained the same at 67.469% when pulling 22c75525d049944733940d47004a9006ef84fb0c on tsgit:xml-escape-fix into 7580b7cfbaf2563770c40e6fd27c2fed45f02b11 on inspirehep:master.

tsgit commented 8 years ago

mmh, I wonder about the regexp overall

In [72]: r"(<)(?!([\/]?{0})\b)".format("|[\/]?".join(s.mathml_elements))
Out[72]: '(<)(?!([\\/]?mroot|[\\/]?mstyle|[\\/]?mover|[\\/]?msubsup|[\\/]?mrow|[\\/]?annotation-xml|[\\/]?mphantom|[\\/]?mmultiscripts|[\\/]?msqrt|[\\/]?annotation|[\\/]?mpadded|[\\/]?mtable|[\\/]?munder|[\\/]?math|[\\/]?mfenced|[\\/]?mspace|[\\/]?msup|[\\/]?mfrac|[\\/]?munderover|[\\/]?mtext|[\\/]?msub|[\\/]?semantics|[\\/]?none|[\\/]?merror|[\\/]?mtr|[\\/]?mo|[\\/]?mn|[\\/]?mi|[\\/]?maction|[\\/]?mprescripts|[\\/]?mtd)\\b)'

I think this is equivalent and more efficient?

In [73]: r"(<)(?![\/]?({0})\b)".format("|".join(s.mathml_elements))
Out[73]: '(<)(?![\\/]?(mroot|mstyle|mover|msubsup|mrow|annotation-xml|mphantom|mmultiscripts|msqrt|annotation|mpadded|mtable|munder|math|mfenced|mspace|msup|mfrac|munderover|mtext|msub|semantics|none|merror|mtr|mo|mn|mi|maction|mprescripts|mtd)\\b)'
kaplun commented 8 years ago

I agree with you that it looks equivalent and simpler. Is this the reason why the APS harvesting is currently crashing? Thanks a lot for having looked into this!

tsgit commented 8 years ago

yes, @kaplun it's the reason for the aps harvest error.

what happens is that aps-fulltext -> marcxml -> bibrecord(marcxml) returns None because bibrecord fails due to unescaped '<' in the marcxml and then subsequent steps throw exception about Nonetype not iterable

coveralls commented 8 years ago

Coverage Status

Coverage remained the same at 67.469% when pulling 4de60048b8abc11e5f4729517fd4dba3aa40216a on tsgit:xml-escape-fix into 7580b7cfbaf2563770c40e6fd27c2fed45f02b11 on inspirehep:master.

tsgit commented 8 years ago

python /usr/lib/python2.6/site-packages/harvestingkit/aps_package.py /afs/cern.ch/project/inspire/uploads/aps/2016.08.23-2/apsharvest_unzip_xT7IcX/articlebag-10-1103-PhysRevD-94-044036-apsxml/data/PhysRevD.94.044036/fulltext.xml > /tmp/brokenaps.xml

is the current faulty output, containing an unescaped '<'

... in the bounded regime 0&lt;ω<min{ωc≡mΩH+qΦH,μ}, where

david-caro commented 8 years ago

Looks good to me too :+1:

kaplun commented 8 years ago

I am making a new release. So that we can include this.