Closed tsgit closed 8 years ago
This addresses a problem with aps harvest where the abstract contains x<min{...}
where min
is the "minimum" and not a mathml tag <mi>
In [51]: escape_for_xml(s.unescape(s.get_data()), tags_to_keep=s.mathml_elements)
Out[51]: u'Charged rotating Kerr-Newman black holes are known to be\nsuperradiantly unstable to perturbations of charged massive bosonic fields\nwhose proper frequencies lie in the bounded regime 0<\u03c9<min{\u03c9c\u2261m\u03a9H+q\u03a6H,\u03bc}, where
vs.
In [52]: escape_for_xml(s.unescape(s.get_data()), tags_to_keep=None)
Out[52]: u'Charged rotating Kerr-Newman black holes are known to be\nsuperradiantly unstable to perturbations of charged massive bosonic fields\nwhose proper frequencies lie in the bounded regime 0<\u03c9<min{\u03c9c\u2261m\u03a9H+q\u03a6H,\u03bc}, where
you see how the former has <min{..}
, while the latter correctly has <min{...}
mmh, I wonder about the regexp overall
In [72]: r"(<)(?!([\/]?{0})\b)".format("|[\/]?".join(s.mathml_elements))
Out[72]: '(<)(?!([\\/]?mroot|[\\/]?mstyle|[\\/]?mover|[\\/]?msubsup|[\\/]?mrow|[\\/]?annotation-xml|[\\/]?mphantom|[\\/]?mmultiscripts|[\\/]?msqrt|[\\/]?annotation|[\\/]?mpadded|[\\/]?mtable|[\\/]?munder|[\\/]?math|[\\/]?mfenced|[\\/]?mspace|[\\/]?msup|[\\/]?mfrac|[\\/]?munderover|[\\/]?mtext|[\\/]?msub|[\\/]?semantics|[\\/]?none|[\\/]?merror|[\\/]?mtr|[\\/]?mo|[\\/]?mn|[\\/]?mi|[\\/]?maction|[\\/]?mprescripts|[\\/]?mtd)\\b)'
I think this is equivalent and more efficient?
In [73]: r"(<)(?![\/]?({0})\b)".format("|".join(s.mathml_elements))
Out[73]: '(<)(?![\\/]?(mroot|mstyle|mover|msubsup|mrow|annotation-xml|mphantom|mmultiscripts|msqrt|annotation|mpadded|mtable|munder|math|mfenced|mspace|msup|mfrac|munderover|mtext|msub|semantics|none|merror|mtr|mo|mn|mi|maction|mprescripts|mtd)\\b)'
I agree with you that it looks equivalent and simpler. Is this the reason why the APS harvesting is currently crashing? Thanks a lot for having looked into this!
yes, @kaplun it's the reason for the aps harvest error.
what happens is that aps-fulltext -> marcxml -> bibrecord(marcxml) returns None because bibrecord fails due to unescaped '<' in the marcxml and then subsequent steps throw exception about Nonetype not iterable
python /usr/lib/python2.6/site-packages/harvestingkit/aps_package.py /afs/cern.ch/project/inspire/uploads/aps/2016.08.23-2/apsharvest_unzip_xT7IcX/articlebag-10-1103-PhysRevD-94-044036-apsxml/data/PhysRevD.94.044036/fulltext.xml > /tmp/brokenaps.xml
is the current faulty output, containing an unescaped '<'
... in the bounded regime 0<ω<min{ωc≡mΩH+qΦH,μ}, where
Looks good to me too :+1:
I am making a new release. So that we can include this.
Signed-off-by: Thorsten Schwander thorsten.schwander@gmail.com