ContentMine / ami

ami2-regex: XPath not working #73

Open larsgw opened 7 years ago

larsgw commented 7 years ago

Every result in the ami2-regex output carries the same XPath, even though the matches occur at different places in the document. The shared value appears to be the XPath of the last match. Looking closer, the last match itself has no xpath attribute at all in results.xml.

Software versions

$ getpapers -V
0.4.14

$ norma --version
norma(0.3.1)
norma(0.3.1)

$ ami2-regex --version
regex(null)
regex(null)

# (from the 0.2.24 .deb release)

Steps

$ mkdir tmp
$ getpapers -q PMCID:PMC4833924 -x -o tmp
$ norma -p . -i fulltext.xml -o scholarly.html --transform nlm2html
$ ami2-regex --project tmp --context 25 25 -i scholarly.html --r.regex regex.xml

regex.xml

<compoundRegex title="jrc">
  <regex fields="jrc">NM[-]?\d\d\d</regex>
</compoundRegex>

Output

PMC4833924/results/regex/jrc/results.xml

<?xml version="1.0" encoding="UTF-8"?>
<results title="jrc">
 <result pre=" 7 April 2016). Since Ag " name0="jrc" value0="NM-300" post="K was provided as dispers" xpath="/html[1]/body[1]/div[2]/div[5]/p[3]"/>
 <result pre="ispersant alone, i.e. Ag " name0="jrc" value0="NM-300" post="K DIS, was assessed (NM-x" xpath="/html[1]/body[1]/div[2]/div[5]/p[3]"/>
...
 <result pre="as. The dispersant of Ag " name0="jrc" value0="NM-300" post="K alone (that does not co" xpath="/html[1]/body[1]/div[2]/div[5]/p[3]"/>
 <result pre="e’ in the BCOP assay; Ag " name0="jrc" value0="NM-300" post="K DIS was assessed as ‘no"/>
</results>
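
A quick way to confirm the symptom without reading the file by eye is to count the distinct xpath values and the results that lack one. The following is only a minimal sketch: the helper name `summarize_xpaths` and the `SAMPLE` string are made up for illustration, and `SAMPLE` is an abridged copy of the output above (the elided results are left out), not the full file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Abridged copy of the results.xml shown above; the elided results are omitted.
SAMPLE = """<results title="jrc">
 <result pre=" 7 April 2016). Since Ag " name0="jrc" value0="NM-300" post="K was provided as dispers" xpath="/html[1]/body[1]/div[2]/div[5]/p[3]"/>
 <result pre="ispersant alone, i.e. Ag " name0="jrc" value0="NM-300" post="K DIS, was assessed (NM-x" xpath="/html[1]/body[1]/div[2]/div[5]/p[3]"/>
 <result pre="as. The dispersant of Ag " name0="jrc" value0="NM-300" post="K alone (that does not co" xpath="/html[1]/body[1]/div[2]/div[5]/p[3]"/>
 <result pre="e’ in the BCOP assay; Ag " name0="jrc" value0="NM-300" post="K DIS was assessed as ‘no"/>
</results>"""


def summarize_xpaths(xml_data):
    """Hypothetical helper: return (Counter of xpath values, count of results without xpath)."""
    root = ET.fromstring(xml_data)
    results = root.findall("result")
    # Tally each xpath value that is actually present on a <result> element.
    xpaths = Counter(r.get("xpath") for r in results if r.get("xpath") is not None)
    # Count <result> elements that have no xpath attribute at all.
    missing = sum(1 for r in results if r.get("xpath") is None)
    return xpaths, missing


if __name__ == "__main__":
    xpaths, missing = summarize_xpaths(SAMPLE)
    print("distinct xpath values:", len(xpaths))
    for value, count in xpaths.items():
        print(f"  {count} x {value}")
    print("results without xpath:", missing)
```

To run it against the real output, pass the file contents instead of `SAMPLE`, for example `summarize_xpaths(open("PMC4833924/results/regex/jrc/results.xml", "rb").read())`. A single distinct xpath value together with exactly one result without an xpath would match the behaviour described in this issue.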