Closed pafonta closed 2 years ago
Like dicussed in the meeting, maybe fixing #425 will automatically fix this duplication issue.
After analysing this issue, I am wondering if we should not keep everything into one paragraph.
Concrete example:
Note: text
comes from /raid/projects/bbs/pmc/comm_use/Environ_Sci_Eur/PMC5602039.nxml
.
from defusedxml import ElementTree
text = """<p id="Par28">Due to the tight interconnection of drinking water and groundwater,
the latter is submitted to the same limit values (0.1 µg/L
for single substances and 0.5 µg/L for the sum of substances,
Groundwater Directive 2006/118/EC) [<xref ref-type="bibr" rid="CR23">23</xref>].
Although these directives fix the limit values, no criteria for the definition of
relevant and non-relevant metabolites were provided at that stage,
leading to uncertainty for regulators and notifiers. To avoid further misinterpretations,
a definition of the relevance criteria was finally delivered in 2000.
According to the EU DG Sanco Guidance, a metabolite is relevant if:
<list list-type="alpha-lower"><list-item><p id="Par29">It has comparable biological target activity
(≥50%) to the active substance, or</p></list-item><list-item><p id="Par30">
It has toxicological properties that are regarded as severe or unacceptable
(e.g., genotoxic, or classified as toxic or very toxic) [<xref ref-type="bibr" rid="CR24">24</xref>].
</p></list-item></list></p>"""
for p in ElementTree.fromstring(test).findall(".//p"):
print("".join(p.itertext()))
Outputs:
It has comparable biological target activity (≥50%) to the active substance, or
It has toxicological properties that are regarded as severe or unacceptable (e.g., genotoxic, or classified as toxic or very toxic) [24].
Why ?
Due to the tight interconnection of drinking water and groundwater, ...
Hi @EmilieDel,
I think the issue is coming from the use of itertext()
.
Indeed, using it as intended (doc) gives the expected results:
content = ElementTree.fromstring(text)
print("".join(content.itertext()))
which gives:
Due to the tight [...] a metabolite is relevant if:
It has comparable [...], or
It has toxicological [...].
@EmilieDel I think the problem in your example is that the first <p>
is the top element and it's not covered by findall
, because it's looking for its sub-elements, but not at the top element itself.
Just for clarity, here's the re-formatted example you gave:
<p id="Par28">
Due to the tight interconnection of drinking water and groundwater,
the latter is submitted to the same limit values (0.1 µg/L
for single substances and 0.5 µg/L for the sum of substances,
Groundwater Directive 2006/118/EC) [<xref ref-type="bibr" rid="CR23">23</xref>].
Although these directives fix the limit values, no criteria for the definition of
relevant and non-relevant metabolites were provided at that stage,
leading to uncertainty for regulators and notifiers. To avoid further misinterpretations,
a definition of the relevance criteria was finally delivered in 2000.
According to the EU DG Sanco Guidance, a metabolite is relevant if:
<list list-type="alpha-lower">
<list-item>
<p id="Par29">
It has comparable biological target activity
(≥50%) to the active substance, or
</p>
</list-item>
<list-item>
<p id="Par30">
It has toxicological properties that are regarded as severe or unacceptable
(e.g., genotoxic, or classified as toxic or very toxic) [<xref ref-type="bibr" rid="CR24">24</xref>].
</p>
</list-item>
</list>
</p>
and to fix the issue it's enough to wrap everything into an arbitrary top-level element, something like this:
text2 = "<xml><p id="Par28">...</p></xml>"
So, starting with your example I get:
>>> text2 = f"<xml>{text}</xml>"
>>> content = ElementTree.fromstring(text2)
>>> for element in content.findall(".//p"):
... print("".join(element.itertext()))
Due to the tight interconnection of drinking water and groundwater,
the latter is submitted to the same limit values (0.1 µg/L
for single substances and 0.5 µg/L for the sum of substances,
Groundwater Directive 2006/118/EC) [23].
Although these directives fix the limit values, no criteria for the definition of
relevant and non-relevant metabolites were provided at that stage,
leading to uncertainty for regulators and notifiers. To avoid further misinterpretations,
a definition of the relevance criteria was finally delivered in 2000.
According to the EU DG Sanco Guidance, a metabolite is relevant if:
It has comparable biological target activity
(≥50%) to the active substance, or
It has toxicological properties that are regarded as severe or unacceptable
(e.g., genotoxic, or classified as toxic or very toxic) [24].
It has comparable biological target activity
(≥50%) to the active substance, or
It has toxicological properties that are regarded as severe or unacceptable
(e.g., genotoxic, or classified as toxic or very toxic) [24].
One can see that we do find all <p>
elements.
This actually also showcases where the duplication comes from. The .//
syntax of XPath is recursive, but so is element.itertext()
. This way you parse the inner paragraphs twice:
element.itertext()
on the outer <p>
<p>
s via findall(".//p")
and then call element.itertext()
on them again.Thanks @pafonta, @Stannislav for the feedback!
To @pafonta:
I agree we could say the problem is coming from itertext()
, in this case it is a bit unfortunate that the children are themselves paragraphs. However, in most of the cases, it is nice to have itertext
to handle all the children (bold, italic, URL, ...) in the paragraph directly with itertext()
. What do you think ?
To @Stannislav:
Sorry, my example was maybe not clear. I totally agree with your comment, it is indeed the reason this PR was opened (and also what is shown in the description of the PR). What I wanted to explain by the second example (the one you mention in your comment) is that if the parent <p>
contains text then only parsing the children <p>
is not enough (what was suggested as solution of the issue).
The second example was there to explain why I think we should only keep the parent <p>
and do element.itertext()
on it to reconstruct the entire paragraph without losing any information. It also means that we can ignore the children as they are parsed once in the parent paragraph. For info, this is in fact the current implementation of this. However, I am aware that it makes the solution/code more complicated, as @pafonta pointed out. So, I am totally open to suggestions!
Moreover, as you pointed out .//
is recursive, I thought first about getting rid of the double-slash in the XPath
. However, this solution seems impossible as the paths vary a lot and it sounds complicated to predict all of them.
Hello @EmilieDel,
I agree we could say the problem is coming from
itertext()
, in this case it is a bit unfortunate that the children are themselves paragraphs. However, in most of the cases, it is nice to haveitertext
to handle all the children (bold, italic, URL, ...) in the paragraph directly withitertext()
. What do you think ?
I see. To do so, using a non
recursive approach - not using findall(".//p")
- should give the expected behaviour. One could iterate through the XML tree instead to do such thing.
I see. To do so, using a non recursive approach - not using findall(".//p") - should give the expected behaviour. One could iterate through the XML tree instead to do such thing.
After discussing with @Stannislav, we will go for that solution (i.e. iterate through the XML tree). I will try to implement this asap!
Thanks a lot for the feedback @Stannislav and @pafonta :)
🐛 Bug description
Currently, the elements of lists are duplicated by the parser
PMCXMLParser
.⚠️ This bug could also happen for
<def-list>
(see below) or, actually, any other tag.There are two possible types of lists:
<list>
and<def-list>
.To reproduce
Example XML of an article with a list in a paragraph:
Parsing of such article file (article.xml):
Duplicated outputs:
Expected behavior
Output: