BlueBrain / Search

Blue Brain text mining toolbox for semantic search and structured information extraction
https://blue-brain-search.readthedocs.io
GNU Lesser General Public License v3.0
40 stars 10 forks source link

Fix parsing of <list> for JATS XML #496

Closed pafonta closed 2 years ago

pafonta commented 2 years ago

🐛 Bug description

Currently, the elements of lists are duplicated by the parser PMCXMLParser.

⚠️ This bug could also happen for <def-list> (see below) or, actually, any other tag.

There are two possible types of lists: <list> and <def-list>.

To reproduce

Example XML of an article with a list in a paragraph:

<article>
    <body>
        <p>
            <list list-type="bullet">
                <list-item>
                    <p>List item 1.</p>
                </list-item>
                <list-item>
                    <p>List item 2.</p>
                </list-item>
            </list>
        </p>
    </body>
</article>

Parsing of such article file (article.xml):

from bluesearch.database.article import JATSXMLParser

parser = JATSXMLParser("article.xml")
for paragraph in parser.paragraphs:
  print(paragraph)

Duplicated outputs:

('', 'List item 1.\n                List item 2.')
('', 'List item 1.')
('', 'List item 2.')

Expected behavior

Output:

('', 'List item 1.')
('', 'List item 2.')
Stannislav commented 2 years ago

Like dicussed in the meeting, maybe fixing #425 will automatically fix this duplication issue.

EmilieDel commented 2 years ago

After analysing this issue, I am wondering if we should not keep everything into one paragraph.

Concrete example: Note: text comes from /raid/projects/bbs/pmc/comm_use/Environ_Sci_Eur/PMC5602039.nxml.

from defusedxml import ElementTree

text = """<p id="Par28">Due to the tight interconnection of drinking water and groundwater,
 the latter is submitted to the same limit values (0.1&#x000a0;&#x000b5;g/L 
for single substances and 0.5&#x000a0;&#x000b5;g/L for the sum of substances, 
Groundwater Directive 2006/118/EC) [<xref ref-type="bibr" rid="CR23">23</xref>]. 
Although these directives fix the limit values, no criteria for the definition of 
relevant and non-relevant metabolites were provided at that stage, 
leading to uncertainty for regulators and notifiers. To avoid further misinterpretations, 
a definition of the relevance criteria was finally delivered in 2000. 
According to the EU DG Sanco Guidance, a metabolite is relevant if:
<list list-type="alpha-lower"><list-item><p id="Par29">It has comparable biological target activity 
(&#x02265;50%) to the active substance, or</p></list-item><list-item><p id="Par30">
It has toxicological properties that are regarded as severe or unacceptable
 (e.g., genotoxic, or classified as toxic or very toxic) [<xref ref-type="bibr" rid="CR24">24</xref>].
</p></list-item></list></p>"""
for p in ElementTree.fromstring(test).findall(".//p"):
    print("".join(p.itertext()))

Outputs:

It has comparable biological target activity (≥50%) to the active substance, or
It has toxicological properties that are regarded as severe or unacceptable (e.g., genotoxic, or classified as toxic or very toxic) [24].

Why ?

pafonta commented 2 years ago

Hi @EmilieDel,

I think the issue is coming from the use of itertext().

Indeed, using it as intended (doc) gives the expected results:

content = ElementTree.fromstring(text)
print("".join(content.itertext()))

which gives:

Due to the tight [...] a metabolite is relevant if:
It has comparable [...], or
It has toxicological [...].
Stannislav commented 2 years ago

@EmilieDel I think the problem in your example is that the first <p> is the top element and it's not covered by findall, because it's looking for its sub-elements, but not at the top element itself.

Just for clarity, here's the re-formatted example you gave:

<p id="Par28">
Due to the tight interconnection of drinking water and groundwater,
 the latter is submitted to the same limit values (0.1&#x000a0;&#x000b5;g/L 
for single substances and 0.5&#x000a0;&#x000b5;g/L for the sum of substances, 
Groundwater Directive 2006/118/EC) [<xref ref-type="bibr" rid="CR23">23</xref>]. 
Although these directives fix the limit values, no criteria for the definition of 
relevant and non-relevant metabolites were provided at that stage, 
leading to uncertainty for regulators and notifiers. To avoid further misinterpretations, 
a definition of the relevance criteria was finally delivered in 2000. 
According to the EU DG Sanco Guidance, a metabolite is relevant if:
    <list list-type="alpha-lower">
        <list-item>
            <p id="Par29">
            It has comparable biological target activity 
            (&#x02265;50%) to the active substance, or
            </p>
        </list-item>
        <list-item>
            <p id="Par30">
            It has toxicological properties that are regarded as severe or unacceptable
             (e.g., genotoxic, or classified as toxic or very toxic) [<xref ref-type="bibr" rid="CR24">24</xref>].
            </p>
        </list-item>
    </list>
</p>

and to fix the issue it's enough to wrap everything into an arbitrary top-level element, something like this:

text2 = "<xml><p id="Par28">...</p></xml>"

So, starting with your example I get:

>>> text2 = f"<xml>{text}</xml>"
>>> content = ElementTree.fromstring(text2)
>>> for element in content.findall(".//p"):
...     print("".join(element.itertext()))
Due to the tight interconnection of drinking water and groundwater,
 the latter is submitted to the same limit values (0.1 µg/L 
for single substances and 0.5 µg/L for the sum of substances, 
Groundwater Directive 2006/118/EC) [23]. 
Although these directives fix the limit values, no criteria for the definition of 
relevant and non-relevant metabolites were provided at that stage, 
leading to uncertainty for regulators and notifiers. To avoid further misinterpretations, 
a definition of the relevance criteria was finally delivered in 2000. 
According to the EU DG Sanco Guidance, a metabolite is relevant if:
It has comparable biological target activity 
(≥50%) to the active substance, or
It has toxicological properties that are regarded as severe or unacceptable
 (e.g., genotoxic, or classified as toxic or very toxic) [24].

It has comparable biological target activity 
(≥50%) to the active substance, or

It has toxicological properties that are regarded as severe or unacceptable
 (e.g., genotoxic, or classified as toxic or very toxic) [24].

One can see that we do find all <p> elements.

This actually also showcases where the duplication comes from. The .// syntax of XPath is recursive, but so is element.itertext(). This way you parse the inner paragraphs twice:

EmilieDel commented 2 years ago

Thanks @pafonta, @Stannislav for the feedback!

To @pafonta:

I agree we could say the problem is coming from itertext(), in this case it is a bit unfortunate that the children are themselves paragraphs. However, in most of the cases, it is nice to have itertext to handle all the children (bold, italic, URL, ...) in the paragraph directly with itertext(). What do you think ?

To @Stannislav:

Sorry, my example was maybe not clear. I totally agree with your comment, it is indeed the reason this PR was opened (and also what is shown in the description of the PR). What I wanted to explain by the second example (the one you mention in your comment) is that if the parent <p> contains text then only parsing the children <p> is not enough (what was suggested as solution of the issue).

The second example was there to explain why I think we should only keep the parent <p> and do element.itertext() on it to reconstruct the entire paragraph without losing any information. It also means that we can ignore the children as they are parsed once in the parent paragraph. For info, this is in fact the current implementation of this. However, I am aware that it makes the solution/code more complicated, as @pafonta pointed out. So, I am totally open to suggestions!

Moreover, as you pointed out .// is recursive, I thought first about getting rid of the double-slash in the XPath. However, this solution seems impossible as the paths vary a lot and it sounds complicated to predict all of them.

pafonta commented 2 years ago

Hello @EmilieDel,

I agree we could say the problem is coming from itertext(), in this case it is a bit unfortunate that the children are themselves paragraphs. However, in most of the cases, it is nice to have itertext to handle all the children (bold, italic, URL, ...) in the paragraph directly with itertext(). What do you think ?

I see. To do so, using a non recursive approach - not using findall(".//p") - should give the expected behaviour. One could iterate through the XML tree instead to do such thing.

EmilieDel commented 2 years ago

I see. To do so, using a non recursive approach - not using findall(".//p") - should give the expected behaviour. One could iterate through the XML tree instead to do such thing.

After discussing with @Stannislav, we will go for that solution (i.e. iterate through the XML tree). I will try to implement this asap!

Thanks a lot for the feedback @Stannislav and @pafonta :)