elifesciences / elife-tools

Python library for parsing eLife article XML data.
MIT License

Structured abstract parsing #320

Closed gnott closed 4 years ago

gnott commented 4 years ago

Re issue https://github.com/elifesciences/issues/issues/4622

There was an existing test fixture XML, based on a BMJ Open article, the tests for which show how the older, more basic abstract() function omits the section title values and only retains the paragraph content. This is still the case for now.
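To illustrate the behaviour described above, here is a minimal sketch of a flat extraction that keeps paragraph text and drops section titles. It is not the library's actual abstract() implementation; the sample XML and the flat_abstract_text() helper are hypothetical and only mirror the outcome the existing tests show.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample XML, not the BMJ Open-based fixture used in the tests.
SAMPLE = (
    "<abstract>"
    "<sec><title>Objectives</title><p>To test something.</p></sec>"
    "<sec><title>Results</title><p>It worked.</p></sec>"
    "</abstract>"
)


def flat_abstract_text(xml_string):
    """Join the text of every <p> tag, ignoring <title> values entirely."""
    root = ET.fromstring(xml_string)
    paragraphs = ("".join(p.itertext()).strip() for p in root.iter("p"))
    return " ".join(paragraphs)


flat = flat_abstract_text(SAMPLE)
```

In this sketch the section titles ("Objectives", "Results") never reach the output, which is the limitation a structured rendering is meant to address.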

The more recent function abstract_json(), which calls render_abstract_json(), was created to produce abstract content in an eLife JSON format that validates against the RAML schema.

When the abstract's content is rendered with body_block_content_render(), which recursively traverses child tags in the XML, rather than with body_block_content() alone, the output includes section and paragraph blocks in the structured format.
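The recursive idea can be sketched roughly as follows. This is an assumption-laden illustration built on the standard library, not the code of body_block_content_render(); the render_blocks() helper, the sample XML, and the exact dict keys are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample XML, not the actual test fixture from this PR.
SAMPLE = """<abstract>
<sec><title>Background</title><p>Some background text.</p></sec>
<sec><title>Methods</title><p>First paragraph.</p><p>Second paragraph.</p></sec>
</abstract>"""


def render_blocks(parent):
    """Recursively convert child <sec> and <p> tags into block dicts."""
    blocks = []
    for child in parent:
        if child.tag == "sec":
            title = child.find("title")
            blocks.append({
                "type": "section",
                "title": title.text if title is not None else None,
                # Recurse so nested sections and paragraphs are kept.
                "content": render_blocks(child),
            })
        elif child.tag == "p":
            text = "".join(child.itertext()).strip()
            blocks.append({"type": "paragraph", "text": text})
        # Any other tag (for example <title>) is skipped in this sketch.
    return blocks


structured = render_blocks(ET.fromstring(SAMPLE))
```

An abstract containing only <p> tags would pass through the same function and produce a flat list of paragraph blocks, so the structured case is an extension of the simple one rather than a separate code path.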

In the eLife XML example, there is also a <related-object> tag, holding the clinical trial information. For now, it is agreed this can be converted to a paragraph block, and the <related-object> tag itself, when converted to HTML, can be an <a> anchor tag.

I think the parser still reads the @id attribute of the <related-object> tag and adds it to the paragraph block as an attribute. I believe the RAML schema lists no @id attribute for a paragraph, but I think the schema validation will tolerate it if it remains. If the id attribute should be removed from the output, that can be done.

These code changes do not cause any other existing test cases in this library to fail. If we parse XML which has only <p> tags in the <abstract> tag, then the output should remain the same as it was, and it will continue to be valid against the RAML article v2 schema.

In order to support abstracts that also include section blocks, we can introduce this parser change with little risk, and then the adaptations to the RAML schema can continue.

I don't know exactly when structured abstract XML for eLife will appear; I only know that all of this preparatory work must be done before the first structured abstract can be allowed to pass through the workflows and be displayed on the journal.

coveralls commented 4 years ago


Coverage decreased (-0.04%) to 99.728% when pulling f061ffa2a3292cdc507ca591cce84d8080b009f2 on structured-abstract into 045a279275d78464533f84aad4d539e8ba44ff05 on develop.

gnott commented 4 years ago

I merged in the develop branch after merging PR https://github.com/elifesciences/elife-tools/pull/321, which should hopefully fix whatever caused Alfred to mark its tests as failing.

gnott commented 4 years ago

Thanks for looking it over @lsh-0, providing comments and approval!

I'm thinking of holding off on merging for now because there is still an outstanding clarification question: whether the <related-object> will be wrapped in a <p> tag or not. I'm hoping it will be, because then we don't have to treat it as an entirely new paragraph block every time the parser converts one of these tags in a body element. But if the XML as specified in the test fixture is decided to be the final XML (where the <related-object> tag sits directly inside a <sec> tag and is not wrapped in a <p>), then this demonstrates we can support that.

gnott commented 4 years ago

Confirmation has been received that the <related-object> tag in the structured abstracts example will be wrapped in a <p> tag. With the most recent commit I just added here, we no longer need to treat <related-object> tags as block content elements, which I think will be less risky in the future. Since this is a small edit, I'll take the earlier approval of this PR (thanks @lsh-0!) as still applying, and I will merge this PR.
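For reference, a rough sketch of what treating <related-object> as inline content inside a <p> could look like. This is not the code in the commit: the paragraph_block() helper, the sample XML, the example.org URL, the placeholder trial number, and the use of an xlink:href attribute for the link target are all assumptions made for illustration. The sketch also leaves the id attribute out of the output.

```python
import html
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# Hypothetical sample; the URL and trial number are placeholders.
SAMPLE = (
    '<p xmlns:xlink="http://www.w3.org/1999/xlink">Clinical trial number: '
    '<related-object id="CT1" xlink:href="https://example.org/trial/1">'
    "NCT00000000</related-object>.</p>"
)


def paragraph_block(p_tag):
    """Render a <p> tag as one paragraph block, with inline anchors."""
    parts = [html.escape(p_tag.text or "")]
    for child in p_tag:
        inner = html.escape("".join(child.itertext()))
        if child.tag == "related-object":
            # Assumed: the link target is stored in xlink:href.
            href = child.get("{%s}href" % XLINK, "")
            parts.append('<a href="%s">%s</a>' % (html.escape(href, quote=True), inner))
        else:
            # Other inline tags are flattened to plain text in this sketch.
            parts.append(inner)
        # Text following the child tag still belongs to the paragraph.
        parts.append(html.escape(child.tail or ""))
    return {"type": "paragraph", "text": "".join(parts)}


block = paragraph_block(ET.fromstring(SAMPLE))
```

Because the anchor is produced while rendering the enclosing paragraph, no extra block type is needed for <related-object>.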