Vitaliy-1 / JATSParser

JATSParser is intended for integration with Open Journal Systems 3.0+ and transforms JATS XML into various formats
GNU General Public License v3.0

Figures inside a <p> aren't recognized correctly. #19

Open marciuz opened 4 years ago

marciuz commented 4 years ago

It seems that <fig> tags are not recognized correctly when they appear inside a <p>.

According to the JATS documentation, this nesting is allowed (https://jats.nlm.nih.gov/publishing/tag-library/1.1/element/fig.html).

I tried replacing ./fig with .//fig here, but it doesn't seem to work: https://github.com/Vitaliy-1/JATSParser/blob/2e2a8aeddc4a2423e8afcb642cf6e836e6eeaa11/src/JATSParser/Body/Document.php#L118
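To make the difference between the two XPath expressions concrete, here is a minimal sketch in Python (used only for illustration; it is not JATSParser's PHP code, and the sample XML and ids are made up). `./fig` selects only figures that are direct children of the context node, while `.//fig` also selects figures nested at any depth, such as one inside a `<p>`.

```python
import xml.etree.ElementTree as ET

# Hypothetical JATS fragment: one figure directly in <body>, one nested in a <p>.
JATS = """
<body>
  <fig id="f1"><caption><p>Direct child of body</p></caption></fig>
  <p>Some text <fig id="f2"><caption><p>Nested in a paragraph</p></caption></fig> more text.</p>
</body>
"""

body = ET.fromstring(JATS)

# Child axis: only figures that are direct children of <body>.
direct = [fig.get("id") for fig in body.findall("./fig")]

# Descendant axis: figures at any depth, in document order.
nested = [fig.get("id") for fig in body.findall(".//fig")]

print(direct)
print(nested)
```

Note that the descendant query finds the nested figure but says nothing about where it sits relative to the paragraph text, which may be why changing the expression alone is not enough.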

Vitaliy-1 commented 4 years ago

Yeah, I was planning to support only a subset of JATS, namely DAR: https://github.com/substance/dar/blob/master/DarArticle.md

The current JATS Parser object model allows block elements, such as figures, tables and lists, only inside a section or the document body. It also parses the elements of a document sequentially, to preserve the structure.

If I were implementing this feature, I would add an array to the paragraph element, with a getter (e.g., getBlockElements()): https://github.com/Vitaliy-1/JATSParser/blob/2e2a8aeddc4a2423e8afcb642cf6e836e6eeaa11/src/JATSParser/Body/Par.php. The array would hold the paragraph's block elements (e.g., figures and tables), which would then be written to the resulting HTML right after the paragraph in which they appear.

For example, after this line, where the paragraph data is set: https://github.com/Vitaliy-1/JATSParser/blob/2e2a8aeddc4a2423e8afcb642cf6e836e6eeaa11/src/JATSParser/HTML/Document.php#L118, check whether the paragraph contains block elements ($par->getBlockElements()) and, if so, add those elements to the array just after the paragraph. I don't remember whether PHP allows doing that dynamically.

I'll check whether there is an easy way to do this without changing the logic and while keeping the code readable.
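The idea above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the actual PHP classes: `Par`, `get_block_elements()`, `parse_par()`, `render()` and the `BLOCK_TAGS` set are hypothetical names, and only block elements that are direct children of the paragraph are handled.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Assumption: the tags treated as block elements inside a paragraph.
BLOCK_TAGS = {"fig", "table-wrap"}


@dataclass
class Par:
    """Flat paragraph: plain text plus the block elements lifted out of it."""
    text: str = ""
    block_elements: list = field(default_factory=list)

    def get_block_elements(self):
        return self.block_elements


def parse_par(p):
    """Collect the paragraph text and set aside direct-child block elements."""
    par = Par()
    parts = [p.text or ""]
    for child in p:
        if child.tag in BLOCK_TAGS:
            par.block_elements.append(child)          # lifted out of the text
        else:
            parts.append("".join(child.itertext()))   # inline markup: keep its text
        parts.append(child.tail or "")                # text following the child
    par.text = " ".join("".join(parts).split())       # normalize whitespace
    return par


def render(pars):
    """Emit each paragraph, followed by the block elements it contained."""
    html = []
    for par in pars:
        html.append(f"<p>{par.text}</p>")
        for block in par.get_block_elements():
            html.append(f'<div class="{block.tag}" id="{block.get("id")}"></div>')
    return html


p = ET.fromstring('<p>See the chart <fig id="f1"><label>Figure 1</label></fig> for details.</p>')
output = render([parse_par(p)])
print(output)
```

Building a new output list, rather than inserting into the list being iterated, sidesteps the question of modifying an array during iteration.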

Vitaliy-1 commented 4 years ago

Let me know if it works.

I definitely need to refactor JATSParser\Body\Document::getContent() and JATSParser\BodySection::getContent().

Do you have a strong opinion about lists inside a paragraph? They can't simply be placed after the paragraph, because they are semantically linked to the text inside it. One option would be to split the paragraph in two and place the list between the parts.
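A minimal sketch of the splitting option, in Python for illustration only (the function name `split_par_around_lists` and the output tuples are hypothetical, and only lists that are direct children of the paragraph are considered):

```python
import xml.etree.ElementTree as ET


def split_par_around_lists(p):
    """Return ("p", text) and ("list", items) entries in document order."""
    out = []
    buffer = [p.text or ""]

    def flush():
        # Close the paragraph fragment collected so far, if it has any text.
        text = " ".join("".join(buffer).split())
        if text:
            out.append(("p", text))
        buffer.clear()

    for child in p:
        if child.tag == "list":
            flush()  # the text before the list becomes its own paragraph
            items = [
                " ".join("".join(item.itertext()).split())
                for item in child.findall("list-item")
            ]
            out.append(("list", items))
        else:
            buffer.append("".join(child.itertext()))  # inline markup stays in the text
        buffer.append(child.tail or "")               # text after the child
    flush()
    return out


p = ET.fromstring(
    "<p>Three steps are needed: "
    "<list><list-item><p>parse</p></list-item><list-item><p>edit</p></list-item></list>"
    " Each is described below.</p>"
)
result = split_par_around_lists(p)
print(result)
```

The list keeps its position between the two text fragments, so the link between the list and the surrounding sentence is preserved.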

marciuz commented 4 years ago

Hi,

Yes, the code probably needs a general refactoring. I'll tell you what I've found so far:

Please let me know how I can help.

Best regards

Marcello


Vitaliy-1 commented 4 years ago

The idea of placing block elements outside the paragraph arose from the need for compatibility with WYSIWYG editors such as Texture, TinyMCE, or ProseMirror. The JATS XML standard is quite flexible about where tags or mixed elements can be placed, but that flexibility makes machine processing harder. I think that was the reason behind the creation of the DAR subset of JATS XML and the JATS4R initiative.

Regarding metadata, I think it's better to use an object-oriented approach with getters for data extraction. Another possibility would be to create a generic service class that allows simple interactions with the XML's metadata (something similar can be seen in the Laravel framework). I was also considering this approach for the whole document, but the way data is presented in the article's body doesn't allow it, especially where mixed elements inside paragraphs are concerned.
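As a rough illustration of the getter-based approach to metadata, here is a small Python sketch. The class `ArticleMeta` and its methods are hypothetical and do not exist in JATSParser; the element paths follow the usual JATS `front/article-meta` layout, and the sample article is made up.

```python
import xml.etree.ElementTree as ET


class ArticleMeta:
    """Hypothetical read-only accessor for a few JATS metadata fields."""

    def __init__(self, article):
        self._meta = article.find("front/article-meta")

    def get_title(self):
        return self._meta.findtext("title-group/article-title", default="")

    def get_authors(self):
        authors = []
        for contrib in self._meta.findall(".//contrib[@contrib-type='author']"):
            given = contrib.findtext("name/given-names", default="")
            surname = contrib.findtext("name/surname", default="")
            authors.append(f"{given} {surname}".strip())
        return authors


article = ET.fromstring("""
<article>
  <front>
    <article-meta>
      <title-group><article-title>Sample article</article-title></title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
</article>
""")

meta = ArticleMeta(article)
print(meta.get_title(), meta.get_authors())
```

This works for metadata because each field lives at a predictable path; body paragraphs with mixed content have no such fixed shape.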

The current approach to parsing paragraphs is similar to how it's done in OOXML, where paragraphs are flat and contain only text runs. Although figures in OOXML are commonly placed inside a paragraph, they are treated as separate elements. I recently explored ProseMirror and found that it does the same thing: it flattens the paragraph's content. See: https://prosemirror.net/docs/guide/#doc.structure
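To show what flattening means here, the following Python sketch turns nested inline markup into a flat sequence of text runs, each carrying the set of marks that apply to it. It is an illustration only: the `flatten` function and the run representation are assumptions, not JATSParser's or ProseMirror's actual API.

```python
import xml.etree.ElementTree as ET


def flatten(node, marks=()):
    """Return a flat list of (text, marks) runs for a node's inline content."""
    runs = []
    if node.text:
        runs.append((node.text, marks))
    for child in node:
        # Text inside the child carries the parent's marks plus the child's tag.
        runs.extend(flatten(child, marks + (child.tag,)))
        # Text after the child belongs to the parent, so it gets the parent's marks.
        if child.tail:
            runs.append((child.tail, marks))
    return runs


p = ET.fromstring("<p>This is <bold>strong text with <italic>emphasis</italic></bold></p>")
runs = flatten(p)
for text, marks in runs:
    print(repr(text), marks)
```

The tree of nested tags becomes a list in which nesting survives only as marks on each run, which is easier to map onto an editor's document model.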

My aim right now is to create a full JATS XML workflow: parsing the author's manuscript (with Grobid, meTypeset, or my own docxToJats converter), editing it in a WYSIWYG editor, and presenting it on the front end as HTML and PDF. I must confess that my knowledge in this area is limited, and I'm open to suggestions as long as they are in line with the current plan.