MartinPaulEve / meTypeset

meTypeset is a tool to convert from Microsoft Word .docx format to NLM/JATS-XML for scholarly/scientific article typesetting.

Graphic elements contain text that isn't wrapped in label or caption #115

Open axfelix opened 6 years ago

axfelix commented 6 years ago

Getting invalid JATS, with plaintext that should be wrapped in a caption element, as the value of graphic, as below:

<fig position="float" orientation="portrait"><graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="media/image1.jpeg" position="float" orientation="portrait" xlink:type="simple"/>Fig. 3. The structure of a multidimensional control system for ceramsite burning: EM &#8211; an electromechanical part; <graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="media/image2.wmf" position="float" orientation="portrait" xlink:type="simple"/>&#8211; a vector specifying exposure; D &#8211; a temperature sensor</fig>

From this doc: 1339-5501-1-LE.docx

axfelix commented 6 years ago

Am guessing it's an edge case around

https://github.com/MartinPaulEve/meTypeset/blob/master/bin/captionclassifier.py#L193

but not too sure what's happening here...

MartinPaulEve commented 6 years ago

Thanks for this, Alex -- and for the minimal test case.

I'll take a look at the weekend!

M

On 16/01/18 20:42, axfelix wrote:

From this doc: 1339-5501-1-LE.docx https://github.com/MartinPaulEve/meTypeset/files/1636721/1339-5501-1-LE.docx

MartinPaulEve commented 6 years ago

Hi Alex,

OK, so I've done some investigation of the problem here and have got this far:

<fig position="float" orientation="portrait"><graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="media/image1.jpeg" id="IDd73b995a-a3f3-4940-9d03-e8db274d85f9" position="float" orientation="portrait" xlink:type="simple"><label>Fig</label><caption><p>3 The structure of a multidimensional control system for ceramsite burning: EM &#8211; an electromechanical part;</p></caption></graphic><graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="media/image2.png" position="float" orientation="portrait" xlink:type="simple"/>&#8211; a vector specifying exposure; D &#8211; a temperature sensor</fig>

The problem here is that the caption contains an image. So, unfortunately, the caption is split into two tail blocks across two different elements.
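
To illustrate the "tail" behaviour described above, here is a minimal sketch using the standard library's `xml.etree.ElementTree`. It is not meTypeset's own code: the sample is a simplified version of the reported output (namespaces and most attributes dropped, en dashes replaced by hyphens), and `graphic_tails` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the <fig> reported in this issue (an assumption,
# not the exact tool output).
SAMPLE = (
    "<fig>"
    '<graphic href="media/image1.jpeg"/>'
    "Fig. 3. The structure of a control system: EM - an electromechanical part; "
    '<graphic href="media/image2.wmf"/>'
    "- a vector specifying exposure; D - a temperature sensor"
    "</fig>"
)


def graphic_tails(fig):
    """Return (href, tail text) for each graphic directly inside fig."""
    return [(g.get("href"), g.tail or "") for g in fig.findall("graphic")]


fig = ET.fromstring(SAMPLE)
for href, tail in graphic_tails(fig):
    # Each half of the caption hangs off a different graphic element.
    print(href, "->", repr(tail))
```

The caption text is not a child of `fig` in its own right: the first half is the tail of the first graphic and the second half is the tail of the second, so any logic that only inspects one element's tail sees half a caption.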

I'm not really sure that we can fix this; are images even allowed in image captions?

Any thoughts welcome.

axfelix commented 6 years ago

Oh boy. It looks like there are technically valid ways to include rich media in captions (either through inline-graphic or alternatives), but ... it's not clear that's the intended behaviour in this or in any other case we'll see.
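
As a rough sketch of the inline-graphic route, assuming the first graphic is the figure image and every later graphic belongs inside the caption text, something like the following could fold the pieces into one caption paragraph. `fold_into_caption` and the simplified sample are hypothetical, this is not what meTypeset does, and it does not try to separate the "Fig. 3" label from the caption.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the <fig> reported in this issue (an assumption).
SAMPLE = (
    "<fig>"
    '<graphic href="media/image1.jpeg"/>'
    "Fig. 3. The structure of a control system: EM - an electromechanical part; "
    '<graphic href="media/image2.wmf"/>'
    "- a vector specifying exposure; D - a temperature sensor"
    "</fig>"
)


def fold_into_caption(fig):
    """Move all trailing text into one caption paragraph.

    The first graphic stays as the figure image; later graphics become
    inline-graphic elements inside the caption paragraph.
    """
    graphics = fig.findall("graphic")
    if not graphics:
        return fig
    first, rest = graphics[0], graphics[1:]

    caption = ET.Element("caption")
    para = ET.SubElement(caption, "p")
    para.text = first.tail or ""
    first.tail = None

    for g in rest:
        fig.remove(g)             # the tail travels with the element
        g.tag = "inline-graphic"  # attributes may need filtering for validity
        para.append(g)

    fig.insert(0, caption)        # caption precedes graphic in a fig
    return fig


print(ET.tostring(fold_into_caption(ET.fromstring(SAMPLE)), encoding="unicode"))
```

Whether this matches the author's intent for a given document is exactly the open question raised above, so it would probably need to be opt-in rather than the default.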

I'd be tempted to just insert </fig><fig> in the middle any time we see </graphic><graphic>, to be honest...
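
A minimal sketch of that splitting idea, again with `xml.etree.ElementTree` and a simplified sample: each graphic gets its own fig, and whatever text trailed it becomes that fig's caption. `split_fig` is a hypothetical helper, not part of meTypeset.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the <fig> reported in this issue (an assumption).
SAMPLE = (
    '<fig position="float">'
    '<graphic href="media/image1.jpeg"/>'
    "Fig. 3. The structure of a control system: EM - an electromechanical part; "
    '<graphic href="media/image2.wmf"/>'
    "- a vector specifying exposure; D - a temperature sensor"
    "</fig>"
)


def split_fig(fig):
    """Return one new fig per graphic, with its trailing text as the caption."""
    figs = []
    for g in fig.findall("graphic"):
        new_fig = ET.Element("fig", dict(fig.attrib))
        text = (g.tail or "").strip()
        if text:
            caption = ET.SubElement(new_fig, "caption")
            ET.SubElement(caption, "p").text = text
        # Copy the graphic's attributes; the tail is deliberately dropped.
        ET.SubElement(new_fig, "graphic", dict(g.attrib))
        figs.append(new_fig)
    return figs


for f in split_fig(ET.fromstring(SAMPLE)):
    print(ET.tostring(f, encoding="unicode"))
```

This yields well-formed figures, but for the document in this issue the second fig ends up captioned with the back half of a sentence, so the output would be structurally tidier without being semantically right.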

axfelix commented 6 years ago

Jaiden Dembo.docx

another example, should be slightly less problematic to fix

(not sure why we're seeing more of these lately)