MartinPaulEve / meTypeset

meTypeset is a tool to convert from Microsoft Word .docx format to NLM/JATS-XML for scholarly/scientific article typesetting.

Table/Figure captions often broken when Tables and Figures follow ref-list #109

Open axfelix opened 7 years ago

axfelix commented 7 years ago

Hi Martin,

Something we've been noticing across our corpus and would like to improve is the very low accuracy of tagging table and figure captions in meTypeset when the tables and figures are deliberately positioned at the end of the Word document by the author to meet some journals' formatting requirements.

Sometimes the result is the table or figure caption being tagged as its own paragraph, which I assume can be improved on its own by e.g. making the caption classifier more aggressive or adding another set of linguistic cues as has been done for https://github.com/MartinPaulEve/meTypeset/tree/master/language

However, I'm often seeing the captions being subsumed into the ref-list (though the tables themselves are always detected properly), and this is especially obvious when there are a few tables or figures in a row and only the first one has its title "broken."

What would be the best way for me to address this? I know the bibliography classifier is run before the caption classifiers in https://github.com/MartinPaulEve/meTypeset/blob/master/bin/nlmprocessor.py, and I believe it tries specifically to carve out the last block of unstructured text, so I'd want to be careful to regression test any changes we make to this behaviour.

axfelix commented 7 years ago

It seems like this should be caught by stuff like https://github.com/MartinPaulEve/meTypeset/blob/master/bin/teimanipulate.py#L438 -- are we dropping linebreaks somehow in these cases?

MartinPaulEve commented 7 years ago

Hi Alex,

As ever, if you could upload some Word files that are as minimal as possible and that demonstrate this (and ideally, though not necessarily, some Robot test cases), we can work out how to fix it...

Best wishes,

Martin

-- Professor Martin Paul Eve Chair of Literature, Technology and Publishing Birkbeck, University of London

T: 0203 073 8420 E: martin.eve@bbk.ac.uk W: https://www.martineve.com R: 416, 43 Gordon Square, London, WC1H 0PD

Books: https://www.martineve.com/books/ Articles: https://www.martineve.com/c-v/

Series Editor: New Horizons in Contemporary Writing (Bloomsbury) Director, Birkbeck Centre for Technology and Publishing Founder, Open Library of the Humanities (https://www.openlibhums.org) Chief Editor, Orbit (https://www.pynchon.net) Senior Online Editor, Alluvium, (http://www.alluvium-journal.org)

axfelix commented 7 years ago

table_after_refs.docx

Here's a start, I'll see if I can trim it down further.

axfelix commented 7 years ago

table_after_refs_minimal.docx

It actually breaks considerably worse this way...

Definitely looks like the bibliography classifier is being overzealous, but there's not much illuminating in the debug output.

axfelix commented 7 years ago

Just prodding around at this point, but after looping through elements_to_parse in teimanipulate and printing their children, it looks like there are definitely some elements getting added to the ref-list that contain only table rows, in the latter test example:

$ metypeset table_after_refs_minimal.docx test
[<Element {http://www.tei-c.org/ns/1.0}ref at 0x4022d88>]
[<Element {http://www.tei-c.org/ns/1.0}hi at 0x4022d88>, <Element {http://www.tei-c.org/ns/1.0}hi at 0x4022c48>]
[<Element {http://www.tei-c.org/ns/1.0}row at 0x4022d88>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4022b48>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4022c48>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4022948>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031048>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031788>, <Element {http://www.tei-c.org/ns/1.0}row at 0x40310c8>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031088>, <Element {http://www.tei-c.org/ns/1.0}row at 0x40311c8>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031108>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031148>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031188>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031588>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031208>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031288>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031348>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031448>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031548>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031fc8>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031888>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4031f88>, <Element {http://www.tei-c.org/ns/1.0}row at 0x4036048>]
[<Element {http://www.tei-c.org/ns/1.0}hi at 0x4022d88>, <Element {http://www.tei-c.org/ns/1.0}hi at 0x4022b48>]
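For anyone wanting to reproduce this kind of inspection outside meTypeset, here is a minimal sketch of the idea: walk a list of candidate elements, print the local names of their children, and flag any candidate whose children are all TEI rows. It is not the code in teimanipulate.py. The sample XML, the `local_name` and `only_table_rows` helpers, and the way `elements_to_parse` is built are all assumptions made for illustration, and it uses the standard library's ElementTree rather than lxml so that it runs on its own.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"

# Hypothetical stand-in for the TEI that reaches the bibliography classifier.
SAMPLE = f"""
<list xmlns="{TEI}">
  <item><ref>Smith 2010</ref></item>
  <item><hi>Jones</hi><hi>2012</hi></item>
  <item><row/><row/><row/></item>
</list>
"""


def local_name(element):
    """Strip the '{namespace}' prefix from an element tag."""
    return element.tag.rsplit("}", 1)[-1]


def only_table_rows(element):
    """True when the element has children and every one of them is a TEI row."""
    children = list(element)
    return bool(children) and all(child.tag == f"{{{TEI}}}row" for child in children)


root = ET.fromstring(SAMPLE)
elements_to_parse = list(root)  # assumption: not how meTypeset builds this list
suspects = [el for el in elements_to_parse if only_table_rows(el)]

for el in elements_to_parse:
    marker = "<- rows only" if el in suspects else ""
    print([local_name(child) for child in el], marker)
```

A candidate made up entirely of rows is almost certainly a table body rather than a reference, which is what the third list in the output above suggests.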

MartinPaulEve commented 7 years ago

OK, so the reference/bibliography classifier assumes, at the moment, that references and the bibliography are always the last items in a document.

If they are not, then behaviour is, at the moment, undefined.

The challenge for changing this is that the assumption that it's the last item allows us to continue parsing references even in cases where it doesn't necessarily look like a citation (thus giving better results on biblio parsing).
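To make the trade-off concrete, here is a toy sketch of a "references run to the end of the document" rule. It is not meTypeset's classifier: the `CITATION_START` pattern, the function name, and the sample paragraphs are hypothetical, and the real code works on TEI elements rather than plain strings.

```python
import re

# Hypothetical cue: "Surname, I. (YYYY)" at the start of a paragraph.
CITATION_START = re.compile(r"^[A-Z][\w'-]+, [A-Z]\. \(\d{4}\)")


def split_body_and_references(paragraphs):
    """Treat everything from the first citation-like paragraph onwards as references."""
    for index, text in enumerate(paragraphs):
        if CITATION_START.match(text):
            return paragraphs[:index], paragraphs[index:]
    return paragraphs, []


paragraphs = [
    "Introduction text.",
    "Smith, J. (2010). A title.",
    "Anonymous pamphlet, n.d.",  # does not look like a citation, but is one
    "Table 1: Sample sizes",     # not a citation at all
]

body, references = split_body_and_references(paragraphs)
print(body)
print(references)
```

Under this rule the oddly formatted "Anonymous pamphlet" entry is kept in the reference list, which is the benefit described above, but the trailing table caption is swallowed for exactly the same reason.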

If this is coming up a lot, we will have to think about this tho.

Best wishes,

Martin

axfelix commented 7 years ago

Yeah, that's what I thought. As a starting point for fixing this, should I be trying to prevent problematic items from being added to that list at all? Or remove them once they're there? I'm not sure which has more implications for the reference classifier "leaving them alone".

We are seeing several documents submitted to some journals that always have tables and figures after references.

axfelix commented 7 years ago

Hey Martin,

Any chance you can revisit this with me?

MartinPaulEve commented 7 years ago

Hi Alex,

Will get back to you as soon as I can -- later this week -- on this. Could you please, in the meantime, just update me on what's needed here (and is there a test document?)

Best wishes,

Martin

axfelix commented 7 years ago

Sure! The two test documents I uploaded earlier in this thread (a couple comments up) should hopefully be illuminating and work for this purpose -- basically, the bibliography classifier needs to not ruin parsing of tables (especially table captions) that are located below the bibliography.

I'll be on vacation the rest of the week myself and at a conference next week, but let me know what I can do to help further.

MartinPaulEve commented 7 years ago

OK, so I've looked at this a bit further, and I am not quite sure what we should do with it. Should tables be appended to the end of the body? Obviously, JATS separates out the ref-list from the body. We assume that once we've got to the ref-list we're done with the body.

If we want to include this in the body, we need to add special handling for content that should never occur in a ref-list; that handling would need to cover tables.
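One way to picture that special handling is a post-processing pass that moves anything on a "never in a ref-list" list back to the end of the body. The sketch below is only an illustration under several assumptions: the `NEVER_IN_REF_LIST` set and the `relocate_non_references` function are hypothetical names, the input is a simplified JATS-like tree rather than the TEI meTypeset classifies, and it uses the standard library's ElementTree so it runs on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical blacklist of element names that should never sit in a ref-list.
NEVER_IN_REF_LIST = {"table-wrap", "fig"}

SAMPLE = """
<article>
  <body><p>Text.</p></body>
  <back>
    <ref-list>
      <ref id="R1"/>
      <table-wrap id="T1"/>
      <ref id="R2"/>
    </ref-list>
  </back>
</article>
"""


def relocate_non_references(article):
    """Move blacklisted children of the ref-list to the end of the body."""
    body = article.find("body")
    ref_list = article.find("back/ref-list")
    moved = []
    # Iterate over a copy, because the ref-list is modified inside the loop.
    for child in list(ref_list):
        if child.tag in NEVER_IN_REF_LIST:
            ref_list.remove(child)
            body.append(child)
            moved.append(child.get("id"))
    return moved


article = ET.fromstring(SAMPLE)
moved_ids = relocate_non_references(article)
print(moved_ids)
print([child.tag for child in article.find("body")])
print([child.tag for child in article.find("back/ref-list")])
```

Removing items after the fact leaves the "keep going to the end" behaviour of the reference classifier untouched. It does nothing for captions that were tagged as ordinary references, since those carry no blacklisted element name.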

Let me know your thoughts.

axfelix commented 7 years ago

I think appending it to the body is a good idea. Would the blacklist implementation you're proposing be sufficient to catch table captions if they precede the table itself and otherwise look like "normal" paragraphs? It's probably not the most salient issue, but it is what led to us looking into this in the first place...
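To illustrate the caption case, here is a rough sketch that promotes a paragraph to a caption only when it opens with a caption cue and is immediately followed by a table. Everything in it is hypothetical: the `CAPTION_CUE` pattern, the `(kind, text)` block representation, and the `tag_captions` function are stand-ins, not meTypeset's caption classifier.

```python
import re

# Hypothetical cue: "Table 1", "Figure 2", "Fig. 3" at the start of a paragraph.
CAPTION_CUE = re.compile(r"^(Table|Figure|Fig\.)\s+\d+", re.IGNORECASE)


def tag_captions(blocks):
    """Retag a 'p' block as 'caption' when it has a cue and precedes a table."""
    tagged = []
    for index, (kind, text) in enumerate(blocks):
        next_kind = blocks[index + 1][0] if index + 1 < len(blocks) else None
        if kind == "p" and next_kind == "table" and CAPTION_CUE.match(text):
            kind = "caption"
        tagged.append((kind, text))
    return tagged


blocks = [
    ("p", "Smith, J. (2010). A title."),
    ("p", "Table 1. Sample sizes"),
    ("table", ""),
    ("p", "Table 2 shows that the effect persists."),
]

tagged = tag_captions(blocks)
print([kind for kind, _ in tagged])
```

Requiring both the cue and the adjacent table keeps ordinary sentences that merely begin with "Table 2" from being retagged, at the cost of missing captions that are separated from their table.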