MartinPaulEve / meTypeset

meTypeset is a tool to convert from Microsoft Word .docx format to NLM/JATS-XML for scholarly/scientific article typesetting.

Lists which are not properly formatted as lists in Word are not detected #6

Closed axfelix closed 10 years ago

axfelix commented 11 years ago

See http://pkp-udev.lib.sfu.ca/parsingdev/output/eeg_comicsans/eeg_comicsans.xml:

<p>
Modern medicine applies variety of recording techniques to the human body:
</p>
<p>- Electrocardiography (ECG, heart).</p>
<p>- Electromyography (EMG, muscular contractions).</p>
<p>- Electroencephalography (EEG, brain).</p>
<p>- Magnetoencephalography (MEG, brain).</p>
<p>- Electroneurography (ENG)</p>
axfelix commented 11 years ago

See other examples of authors' intended formatting (expressed mostly through misused punctuation) being lost here: http://pkp-udev.lib.sfu.ca/parsingdev/output/sodium/sodium.xml

axfelix commented 10 years ago

This seems like an obvious classifier use case after front matter and headers are fixed...

MartinPaulEve commented 10 years ago

@axfelix: I've now addressed list formatting with dashes and will move on to other group formatting.
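For readers following along, the dash handling mentioned above could look roughly like the sketch below. It is not meTypeset's actual code: it uses the standard-library `xml.etree.ElementTree`, and the function name `group_dash_paragraphs` is hypothetical. It assumes list items arrive as sibling `<p>` elements whose text starts with "- ", as in the EEG sample, and wraps each consecutive run in a JATS `<list list-type="bullet">`.

```python
import xml.etree.ElementTree as ET


def group_dash_paragraphs(parent):
    """Wrap runs of consecutive <p> children starting with '- ' in a <list>.

    Hypothetical helper for illustration; not part of meTypeset.
    """
    new_children = []
    current_list = None
    for child in list(parent):
        text = (child.text or "").lstrip()
        if child.tag == "p" and text.startswith("- "):
            if current_list is None:
                # Start a new list at the first dash paragraph of a run.
                current_list = ET.Element("list", {"list-type": "bullet"})
                new_children.append(current_list)
            item = ET.SubElement(current_list, "list-item")
            child.text = text[2:]  # drop the "- " marker
            child.tail = None
            item.append(child)     # reuse the <p> so inline children survive
        else:
            # Any other element ends the current run.
            current_list = None
            new_children.append(child)
    # Replace the parent's children with the regrouped sequence.
    for child in list(parent):
        parent.remove(child)
    parent.extend(new_children)
    return parent


if __name__ == "__main__":
    sec = ET.fromstring(
        "<sec><p>Recording techniques:</p>"
        "<p>- Electrocardiography (ECG, heart).</p>"
        "<p>- Electroencephalography (EEG, brain).</p></sec>"
    )
    print(ET.tostring(group_dash_paragraphs(sec), encoding="unicode"))
```

A single stray "- " paragraph would also become a one-item list here; a real implementation would probably want a minimum run length, which is exactly the kind of decision an "aggression" setting could control.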

I've looked at the sodium test document and, to be honest, there is no way in hell we are going to be able to parse a lot of that. (It did, though, provide a good example for fixing line breaks inside td cells; see commit 88ff28b456e1eefcfe439ada2d50a0832f35929e.) The author has, for example, submitted a vertical table dividing line midway down a page as some kind of floating, drawn line. My view is that there is only so much insanity we can fix... Let me know if there are other elements in there that we /can/ fix and that I am missing.

axfelix commented 10 years ago

:) Fair enough! I'm glad we've gotten as far as we have; reference identification is probably a higher priority for now anyway.

MartinPaulEve commented 10 years ago

Right, recent commits (4161ee933f6f92105e4b404ba3fe84e505270880, 7dadf27f990a6af79d23db0a2e360fc1979f36c1, ea1c36a191fec4e02a7a54feddb4225d9dcda67c and 6c9a70931f33f47bf48930c3e3113797dc380500) add a lot of this functionality.

Lists like (1), (2) etc. are handled.
Lists like -, - are handled.
Lists like [1], [2] etc. are treated as a bibliography if there is only one reference to each. If there is more than one, then it treats them as a footnote set and links the first one to the latter.
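A rough sketch of these rules in Python, for discussion only. The regular expressions and the function names `classify_paragraph` and `classify_bracket_list` are assumptions, not meTypeset's real ones, and the bibliography/footnote test follows one reading of the rule above: if every [n] marker occurs exactly once in the document, the list is taken as a bibliography; if any marker occurs more than once, the set is taken as footnotes.

```python
import re
from collections import Counter

# Hypothetical patterns; the project's actual expressions may differ.
PAREN_ITEM = re.compile(r"^\((\d+)\)\s+")    # "(1) text"
DASH_ITEM = re.compile(r"^-\s+")             # "- text"
BRACKET_ITEM = re.compile(r"^\[(\d+)\]\s+")  # "[1] text"
BRACKET_MARK = re.compile(r"\[(\d+)\]")      # any "[n]" anywhere in the text


def classify_paragraph(text):
    """Return 'paren', 'dash', 'bracket', or None for one paragraph."""
    if PAREN_ITEM.match(text):
        return "paren"
    if DASH_ITEM.match(text):
        return "dash"
    if BRACKET_ITEM.match(text):
        return "bracket"
    return None


def classify_bracket_list(paragraphs):
    """Guess whether [n] items form a bibliography or a footnote set.

    Counts every "[n]" occurrence across all paragraphs, including the
    list items themselves. Returns None if no markers are found.
    """
    counts = Counter()
    for text in paragraphs:
        counts.update(BRACKET_MARK.findall(text))
    if not counts:
        return None
    if all(n == 1 for n in counts.values()):
        return "bibliography"
    return "footnotes"


if __name__ == "__main__":
    print(classify_paragraph("- Electromyography (EMG, muscular contractions)."))
    print(classify_bracket_list(["A claim.", "[1] Smith 2010.", "[2] Jones 2011."]))
    print(classify_bracket_list(["A claim [1].", "[1] A clarifying note."]))
```

Linking the first occurrence of a marker to its footnote is left out of this sketch, since the comment does not say how the link is represented.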

I suppose some sort of "space tolerance" would be a good plan (for "( 1 )"), but beyond that it would be good to talk about how much more is feasible. I'm wary of it getting overly aggressive (we do now have an "aggression" switch).
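One possible shape for that "space tolerance", sketched with Python's `re` module. The pattern and the helper name `parse_paren_item` are hypothetical; the only change from a strict "(1)" matcher is the optional whitespace allowed on either side of the number.

```python
import re

# Accepts "(1)", "( 1 )", "(1 )" and "( 1)" at the start of a paragraph,
# followed by at least one space and then the item text.
TOLERANT_PAREN_ITEM = re.compile(r"^\(\s*(\d+)\s*\)\s+(.*)$")


def parse_paren_item(text):
    """Return (number, item_text) for a parenthesised list item, else None."""
    match = TOLERANT_PAREN_ITEM.match(text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


if __name__ == "__main__":
    print(parse_paren_item("( 1 ) First item"))
    print(parse_paren_item("(see above) not a list item"))
```

Because the group still requires digits only, parenthesised prose such as "(see above)" is not picked up, which keeps the tolerance from becoming overly aggressive.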