elifesciences / XML-mapping

Mapping the XML to new Continuum requirements and build of a new Kitchen sink

<disp-formula> tag nesting #92

Closed gnott closed 5 years ago

gnott commented 5 years ago

I noticed that in the kitchen sink XML the <disp-formula> tags, which I understand mark a formula to be displayed as a block or "call out", are all nested inside <p> tags.

I can most likely recreate the nesting of these tags when generating JATS XML in the decision letter parser library, but it would be less work if the <disp-formula> tags were their own block level element. This is how the output of pandoc seems to be structured, where it actually creates a new <p><disp-formula> when it encounters a block formula in a .docx file.
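For illustration, recreating the nesting could look roughly like the sketch below. It uses only Python's standard-library `xml.etree.ElementTree`; the function name `wrap_block_formulas` and the sample fragment are hypothetical and are not taken from the decision letter parser library. It wraps each block level <disp-formula> in its own new <p>, mirroring the pandoc behaviour described above, and does not try to merge the formula into a neighbouring paragraph.

```python
import xml.etree.ElementTree as ET


def wrap_block_formulas(root):
    """Wrap every <disp-formula> that is not a direct child of <p> in a new <p>.

    Hypothetical helper for illustration only.
    """
    # Snapshot the elements first so the tree is not modified while iterating it.
    for parent in list(root.iter()):
        if parent.tag == "p":
            continue  # already nested inside a paragraph
        for index, child in enumerate(list(parent)):
            if child.tag == "disp-formula":
                wrapper = ET.Element("p")
                # Keep any trailing whitespace or text after the formula outside the new <p>.
                wrapper.tail = child.tail
                child.tail = None
                wrapper.append(child)
                # One-for-one replacement, so later indices stay valid.
                parent[index] = wrapper
    return root


# Assumed sample input: a block level formula between two paragraphs.
sample = ET.fromstring(
    '<sec><p>Before</p><disp-formula id="equ1"/><p>After</p></sec>'
)
wrap_block_formulas(sample)
print(ET.tostring(sample, encoding="unicode"))
```

After the call, the <sec> in the sample has three <p> children and the formula sits inside the middle one.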

Using XQuery, I could find only two eLife XML files that have <disp-formula> tags that are not nested inside a <p> tag. The DOI and @id attribute of each such tag are listed below:

10.7554/eLife.02951, equ3
10.7554/eLife.02951, equ4
10.7554/eLife.00522, equ2
10.7554/eLife.00522, equ3
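The XQuery itself is not shown in this thread. As a rough equivalent, a check of this kind could be sketched in Python with the standard-library `xml.etree.ElementTree`. The function name `find_unnested_formulas` and the sample XML are assumptions made for illustration, not the query that was actually run.

```python
import xml.etree.ElementTree as ET


def find_unnested_formulas(xml_text):
    """Return the @id of each <disp-formula> whose parent is not a <p>.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    ids = []
    # ElementTree elements have no parent pointer, so visit every element
    # and inspect its direct children instead.
    for parent in root.iter():
        if parent.tag == "p":
            continue  # formulas under <p> are the nested variant
        for child in parent:
            if child.tag == "disp-formula":
                ids.append(child.get("id"))
    return ids


# Assumed sample: one nested formula (equ1) and one block level formula (equ2).
SAMPLE = """<article><body><sec>
<p>Text <disp-formula id="equ1"/> continues.</p>
<disp-formula id="equ2"/>
</sec></body></article>"""

print(find_unnested_formulas(SAMPLE))  # prints ['equ2']
```

Running a function like this over each article file would yield DOI and @id pairs like the ones listed above, assuming the DOI is read separately from the article metadata.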

On the call with Melissa this morning, I wondered whether nesting <disp-formula> inside a <p> tag was something particular to how eLife is doing JATS, or whether this is also present in other publishers' JATS. It was suggested I check with you @FAtherden-eLife for your observations on non-eLife files you may have looked at.

I also wonder: since a <disp-formula> tag as a block level tag (not nested inside a <p> tag) is valid when generating eLife JSON, and is presumably valid against the PMC schema, would a block level <disp-formula> tag be acceptable when generating decision letter XML?

If both block level and nested <disp-formula> tags are valid, should a block level <disp-formula> tag be included in the kitchen sink XML?

cc @Melissa37 too.

fred-atherden commented 5 years ago

Thanks @gnott,

Yes, disp-formula is allowed outside of p (as a child of say, sec) as a block element and within p in JATS, as you say.

It's a conscious choice on our part to include them in p. This is because many renderers output an indent before the start of a new paragraph (this occurs in our PDFs, for example). This can be problematic if disp-formula elements are (or have to be) captured as blocks, since many authors continue their paragraph after a formula. This is why we've decided to always place them in p. My understanding is that Libero Editor will be doing the same.

(I think it actually makes more semantic sense in the content anyway, given that they are most often treated like inline equations which have simply been pulled out and labelled to be referenced later in the content.)

I have an archive of >1,000,000 PMC articles (118,171 articles have disp-formula)

Obviously some articles contain both variants, but the archive seems to indicate that the latter (disp-formula within p) is used more often. It's the same in other content (Hindawi, Wellcome Open Research, F1000, bioRxiv, etc.).

Having said all this, we don't include DL/AR in our PDFs, and Continuum doesn't output indents before paragraphs, so our content would be currently unaffected either way (in terms of presentation). I think it would definitely be preferable to always capture them in p, in the interests of consistency across our corpus (and if we were to indent paras in HTML or include DL/AR in PDFs in the future).

However, if it's loads more work then it shouldn't be a problem to treat them as blocks in the DL/AR. I don't know the answer to this question though:

I also wonder, since when the <disp-formula> tag exists as a block level tag (not nested inside a <p> tag) is valid when generating eLife JSON ...

So it might be possible that work would be needed on that end if we went down that route.

Let me know if anything is unclear!

fred-atherden commented 5 years ago

cc @JGilbert-eLife, in case he has anything to add (or feels strongly another way).

JGilbert-eLife commented 5 years ago

Since the D/R WILL be displayed in Libero Editor, however, it may be that we'll end up needing to contain all <disp-formula> tags in <p>s anyway . . .

Beyond that, not really much to add!

gnott commented 5 years ago

That satisfies my curiosity as to why they are inside the <p> tags. I don't anticipate it being loads more work to reproduce the output in this way; I mainly wanted to understand why we are doing it. Good to know that other software will also conform to the same format.

I also agree that, for consistency, it is best to recreate the prevailing convention.

It's probably ok to close this issue, all answered so quickly - thanks!