NCEAS / metacat

Data repository software that helps researchers preserve, share, and discover data
https://knb.ecoinformatics.org/software/metacat
GNU General Public License v2.0

EMLText rendering issue in the EML MetacatUI stylesheets for itemizedList inside a para element #1224

Open laurenwalker opened 6 years ago

laurenwalker commented 6 years ago

When an itemizedList is inside a para, the HTML output is <p><ul>...</ul></p>, which the browser renders incorrectly.

Take this example HTML:

<html>
  <body>
    <p> first paragraph
      <ul>
        <li>list item</li>
      </ul>
      second paragraph
    </p>
  </body>
</html>

and open it in your browser and notice that the browser attempts to render it as:

<p>first paragraph</p>
<ul><li>list item</li></ul>
"second paragraph"
<p></p>

In other words, the browser closes the paragraph before starting the unordered list. When it later reaches the now-unmatched closing </p> tag, it inserts an opening <p> to pair with it, which produces an empty paragraph. The "second paragraph" text is left as "orphaned" text outside any paragraph element.
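
To reproduce this without a browser, here is a rough Python sketch that mimics only the two parsing rules involved: a block-level start tag implicitly closes an open <p>, and a stray </p> produces an empty paragraph. The class name PAutoCloser, the function simulate, and the short CLOSES_P tag list are made up for illustration; a real browser parser does far more than this, and attributes are dropped here.

```python
from html.parser import HTMLParser

# Illustrative subset of elements whose start tag implicitly closes an
# open <p>; the HTML spec lists many more.
CLOSES_P = {"ul", "ol", "div", "p", "table", "pre", "blockquote"}


class PAutoCloser(HTMLParser):
    """Re-serializes tags, applying only the <p> auto-closing rules."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.p_open = False

    def handle_starttag(self, tag, attrs):
        if tag in CLOSES_P and self.p_open:
            self.out.append("</p>")  # implied end tag for the open paragraph
            self.p_open = False
        if tag == "p":
            self.p_open = True
        self.out.append(f"<{tag}>")  # attributes are ignored in this sketch

    def handle_endtag(self, tag):
        if tag == "p":
            if not self.p_open:
                self.out.append("<p>")  # stray </p> yields an empty paragraph
            self.p_open = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace for readability
        if text:
            self.out.append(text)


def simulate(html):
    """Return the markup as this simplified tree builder would see it."""
    parser = PAutoCloser()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(simulate("<p> first paragraph <ul><li>list item</li></ul> second paragraph </p>"))
```

The printed markup has the same shape as the browser output above: a closed first paragraph, the list, bare text, and a trailing empty <p></p>.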

This is only an issue because it affects the way CSS is applied to the text (note the larger font size):

[screenshot: 2018-03-15 at 3:57:11 PM]

The EML snippet:

<abstract>
  <section>
    <para>
These files contain data representing the periodic plant measures of all species within each plot in a CSV format. The data presented are phenological development (date of leaf bud burst, inflorescence emergence, flower bud, flower opening, flower withering, seed development, seed dispersal, and senescence), seasonal growth (length of leaf, and length of inflorescence), seasonal flowering (number of inflorescences in flower within a plot), occurrence of events (yes or no for leaf, inflorescence, bud, flower, and seed), and annual growth and reproductive effort (number of leaves, diameter of rosette, number of branches, maximum leaf length, number of inflorescences, maximum inflorescence length, number of buds, number of flowers, and number of seeds) collected weekly or yearly for all plant species during the summers of 1994-2014 for 48 plots (24 experiment open-top chamber plots and 24 control plots) at four sites (Atqasuk Wet Meadow, Atqasuk Dry Heath, Barrow Wet Meadow, and Barrow Dry Heath). Plant development was followed throughout the entire summer. Plant measures were determined based on species morphology and ease of information collection. Within each plot three permanently marked individuals were monitored for each species if possible. Due to the low percentage of flowering, data on reproductive traits required the measurement of non-tagged plants. Four different data types were collected. They were:
<itemizedlist><listitem><para>
1) 1-3 permanently marked individual plants of each species within a plot
</para></listitem><listitem><para>
2) total plot measures of a species, such as the number of flowers per plot or the first occurrence of a phenophase
</para></listitem><listitem><para>
3) the 1-3 largest reproductive individual plants of a species within a plot and
</para></listitem><listitem><para>
4) the 1-3 largest vegetative individual plants of a species within a plot.
</para></listitem></itemizedlist>
For species such as graminoids that do not form distinct individuals unit areas were established to monitor change over years. The size of unit areas of Carex aquatilis subspecies stans in the Barrow Wet Meadow site and all the species in the Atqasuk Wet Meadow site was 10 by 10 cm. All other unit areas were 5 by 5 cm in size.
</para>
  </section>
</abstract>

amoeba commented 6 years ago

Thanks for writing this up, @laurenwalker . I'm happy to take this one.

csjx commented 6 years ago

Fun: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/p

The start tag is required. The end tag may be omitted if the <p> element is immediately followed by an ... <ul> ...

So the browser is just following the rules. Perhaps the DocBook rules are different.

csjx commented 6 years ago

I guess it depends on how you interpret "immediately followed by".

amoeba commented 6 years ago

This is kinda tricky. I thought up a couple options:

Option 1: Don't use <p> tags at all in eml-text.xsl

I think the fix might be to use <div>s instead of <p>s since it's valid to have block-level elements inside <div>s but not <p>s. We lose some ground on the semantic web battle but I think it'd work.

Convert this EML

<para>
  A paragraph that is interrupted by a list
  <itemizedlist>
    <listitem>
      <para>
        Interrupting list
      </para>
    </listitem>
  </itemizedlist>
  and finished with more text
</para>

to this HTML:

<div>
  A paragraph that is interrupted by a list
  <ul>
    <li>
      <div>
        Interrupting list
      </div>
    </li>
  </ul>
  and finished with more text
</div>

It would seem possible to only convert some <p> tags to <div>s based upon their parent element but, due to the recursive nature of DocBook, I feel like I might not get this right.
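
To make the blanket version of Option 1 concrete, here is a minimal Python sketch of the same mapping. It is not the eml-text.xsl implementation, only an illustration of the recursive rename; TAG_MAP and to_html are hypothetical names, and any element not in the map is passed through unchanged.

```python
import xml.etree.ElementTree as ET

# DocBook-style EML element -> HTML element; <para> becomes <div>, never <p>.
TAG_MAP = {"para": "div", "itemizedlist": "ul", "listitem": "li"}


def to_html(node):
    """Recursively copy `node`, renaming mapped tags and keeping mixed text."""
    out = ET.Element(TAG_MAP.get(node.tag, node.tag))
    out.text = node.text  # text before the first child
    out.tail = node.tail  # text following this element inside its parent
    for child in node:
        out.append(to_html(child))
    return out


eml = (
    "<para>A paragraph that is interrupted by a list"
    "<itemizedlist><listitem><para>Interrupting list</para></listitem></itemizedlist>"
    "and finished with more text</para>"
)
print(ET.tostring(to_html(ET.fromstring(eml)), encoding="unicode"))
```

Because every para maps to the same element regardless of its parent, the recursion needs no special cases, which is the main appeal of this option.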

Option 2: Just change our styling so it "looks" right?

I'm not sure I'm all-in on Option 1, but I think it's my current favorite. What do others think?

amoeba commented 6 years ago

I thought I'd check what the "official" DocBook XSLs do with the above snippet:

      <p>
        A paragraph that is interrupted by a list
      </p>
      <div class="itemizedlist">
        <ul class="itemizedlist" style="list-style-type: disc; ">
          <li class="listitem">
            <p>
              Interrupting list
            </p>
          </li>
        </ul>
      </div>
      <p>
        and finished with more text
      </p>

I looked at the underlying XSLs and I'm not totally sure how they achieve this behavior but it does work. Maybe we should just do this?
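
For discussion, the splitting behavior in that output can be approximated with a short Python sketch. This is not how the DocBook XSLs do it, just an illustration of the idea; unwrap_para is a hypothetical name, the sketch assumes a para contains only text and itemizedlist children, and it leaves out the wrapping div and the class/style attributes.

```python
import xml.etree.ElementTree as ET


def unwrap_para(para):
    """Split a <para> into sibling HTML elements: <p> per text run, <ul> per list.

    Assumes the para holds only text and <itemizedlist> children.
    """
    result = []

    def add_p(text):
        # Emit a <p> only for non-empty text runs.
        if text and text.strip():
            p = ET.Element("p")
            p.text = text.strip()
            result.append(p)

    add_p(para.text)  # text before the first child
    for child in para:
        if child.tag == "itemizedlist":
            ul = ET.Element("ul")
            for item in child.findall("listitem"):
                li = ET.SubElement(ul, "li")
                for inner in item.findall("para"):
                    li.extend(unwrap_para(inner))  # nested paras recurse
            result.append(ul)
        add_p(child.tail)  # text after this child becomes a new <p>
    return result


eml = (
    "<para>A paragraph that is interrupted by a list"
    "<itemizedlist><listitem><para>Interrupting list</para></listitem></itemizedlist>"
    "and finished with more text</para>"
)
print("".join(ET.tostring(e, encoding="unicode") for e in unwrap_para(ET.fromstring(eml))))
```

The result is a flat sequence of p, ul, p, so no list ever ends up nested inside a paragraph.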

amoeba commented 6 years ago

Looked into how DocBook handles this: It's trickier than my current XSL chops can handle. There are two relevant XSLs, with the magic appearing to be done by the unwrap.p template:

block.xsl
html-rtf.xsl

Would we ever consider just importing the appropriate DocBook XSL suite here?