DCLP / dclpxsltbox

Sandbox for development, testing, and review of XSLT for DCLP
http://dclp.github.io/dclpxsltbox/

invalid XML in <objectDesc> #50

Closed: paregorios closed 8 years ago

paregorios commented 10 years ago

The recently implemented solution for a more prose-like object description was to introduce an initial typed paragraph for display and to follow it with <supportDesc> etc. Here's an example:

<physDesc>
   <objectDesc form="roll">
      <p type="bookForm">papyrus roll (columns: 36, written lines: [34, 35])</p>
      <supportDesc>
         <support>
            <material>papyrus</material>
         </support>
      </supportDesc>
      <layoutDesc>
         <layout columns="36" writtenLines="[34, 35]"/>
      </layoutDesc>
   </objectDesc>
</physDesc>

Unfortunately, this example throws a number of validation errors against the EpiDoc schema, each of which exemplifies similar problems across the entire dataset:

attribute "type" not allowed on <p>; expected attribute "ana", "change", "copyOf", "corresp", "decls", "exclude", "facs", "n", "next", "part", "prev", "rend", "rendition", "sameAs", "select", "style", "synch", "xml:base", "xml:id", "xml:lang" or "xml:space"

element "supportDesc" not allowed here; expected the element end-tag or element "ab" or "p"

element "layoutDesc" not allowed here; expected the element end-tag or element "ab" or "p"

We can probably find an alternative to the type attribute on the <p> element, which TEI does not allow, without much trouble. The bigger issue is structural: inside <objectDesc>, TEI expects either a series of <p> or <ab> elements or the more structural elements like <supportDesc> and <layoutDesc>, not both.

Assigning it to myself to research an alternative.
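
To gauge how many files are affected by the structural problem, a check along the following lines might help. It is only a minimal sketch: the function names are hypothetical, the sample omits the TEI namespace for brevity, and it looks only at the direct children of <objectDesc> rather than running the full schema.

```python
import xml.etree.ElementTree as ET

# Element groups taken from the validation messages above.
PROSE = {"p", "ab"}
STRUCTURAL = {"supportDesc", "layoutDesc"}


def local_name(tag):
    """Strip a '{namespace}' prefix, if any, from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def mixes_prose_and_structure(object_desc):
    """Return True if <objectDesc> has both prose and structural children."""
    children = {local_name(child.tag) for child in object_desc}
    return bool(children & PROSE) and bool(children & STRUCTURAL)


# Hypothetical sample modelled on the example in this issue.
SAMPLE = """<objectDesc form="roll">
   <p type="bookForm">papyrus roll (columns: 36, written lines: [34, 35])</p>
   <supportDesc><support><material>papyrus</material></support></supportDesc>
   <layoutDesc><layout columns="36" writtenLines="[34, 35]"/></layoutDesc>
</objectDesc>"""

if __name__ == "__main__":
    print(mixes_prose_and_structure(ET.fromstring(SAMPLE)))
```

A real run would iterate over the data files and collect the ones for which the function returns True.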

paregorios commented 10 years ago

Another validation error thrown by the above example is:

value of attribute "writtenLines" is invalid; token "[34," invalid; must be an integer

We presently have a number of strings like "[34, 35]" appearing in the writtenLines attribute on <layout>. TEI demands only an integer value, so presumably we'd need some sort of precision/accuracy encoding. This also needs research.

paregorios commented 10 years ago

We also have content for the columns attribute on <layout> that is invalid in some files, e.g.:

<layoutDesc>
   <layout columns="16+"/>
</layoutDesc>

Again, TEI expects integers only.

Edelweiss commented 9 years ago

One step closer to a valid EpiDoc structure might be achieved by simply removing the @type attribute from the TEI:p tag and by moving the whole tag into the layout section, e.g. TM No. 555 could look like this:

<objectDesc form="fragment">
   <supportDesc>
      <support>
         <material>papyrus</material>
      </support>
   </supportDesc>
   <layoutDesc>
      <layout columns="4" writtenLines="11">
         <p>papyrus fragment (columns: 4, written lines: 11, pagination: 0)</p>
      </layout>
   </layoutDesc>
</objectDesc>

Edelweiss commented 9 years ago

value of attribute "form" is invalid; must be an XML name

<objectDesc form="codex (1 fol.)">
   <p type="bookForm">papyrus codex (1 fol.) (columns: 2, written lines: 20)</p>
   <supportDesc>
      <support>
         <material>papyrus</material>
      </support>
   </supportDesc>
   <layoutDesc>
      <layout columns="2" writtenLines="20"/>
   </layoutDesc>
</objectDesc>

e.g. DCLP TM no. 99549

jcowey commented 9 years ago

List of values for form:

If there is no value for bookForm in the LDAB database, then the attribute is omitted.

jcowey commented 9 years ago

In the example above:

value of attribute "form" is invalid; must be an XML name

<objectDesc form="codex (1 fol.)">
   <p type="bookForm">papyrus codex (1 fol.) (columns: 2, written lines: 20)</p>
   <supportDesc>
      <support>
         <material>papyrus</material>
      </support>
   </supportDesc>
   <layoutDesc>
      <layout columns="2" writtenLines="20"/>
   </layoutDesc>
</objectDesc>

e.g. DCLP TM no. 99549

Clean up the values of form, columns and written lines. Cleaning would involve the removal of spaces, commas, plus marks etc. Thus we would lose information, but we know that all legacy information is contained in the p tag.

It would now look like:

<objectDesc form="codex">
   <supportDesc>
      <support>
         <material>papyrus</material>
      </support>
   </supportDesc>
   <layoutDesc>
      <layout columns="2" writtenLines="20">
         <p>papyrus codex (1 fol.) (columns: 2, written lines: 20)</p>
      </layout>
   </layoutDesc>
</objectDesc>
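
The clean-up described above could be sketched roughly as follows. The exact rules applied to the data are not spelled out in this thread, so both helpers are assumptions: `clean_form` keeps only the leading name-like token, and `clean_numbers` keeps the digit runs and joins them with single spaces.

```python
import re


def clean_form(raw):
    """Keep the leading name-like token, e.g. 'codex (1 fol.)' -> 'codex'.

    Returns None when nothing usable is found, in which case the
    attribute would be omitted (assumption).
    """
    match = re.match(r"[A-Za-z_][\w.-]*", raw.strip())
    return match.group(0) if match else None


def clean_numbers(raw):
    """Drop brackets, commas, plus marks etc. and keep only the integers."""
    return " ".join(re.findall(r"\d+", raw))


if __name__ == "__main__":
    print(clean_form("codex (1 fol.)"))   # codex
    print(clean_numbers("16+"))           # 16
    print(clean_numbers("[34, 35]"))      # 34 35
```

Anything stripped this way (the "+", the folio count) survives only in the prose <p>, as noted above.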

paregorios commented 9 years ago

@jcowey so, do you think we're ready to implement then?

jcowey commented 9 years ago

These have been implemented in https://github.com/DCLP/idp.data/tree/hd with the commit https://github.com/DCLP/idp.data/commit/b2d37ab288441dc74f0cd44c856fbc72c9daa4b7

I would be happy to see this and the other commits https://github.com/DCLP/idp.data/commits/hd of October 2015 folded into our dclp branch https://github.com/DCLP/idp.data/tree/dclp once you have had a look and found the changes acceptable. You will notice that Carmen also added an Authority statement. Once we make the facsimile commit (planned for Wednesday morning) the files should validate - a prerequisite for building the editor. If anything else is unclear please let me know.

paregorios commented 9 years ago

The referenced commit has greatly improved validation performance and brought our content into a more standard TEI convention with regard to object and layout description; however, a number of validation errors relevant to these sections still seem to obtain. These are:

Edelweiss commented 9 years ago

We didn’t take into account that TEI allows at most two numbers, separated by whitespace. We were misled into believing the list could be of any length. We need a new, valid TEI solution for that.

For columns one solution could be to count the number of columns and write this calculated value instead.

1 2 3 4 would become 4
1 2 5 would become 3

For writtenLines: If an attribute contains more than two values we just keep the biggest and the smallest number in the order of their appearance within the original data.
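
A minimal sketch of the two reductions follows. It is not the actual fix script, and it makes one assumption the proposal leaves open: values with one or two numbers are already valid and are left alone, so the counting rule for columns only applies to longer lists.

```python
def reduce_columns(value):
    """Replace a list of more than two column numbers by their count."""
    numbers = value.split()
    if len(numbers) <= 2:
        return value  # assumption: already valid, keep as is
    return str(len(numbers))


def reduce_written_lines(value):
    """Keep only the smallest and biggest number, in order of appearance."""
    numbers = [int(token) for token in value.split()]
    if len(numbers) <= 2:
        return value  # already valid, keep as is
    low, high = min(numbers), max(numbers)
    # Order the pair by where each value first occurs in the original data.
    first, second = sorted((low, high), key=numbers.index)
    return f"{first} {second}"


if __name__ == "__main__":
    print(reduce_columns("1 2 3 4"))            # 4
    print(reduce_columns("1 2 5"))              # 3
    print(reduce_written_lines("30 12 25 41"))  # 12 41
    print(reduce_written_lines("41 30 12"))     # 41 12
```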

paregorios commented 9 years ago

Sounds good to me!

jcowey commented 8 years ago

carry out necessary fixes using script fix35

Edelweiss commented 8 years ago

example file old

/TEI/teiHeader/fileDesc/sourceDesc/msDesc/physDesc/objectDesc/p[@type="bookForm"]

example file new

/TEI/teiHeader/fileDesc/sourceDesc/msDesc/physDesc/objectDesc/layoutDesc/layout/p
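
For illustration, the move from the old location to the new one can be sketched as below. This is not the script that was actually run: the function name is hypothetical, and the sample leaves out the TEI namespace that the real files declare.

```python
import xml.etree.ElementTree as ET


def move_book_form(object_desc):
    """Move objectDesc/p[@type='bookForm'] to objectDesc/layoutDesc/layout/p.

    The type attribute is dropped on the way. Returns True if a move
    happened, False if either the paragraph or the target is missing.
    """
    p = object_desc.find("p[@type='bookForm']")
    layout = object_desc.find("layoutDesc/layout")
    if p is None or layout is None:
        return False
    object_desc.remove(p)
    del p.attrib["type"]
    p.tail = None
    layout.append(p)
    return True


# Hypothetical sample modelled on the examples in this issue.
OLD = """<objectDesc form="codex">
   <p type="bookForm">papyrus codex (1 fol.) (columns: 2, written lines: 20)</p>
   <supportDesc><support><material>papyrus</material></support></supportDesc>
   <layoutDesc><layout columns="2" writtenLines="20"/></layoutDesc>
</objectDesc>"""

if __name__ == "__main__":
    root = ET.fromstring(OLD)
    move_book_form(root)
    print(ET.tostring(root, encoding="unicode"))
```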

paregorios commented 8 years ago

navigator/@81edcc1 is an attempt to address this change. Note that underlying data in some cases may still leave something to be desired, but the files are all validating and that valid content is being serialized into the HTML. The following examples were considered in the preceding comments:

Over to @jcowey, @rla2118, and @HolgerEssler for review.