paregorios closed this issue 8 years ago
Another validation error thrown by the above example is:
value of attribute "writtenLines" is invalid; token "[34," invalid; must be an integer
We presently have a number of strings like "[34, 35]" appearing in the writtenLines attribute on <layout>. TEI demands only an integer value, so presumably we'd need some sort of precision/accuracy encoding. This also needs research.
We also have content for the columns attribute on <layout> that is invalid in some files, e.g.:
<layoutDesc>
<layout columns="16+"/>
</layoutDesc>
Again, TEI expects integers only.
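A quick way to see which tokens trip the validator is sketched below. It is only an illustration: the function name invalid_tokens is hypothetical, and it assumes that the schema checks each whitespace-separated token of the attribute value against a plain integer pattern, which is what the error message above suggests.

```python
import re

# Assumption: a valid token is a run of ASCII digits and nothing else.
_INTEGER = re.compile(r"[0-9]+")


def invalid_tokens(value: str) -> list[str]:
    """Return the whitespace-separated tokens of value that are not plain integers."""
    return [tok for tok in value.split() if not _INTEGER.fullmatch(tok)]


print(invalid_tokens("[34, 35]"))  # ['[34,', '35]']
print(invalid_tokens("16+"))       # ['16+']
print(invalid_tokens("11"))        # []
```

The first result matches the token "[34," reported in the validation error quoted above.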
One step closer to a valid EpiDoc structure might be achieved by simply removing the @type attribute from the TEI:p tag and moving the whole tag into the layout section, e.g. TM No. 555 could look like this:
<objectDesc form="fragment">
<supportDesc>
<support>
<material>papyrus</material>
</support>
</supportDesc>
<layoutDesc>
<layout columns="4" writtenLines="11">
<p>papyrus fragment (columns: 4, written lines: 11, pagination: 0)</p>
</layout>
</layoutDesc>
</objectDesc>
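The move described above could be sketched with Python's standard xml.etree.ElementTree as follows. This is not the script that was actually used: the function name move_book_form is hypothetical, the "before" sample is an assumption modeled on the typed-paragraph pattern discussed in this thread, and the snippet ignores the TEI namespace that real files declare.

```python
import xml.etree.ElementTree as ET


def move_book_form(object_desc: ET.Element) -> None:
    """Move objectDesc/p[@type='bookForm'] into layoutDesc/layout and drop @type."""
    p = object_desc.find("p[@type='bookForm']")
    layout = object_desc.find("layoutDesc/layout")
    if p is None or layout is None:
        return  # nothing to move, or nowhere to put it
    object_desc.remove(p)
    del p.attrib["type"]   # the type attribute is what the schema rejects on p
    p.tail = None          # discard whitespace that followed p in its old position
    layout.insert(0, p)


# Assumed "before" state (no namespace, for brevity).
sample = (
    '<objectDesc form="fragment">'
    '<p type="bookForm">papyrus fragment (columns: 4, written lines: 11, pagination: 0)</p>'
    "<supportDesc><support><material>papyrus</material></support></supportDesc>"
    '<layoutDesc><layout columns="4" writtenLines="11"/></layoutDesc>'
    "</objectDesc>"
)
root = ET.fromstring(sample)
move_book_form(root)
print(ET.tostring(root, encoding="unicode"))
```

For real DCLP files the element names would need to be qualified with the TEI namespace (http://www.tei-c.org/ns/1.0) in the find() calls.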
value of attribute "form" is invalid; must be an XML name
<objectDesc form="codex (1 fol.)">
<p type="bookForm">papyrus codex (1 fol.) (columns: 2, written lines: 20)</p>
<supportDesc>
<support>
<material>papyrus</material>
</support>
</supportDesc>
<layoutDesc>
<layout columns="2" writtenLines="20"/>
</layoutDesc>
</objectDesc>
e.g. DCLP TM no. 99549
List of values for form:
If there is no value for bookForm in the LDAB database, then the attribute is omitted.
In the example above:
Clean up the values of form, columns and writtenLines. Cleaning would involve removing spaces, commas, plus signs etc. We would thus lose information in the attributes, but we know that all legacy information is still contained in the p tag.
It would now look like
<objectDesc form="codex">
<supportDesc>
<support>
<material>papyrus</material>
</support>
</supportDesc>
<layoutDesc>
<layout columns="2" writtenLines="20">
<p>papyrus codex (1 fol.) (columns: 2, written lines: 20)</p>
</layout>
</layoutDesc>
</objectDesc>
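A minimal sketch of what such a cleanup might look like is given below. The function names clean_form and clean_numbers are hypothetical, and the rules are assumptions drawn from the examples in this thread: form keeps only its leading name-like token (a rough approximation of an XML name), and the numeric attributes keep only their digit runs.

```python
import re


def clean_form(value):
    """Keep the leading name-like token, e.g. 'codex (1 fol.)' -> 'codex'.

    Returns None when nothing usable is left, in which case the attribute
    would simply be omitted.
    """
    match = re.match(r"[A-Za-z_][A-Za-z0-9_.-]*", value.strip())
    return match.group(0) if match else None


def clean_numbers(value):
    """Keep only the digit runs, separated by single spaces."""
    return " ".join(re.findall(r"[0-9]+", value))


print(clean_form("codex (1 fol.)"))  # codex
print(clean_numbers("16+"))          # 16
print(clean_numbers("[34, 35]"))     # 34 35
```

The discarded detail (the folio count, the plus sign) survives only in the p text, which is why that paragraph is kept unchanged.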
@jcowey so, do you think we're ready to implement then?
These have been implemented in https://github.com/DCLP/idp.data/tree/hd with the commit https://github.com/DCLP/idp.data/commit/b2d37ab288441dc74f0cd44c856fbc72c9daa4b7
I would be happy to see this and the other commits https://github.com/DCLP/idp.data/commits/hd of October 2015 folded into our dclp branch https://github.com/DCLP/idp.data/tree/dclp once you have had a look and found the changes acceptable. You will notice that Carmen also added an Authority statement. Once we make the facsimile commit (planned for Wednesday morning) the files should validate - a prerequisite for building the editor. If anything else is unclear please let me know.
The referenced commit has greatly improved validation performance and brought our content into a more standard TEI convention with regard to object and layout description; however, a number of validation errors relevant to these sections still seem to obtain. These are:
We didn’t take into account that TEI allows at most two numbers, separated by whitespace. We were misled into believing the list could be of any length. We need a new, valid TEI solution for that.
For columns one solution could be to count the number of columns and write this calculated value instead.
"1 2 3 4" would become "4"
"1 2 5" would become "3"
For writtenLines: If an attribute contains more than two values we just keep the biggest and the smallest number in the order of their appearance within the original data.
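The two rules proposed here can be sketched as below. The function names count_columns and reduce_written_lines are hypothetical, and the sketch assumes that values with one or two numbers are already valid and are left untouched, and that every token is a plain integer by this stage.

```python
def count_columns(value: str) -> str:
    """Replace a list of more than two column numbers with their count."""
    tokens = value.split()
    if len(tokens) <= 2:
        return value  # assumption: already valid, leave as is
    return str(len(tokens))


def reduce_written_lines(value: str) -> str:
    """Keep only the smallest and biggest number, in order of appearance."""
    nums = [int(tok) for tok in value.split()]
    if len(nums) <= 2:
        return value  # assumption: already valid, leave as is
    # Positions of the first occurrence of the minimum and the maximum;
    # the set collapses them when all values are equal.
    keep = sorted({nums.index(min(nums)), nums.index(max(nums))})
    return " ".join(str(nums[i]) for i in keep)


print(count_columns("1 2 3 4"))            # 4
print(count_columns("1 2 5"))              # 3
print(reduce_written_lines("20 5 13 30"))  # 5 30
print(reduce_written_lines("30 12 5"))     # 30 5
```

Note that the second rule preserves the original ordering, so the result is not necessarily an ascending range.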
Sounds good to me!
carry out necessary fixes using script fix35
/TEI/teiHeader/fileDesc/sourceDesc/msDesc/physDesc/objectDesc/p[@type="bookForm"]
/TEI/teiHeader/fileDesc/sourceDesc/msDesc/physDesc/objectDesc/layoutDesc/layout/p
navigator/@81edcc1 is an attempt to address this change. Note that underlying data in some cases may still leave something to be desired, but the files are all validating and that valid content is being serialized into the HTML. The following examples were considered in the preceding comments:
Over to @jcowey, @rla2118, and @HolgerEssler for review.
The recently implemented solution for more prosey object description was to introduce an initial typed paragraph for use in display and then to follow it with <supportDesc> etc. Here's an example:
Unfortunately, this example throws a number of validation errors against the EpiDoc schema, all of which are exemplary of similar problems across the entire dataset:
We can probably easily find an alternative solution for using the type attribute on the <p> element, which is not allowed in TEI, but the bigger issue is the structural one: TEI expects us to use either a series of <p> or <ab> elements inside <objectDesc> or the more structural elements like <supportDesc> and <layoutDesc>, not both.
Assigning it to myself to research an alternative.