digital-preservation / PRONOM_Research


Need to tighten fmt/1776 - XML 1.1 signature - currently causes clashes #21

Closed Dclipsham closed 1 year ago

Dclipsham commented 1 year ago

The attached files are SVG 1.1, but are getting dual identification with fmt/1776 - XML 1.1. They are not XML 1.1, they are XML 1.0, but the liberal signature for XML 1.1 is picking up the 'svg version=1.1' string and interpreting that as part of the XML version identifier. The files are not additionally getting identification as fmt/101 XML 1.0 because fmt/92 SVG 1.1 has priority over fmt/101.

unavail.svg up.svg wait.svg

Currently the XML 1.1 signature is 3C3F786D6C(20|09){0-30}76657273696F6E{0-30}3D{0-30}(22|27)312E31(22|27). This translates as: <?xml([space]|[horizontal tab]){0-30}version{0-30}={0-30}('|")1.1('|")
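To illustrate the clash, here is a minimal sketch that hand-translates the current byte sequence into a Python regular expression and runs it against an SVG-style header. It assumes the sequence is anchored at the beginning of the file and that each {0-30} block matches any bytes; the sample header is hypothetical and is not one of the attached files.

```python
import re

# Hand translation of the current fmt/1776 sequence: {0-30} becomes .{0,30}.
# re.DOTALL lets the wildcard cross line breaks, as a byte-level wildcard would.
LIBERAL_XML11 = re.compile(
    rb"<\?xml[\x20\x09].{0,30}version.{0,30}=.{0,30}[\x22\x27]1\.1[\x22\x27]",
    re.DOTALL,
)

# Hypothetical XML 1.0 header of an SVG 1.1 file (illustration only).
svg_header = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<svg version="1.1" xmlns="http://www.w3.org/2000/svg">'
)

# match() anchors at offset 0, standing in for a beginning-of-file anchor.
match = LIBERAL_XML11.match(svg_header)
print(match is not None)
print(match.group(0) if match else None)
```

In this sketch the pattern matches even though the declaration says 1.0: the wildcard after 'version' reaches the '=' in the encoding attribute, and the next wildcard reaches the "1.1" that belongs to the svg element.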

Regarding the three {0-30} wildcard blocks: The XML 1.1 specification, section 2.8 (https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-prolog-dtd) states "XML 1.1 documents must begin with an XML declaration which specifies the version of XML being used"

The specification also states that the VersionInfo part of the XML declaration of the prolog is as follows: VersionInfo ::= S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"') and that 'Eq' can have whitespace characters either side of the equals character: Eq ::= S? '=' S?
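As a rough sketch of what those productions permit, the VersionInfo rule can be written as a Python regular expression. This assumes the specification's definitions S ::= (#x20 | #x9 | #xD | #xA)+ and, for XML 1.1, VersionNum ::= '1.1'; the names S, EQ and VERSION_INFO are illustrative only.

```python
import re

# S ::= (#x20 | #x9 | #xD | #xA)+
S = rb"[\x20\x09\x0D\x0A]+"
# Eq ::= S? '=' S?
EQ = rb"[\x20\x09\x0D\x0A]*=[\x20\x09\x0D\x0A]*"
# VersionInfo ::= S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')
# \x27 is the single quote and \x22 the double quote, as in the hex signature.
VERSION_INFO = re.compile(
    S + rb"version" + EQ + rb"(?:\x271\.1\x27|\x221\.1\x22)"
)

for candidate in (b' version="1.1"', b"\n\tversion = '1.1'", b' version="1.1\''):
    print(candidate, VERSION_INFO.fullmatch(candidate) is not None)
```

The alternation requires the opening and closing quotes to be the same character, which the (22|27)...(22|27) form in the signature does not enforce.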

Although the XML 1.1 specification allows white space to consist of one or more space, tab, carriage return or line feed characters, in practice you're very unlikely to see an XML declaration written like <?xml version = "1.1"; the main use for liberal white space in XML is indentation (e.g. pretty-print styles). You're also unlikely to see a tab character between 'xml' and 'version', but I'm more relaxed about that one...

The prolog does allow for miscellaneous comments, but these appear after the XML declaration, not within or before it.

If there are file examples available where the version declaration does appear later on, I'm curious to see whether this is because of a specific subtype/interpretation of XML 1.1 that doesn't adhere to the formal standard.

Otherwise, I would recommend the sequence is amended to the following: 3C3F786D6C(20|09)76657273696F6E{0-1}3D{0-1}(22|27)312E31(22|27)
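A quick way to sanity-check the proposed sequence is to hand-translate it into a Python regular expression and try a few headers. This is only a sketch: it assumes the sequence is anchored at the beginning of the file, treats {0-1} as "zero or one arbitrary byte", and uses made-up sample headers rather than the attached files.

```python
import re

# Hand translation of the proposed fmt/1776 sequence: {0-1} becomes .{0,1}.
TIGHT_XML11 = re.compile(
    rb"<\?xml[\x20\x09]version.{0,1}=.{0,1}[\x22\x27]1\.1[\x22\x27]",
    re.DOTALL,
)

# Hypothetical headers for illustration.
samples = [
    b'<?xml version="1.1"?>',                   # plain XML 1.1 declaration
    b"<?xml version = '1.1'?>",                 # one space either side of '='
    b'<?xml version="1.0" encoding="UTF-8"?>\n<svg version="1.1">',  # SVG 1.1
    b'<?xml  version="1.1"?>',                  # two spaces before 'version'
]

for header in samples:
    # match() anchors at offset 0, standing in for a beginning-of-file anchor.
    print(header, TIGHT_XML11.match(header) is not None)
```

Under these assumptions the first two headers match and the SVG header does not. The last header is legal per the specification but is not matched, which is the trade-off of allowing only a single space or tab after 'xml'.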

Further, fmt/1776 has been given a subtype relationship against fmt/102 Extensible Hypertext Markup Language 1.0. This is incorrect, and fmt/1776 is not a subtype of fmt/101 XML 1.0 either. I believe the intended relationship here should be 'Is subsequent version of [fmt/101] Extensible Markup Language 1.0'.

Many thanks, David

tnafrancesca commented 1 year ago

Having looked at the original request, the sample files we have and the documentation, you are correct: the signature could be tighter. I have put in what you suggested and will upload a signature for you to test later today. The issue with the relationship was a copying error; thank you for letting us know! If the new signature looks good, I'll close this issue.

tnafrancesca commented 1 year ago

Hi David, do the changes in v.114 mean we can now close this request?

tnafrancesca commented 1 year ago

I believe this has now been resolved. If not, we can re-open it at a later date.