digital-preservation / pronom

Simple Maven Artifacts for PRONOM Signatures
http://www.nationalarchives.gov.uk/PRONOM/Default.aspx

Improve identification of XML based formats #13

Open jackdos opened 4 years ago

jackdos commented 4 years ago

There are a number of file formats that are effectively XML of a specific kind. Their signatures currently rely on finding the correct namespace for the XML schema in the byte sequence, often within X bytes of the BOF. This has a couple of issues:

More formally, we generally want to declare that a format is XML with a root node of X, within the namespace Y. It might make sense to stop at identification of fmt/101 for XML, then hand off to some specific XML parsing to assess the root node and namespace criteria.
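As a rough illustration of that hand-off, here is a minimal Python sketch using only the standard library. It assumes the file has already been identified as XML; the function names `root_name_and_namespace` and `is_format` are made up for this example, and the GPX 1.1 namespace URI is used purely as sample input, not as a statement about what the fmt/1134 signature should contain.

```python
import io
import xml.etree.ElementTree as ET


def root_name_and_namespace(data: bytes):
    """Return (local_name, namespace_uri) for the root element.

    namespace_uri is None when the root is in no namespace.
    Returns None when the data cannot be parsed as far as the root element.
    """
    try:
        # The first "start" event is always the root element, so we can
        # stop there without reading the rest of the document.
        for _event, elem in ET.iterparse(io.BytesIO(data), events=("start",)):
            tag = elem.tag
            if tag.startswith("{"):
                # ElementTree spells qualified names as "{uri}local".
                uri, _, local = tag[1:].partition("}")
                return local, uri
            return tag, None
    except ET.ParseError:
        return None
    return None


def is_format(data: bytes, root: str, namespace: str) -> bool:
    """True when the root element is `root` in namespace `namespace`."""
    return root_name_and_namespace(data) == (root, namespace)


# Sample value only: the namespace URI defined by GPX 1.1.
GPX_NS = "http://www.topografix.com/GPX/1/1"
```

Because a real XML parser resolves the prefix, `<gpx xmlns="...">` and `<g:gpx xmlns:g="...">` are treated the same, regardless of attribute order or how far into the file the namespace declaration appears.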

This might alternatively be solved with something that allows more regex-like specifications, including back-referenceable groups and greedy/non-greedy matching, and might thus be related to #12, i.e.:

[Optional <?xml.. declaration] followed by opening <, followed by (optional prefix) + :, followed by {RootNode}, followed by non-greedy matching of any number of bytes (not including >), followed by "xmlns[optional :\1]={SpecificNamespace}", followed by non-greedy matching of any number of bytes, followed by >

Which hopefully says, "the XML declaration is optional; after that, within the first <> pair, you should have a specific RootNode name, optionally prefixed with an unknown string, and a namespace declaration, either a default one or one bound to that unknown prefix, where the namespace itself is {SpecificNamespace}".
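To make the idea concrete, the description above can be approximated with Python's `re` module, which already supports back-references and conditional groups. This is only a sketch of the matching logic, not a proposal for PRONOM signature syntax: `build_root_pattern` is a hypothetical helper, the GPX 1.1 namespace is sample input, and the pattern assumes an ASCII-compatible encoding with no BOM, comments, or DOCTYPE before the root element.

```python
import re

# Either quote character may delimit an attribute value.
QUOTE = rb"[\x22\x27]"


def build_root_pattern(root: str, namespace: str):
    """Compile a bytes pattern for 'root element `root` in `namespace`'."""
    root_b = re.escape(root.encode("ascii"))
    ns_b = re.escape(namespace.encode("ascii"))
    return re.compile(
        rb"\A\s*(?:<\?xml[^>]*?\?>\s*)?"  # optional XML declaration
        rb"<(?:([^\s:<>/]+):)?"           # optional prefix, captured as group 1
        + root_b
        + rb"(?=[\s/>])"                  # the root name must end here
        rb"[^>]*?"                        # other attributes, non-greedy, no '>'
        rb"\sxmlns(?(1):\1)"              # 'xmlns:<prefix>' if prefixed, else 'xmlns'
        rb"\s*=\s*" + QUOTE + ns_b + QUOTE
        + rb"[^>]*?>"                     # rest of the opening tag
    )


# Sample value only: the namespace URI defined by GPX 1.1.
gpx_pattern = build_root_pattern("gpx", "http://www.topografix.com/GPX/1/1")
```

The conditional `(?(1):\1)` is what expresses "optional `:\1`": when a prefix was captured, the namespace declaration must use that same prefix; otherwise it must be the default `xmlns`. One known weakness of this approach is that `[^>]*?` stops early if an attribute value legitimately contains `>`.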

JonTilbury commented 4 years ago

Attached files demonstrate the problem for GPX (fmt/1134):

example1.gpx – (route from Strava) - identified OK
example2.gpx – (route from OS Maps) - identified as generic XML
example6.gpx – (ride from Strava) - identified as generic XML

GPXexamples.zip

marhop commented 4 years ago

Just in case you missed it, also have a look at @richardlehane's thoughts on this topic and the discussion around them.

jackdos commented 4 years ago

Thanks @marhop, I had missed that conversation the first time around. Without taking the time to fully absorb everything, it seems like pushing for some extension of regex-like syntax, with classes and back-referenceable groups, might allow us to solve a number of issues.