digital-preservation / PRONOM_Research


Consider adding xhtml file extension to XHTML records fmt/103 fmt/102 #78

Open anjackson opened 2 months ago

anjackson commented 2 months ago

Poking around in the file extension data I've gathered from various places, I noticed that the XHTML PRONOM records (fmt/103, fmt/102) do not list xhtml as a possible file extension. The IANA Media Type registration for XHTML says:

File extension(s) : "xhtml" and "xht" are sometimes used.

Amusingly, it also says:

Magic number(s) : No sequence of bytes can uniquely identify an XHTML document. More information on detecting XHTML documents is available in the MIME Sniffing specification.

While extension-based matching is less than ideal, perhaps it's still worth adding the above file extensions as fallbacks?

In case it helps, here's a comparison with other format info sources: *.xht, *.xhtml
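
To make the fallback idea concrete, here is a rough Python sketch of extension matching that only kicks in when no signature has matched. It is not how DROID or any other PRONOM-based tool is implemented; the function name, the `signature_matches` parameter, and the mapping are all hypothetical, and the extension sets reflect the additions proposed in this issue rather than the current records.

```python
from pathlib import PurePath

# Hypothetical mapping for illustration only. The PUIDs are the two XHTML
# records named in this issue; the extensions are the PROPOSED additions
# (from the IANA registration), not what the records contain today.
PROPOSED_EXTENSIONS = {
    "fmt/102": {"xhtml", "xht"},
    "fmt/103": {"xhtml", "xht"},
}


def candidate_puids(filename, signature_matches=()):
    """Prefer signature matches; fall back to extension matching otherwise."""
    if signature_matches:
        # A signature hit is stronger evidence, so the extension is ignored.
        return sorted(signature_matches)
    # Normalise the extension: drop the leading dot and ignore case.
    ext = PurePath(filename).suffix.lower().lstrip(".")
    return sorted(
        puid for puid, exts in PROPOSED_EXTENSIONS.items() if ext in exts
    )
```

With the proposed extensions in place, a file named `page.xhtml` with no signature match would yield both XHTML records as candidates, which is weak evidence but better than no identification at all.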