Open lblatchford opened 3 years ago
It looks like that is a COM field in the JPEG file, which holds comments. For this field, the JPEG specification says "the interpretation is left to the application", so there is no standard encoding for it, and the same likely applies to all other string fields in a JPEG file. To preserve the comment and other field data exactly, it probably makes sense to change the encoding to ISO-8859-1. That encoding has no illegal byte values (both US-ASCII and UTF-8 do), so comment data will never be replaced due to decoding errors, and parsing followed by unparsing should reproduce the bytes exactly. It does mean Daffodil won't detect garbage or malicious comment data, but that can be handled outside of Daffodil if it matters to the use case.
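As a quick sanity check of that claim (using Python's built-in codecs, not Daffodil itself), here is a sketch showing that US-ASCII and UTF-8 replace the 0xa8 byte while ISO-8859-1 round-trips it losslessly. The comment bytes are made up for illustration:

```python
# Hypothetical comment data containing the problematic non-ASCII byte 0xa8
data = b"Comment with a non-ASCII byte: \xa8"

# US-ASCII and UTF-8 both reject 0xa8; decoding with replacement
# substitutes U+FFFD, so the original byte is lost
ascii_text = data.decode("us-ascii", errors="replace")
utf8_text = data.decode("utf-8", errors="replace")
assert ascii_text[-1] == "\ufffd"
assert utf8_text[-1] == "\ufffd"

# ISO-8859-1 maps every byte 0x00-0xFF to a code point, so a
# decode/encode round trip preserves the data exactly
latin1_text = data.decode("iso-8859-1")
assert latin1_text.encode("iso-8859-1") == data
```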
Would you like to create a pull request switching to ISO-8859-1?
dirtyword5x.jpg has a non-ASCII byte 0xa8 at offset 0x1a1b. test25.jpg has the same non-ASCII byte at offset 0x17ef.
When these files are parsed and then unparsed, the 0xa8 byte becomes 0x3f ('?'). After the parse, the infoset has 0xa8 translated to 0xEFBFBD, the UTF-8 encoding of the replacement character U+FFFD. If the encoding in the schema is changed from US-ASCII to UTF-8, 0xa8 becomes 0xEFBFBD in the final JPEG as well, which is at least clearer.
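The byte translation described above can be reproduced with Python's codecs (a sketch, not Daffodil's actual code path):

```python
# Decoding the non-ASCII byte 0xa8 as UTF-8 with replacement yields
# the replacement character U+FFFD
replaced = b"\xa8".decode("utf-8", errors="replace")
assert replaced == "\ufffd"

# U+FFFD encodes in UTF-8 as the three bytes EF BF BD, which is what
# appears in the infoset and in the unparsed JPEG under a UTF-8 schema
assert replaced.encode("utf-8") == b"\xef\xbf\xbd"
```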
Should the encoding be UTF-8 or something else other than US-ASCII?