Open sbshep opened 7 months ago
Hi Scott, this has come up a few times in the past. Adding more range to the EOF offset is an option, but many folks keep the default 65536 maximum byte scan setting in DROID. This definitely needs to be addressed, as we see it more and more. This change would affect multiple PUIDs, some of which might never have this padding.
The EXIF specs state that any JPG renderer should ignore anything after the last FFD9 marker, so we should find something similar for PRONOM.
I discussed a bit and linked to samples here: https://preservation.tylerthorsted.com/2023/06/23/jpg-structure/
If you aggregate all the PRONOM signatures, the ceiling for EOF scanning is currently 131084:
Making this offset larger than that would raise that ceiling and mean doing much larger end-of-file reads for all file types.
Another option is to define additional signatures where the FFD9 marker is anchored to the beginning of the file but at a wildcard offset, e.g. FFD8FFE1{2}4578696600004D4D002A*900000070000000430323231*FFD9
In terms of changing the default setting in DROID, I definitely agree. The benchmarks I run show that the impact should be tolerable (DROID still works really well with a -1 setting): https://www.itforarchivists.com/siegfried/benchmarks
Richard's suggestion of anchoring FFD9 at the end of the BOF signature might work best. Two concerns:
I echo Tyler's questions. I like the idea of anchoring FFD9 from the BOF, but what happens if there is more than one FFD9 in a file? Does that pose a problem?
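One rough way to explore the multiple-FFD9 question outside DROID is to translate the suggested signature into a Python regex. This is only an approximation of PRONOM matching (I'm assuming `{2}` means two arbitrary bytes and `*` means any run of bytes), and the sample bytes below are synthetic, not taken from a real JPG:

```python
import re

# Approximate regex for:
# FFD8FFE1{2}4578696600004D4D002A*900000070000000430323231*FFD9
SIG = re.compile(
    rb"\xff\xd8\xff\xe1"                  # SOI + APP1 marker
    rb".{2}"                              # {2}: two arbitrary bytes
    rb"Exif\x00\x00MM\x00\x2a"            # 4578696600004D4D002A
    rb".*?"                               # *: any run of bytes
    rb"\x90\x00\x00\x07\x00\x00\x00\x04"  # 9000000700000004
    rb"0221"                              # 30323231
    rb".*?"                               # *: any run of bytes
    rb"\xff\xd9",                         # first FFD9 found after the header
    re.DOTALL,
)

EOI = b"\xff\xd9"

# Synthetic file: Exif header, an embedded "thumbnail" ending in FFD9,
# the main image ending in FFD9, then a long run of zero padding.
header = (
    b"\xff\xd8\xff\xe1\x00\x10Exif\x00\x00MM\x00\x2a"
    + b"\x00" * 8
    + b"\x90\x00\x00\x07\x00\x00\x00\x04" + b"0221"
)
thumbnail = b"\x11" * 4 + EOI
main_image = b"\x22" * 4 + EOI
padding = b"\x00" * 200_000

data = header + thumbnail + main_image + padding
truncated = header + thumbnail  # main image and its FFD9 are missing

m = SIG.match(data)  # match() anchors at the beginning of the file
print(m is not None)                           # True
print(m.end() == len(header) + len(thumbnail)) # True: stops at the first FFD9
print(SIG.match(truncated) is not None)        # True
```

In this sketch a second FFD9 does not stop the match; the pattern is satisfied by the first FFD9 it reaches. The flip side is that the truncated sample also matches, which lines up with the concern that a BOF-anchored FFD9 may identify files that would not otherwise identify.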
Any further thoughts on how to handle these JPG files with extra data after the final FFD9? As I've experimented with anchoring to the BOF, I find that it may identify files that wouldn't identify normally, but since we're using JHOVE as a validation of these files afterward, I'm not so worried about missing a file that has a problem. Is anchoring to the BOF a viable option for everyone?
Would love to discuss this issue further. I just received a USB drive with hundreds of JPG files that are not identified because they have large padding at the end. They are iPhone images.
We are getting a lot of .JPG files from modern camera phones that add a lot of zeroes after the final FFD9, so many as to exceed the maximum EOF offset. The result is that DROID doesn't identify those files.
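To check whether a given file falls into this category, a small script can measure how many bytes follow the last FFD9 and compare that with a maximum EOF offset. This is a minimal sketch: the function names are my own, the sample data is synthetic, and it assumes the padding itself does not contain an FFD9 sequence.

```python
from typing import Optional

EOI = b"\xff\xd9"  # JPEG End Of Image marker


def padding_after_eoi(data: bytes) -> Optional[int]:
    """Return the number of bytes after the last FFD9, or None if absent."""
    pos = data.rfind(EOI)
    if pos == -1:
        return None
    return len(data) - (pos + len(EOI))


def eoi_within_eof_offset(data: bytes, max_offset: int) -> bool:
    """True if the last FFD9 ends no more than max_offset bytes before EOF."""
    pad = padding_after_eoi(data)
    return pad is not None and pad <= max_offset


# Synthetic stand-in for a padded JPG: SOI, some body bytes, EOI, zeros.
sample = b"\xff\xd8" + b"\x00" * 10 + EOI + b"\x00" * 70_000

print(padding_after_eoi(sample))               # 70000
print(eoi_within_eof_offset(sample, 65536))    # False
print(eoi_within_eof_offset(sample, 131072))   # True
```

For a real file, read it with `open(path, "rb").read()` and pass the bytes to `padding_after_eoi`.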
The existing JPG signature files have a maximum EOF offset of 16000 or 65536 or 131072. Were those offsets chosen for a specific reason? Is there any reason the offset couldn't be extended much higher to account for the extra padding in these modern files?
As a test, I created a signature that matches fmt/645 but I increased the maximum EOF offset to 999999999 (note: I found that going higher by adding even one more 9 resulted in DROID failing to load the profile). I ran a sample file through and it identified it correctly. There is apparently a ceiling past which the profile won't load, but even 999999999 should be sufficient, I think.
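One possible explanation for that ceiling, offered only as a guess: 999999999 fits in a signed 32-bit integer, while a value with one more 9 does not. Whether DROID actually stores the offset as a 32-bit integer is an assumption on my part and is not confirmed anywhere in this thread. The arithmetic itself is easy to check:

```python
INT32_MAX = 2**31 - 1  # 2147483647
INT32_MIN = -(2**31)


def fits_int32(value: int) -> bool:
    """True if value is representable as a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


for offset in (131_072, 999_999_999, 9_999_999_999):
    print(offset, fits_int32(offset))
# 131072 True
# 999999999 True
# 9999999999 False
```

If that guess is right, the practical upper bound would be 2147483647 rather than 999999999, but that would need testing in DROID itself.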
Any thoughts or experience with increasing the maximum EOF offset? I've attached a sample .jpg that identifies as fmt/645 if you either remove the padding or increase the maximum EOF offset.
fmt645 if you remove padding.JPG.zip