Open drewnoakes opened 7 years ago
I just tested with a local merge of #216 and #225 on my example image. Applying the filter:
new MetadataFilter() {
@Override
public boolean tagFilter(Directory directory, int tagType) {
return !(directory instanceof ExifIFD0Directory) || tagType != 0x935c;
}
@Override
public boolean directoryFilter(Class<? extends Directory> directory) {
return true;
}
}
makes the size of the serialized object change to 23130 bytes - which is more in line with what you would expect. Although this might not be true in every case, tag 0x935c
is definitly the problem with this one.
This has turned out to be a bigger problem than I initially thought. My "image refactoring" PR UniversalMediaServer/UniversalMediaServer#1061 is getting close to finished, and I'm currently doing testing. I have a number of VM's I use for testing under different OS'es. When I try to test this on WinXP 32bit (but it would happen on any 32 bit) I can't test it because I keep running out of memory. I have nothing more more to give the JVM because of 32 bit limitations.
The problem is here. The actual error is thrown in RandomAccessStreamReader.getBytes()
, but the cause is the attempted storage of that huge byte array.
java.lang.OutOfMemoryError: Java heap space
at com.drew.lang.RandomAccessStreamReader.getBytes(RandomAccessStreamReader.java:187)
at com.drew.imaging.tiff.TiffReader.processTag(TiffReader.java:264)
at com.drew.imaging.tiff.TiffReader.processIfd(TiffReader.java:224)
at com.drew.imaging.tiff.TiffReader.processIfd(TiffReader.java:216)
at com.drew.imaging.tiff.TiffReader.processTiff(TiffReader.java:78)
at com.drew.imaging.tiff.TiffMetadataReader.readMetadata(TiffMetadataReader.java:75)
at net.pms.image.ImagesUtil.getMetadata(ImagesUtil.java:1582)
I don't think I can release this without a way to avoid reading those huge chunks, I can't expect everyone to have 16 GB of RAM. Both of the attached files are causing this issue on this test, but there's nothing special about them - many RAW files stores the image data in such a way.
I think this could work similar to ExifTool. Specific tags per directory are pre-marked as "binary" and then not read into memory by default if above a certain count (512, 65535, etc). Some option could be passed in ME that ignores this behavior, which doesn't seem too out of line to me (for now).
ExifTool typically writes a specific string to replace the tag's binary value (like "Binary data {length} bytes" or somesuch) and looks for that string when displaying a byte tag. We could do this without inspection by adding some properties to Tag if you want to do it that way, although that's a bigger change.
I don't think a PR would be too difficult if you want me to try it. It seems best to do it on a specific tag-by-tag basis in each directory as needed instead of some global default. Thoughts?
... a PR in the .NET project that is. I'm not as fast in the Java project.
I think @kwhopper's approach seems good. I'd choose to modify Tag
with a flag and then store the size in the Tag
value. Cleaner and faster than using String
and inspection.
If @kwhopper makes a PR for .NET I might be able to "translate" it to Java and create a PR.
@Nadahar reports in #216 that image data is stored in a TIFF tag and therefore is loaded into memory.
For this file we see:
Exiftool handles this via:
Most use cases will not require this data to be loaded into memory. It'd be nice to allow controlling this scenario.
We could simply exclude
byte[]
with greater than a given number of elements from directories.Alternatively, we introduce a blacklist for known non-metadata tags.
Either way we should probably store an object that represents the unread data such that writing TIFF data doesn't accidentally lose this information.
Should this logic only apply to TIFF files (.tif and many raw formats)? Should it also apply to Exif data, which happens to be encoded using TIFF? For example, Exif data sometimes has large manufacturer sub-IFDs of unknown formatting which are not decoded and currently stored as byte arrays. They're unlikely useful.
If we do filter tags, should that be the default behaviour? Where would this be specified in the API? There's no top-level context/options object. Ideally this would be on the stack rather than a static/global. Would we have a all-encompassing options object passed into
ImageMetadataReader
, and specific ones for other readers? What about nested formats?