drewnoakes / metadata-extractor

Extracts Exif, IPTC, XMP, ICC and other metadata from image, video and audio files
Apache License 2.0

TIFF image data sometimes loaded as byte[] into memory #221

Open drewnoakes opened 7 years ago

drewnoakes commented 7 years ago

@Nadahar reports in #216 that image data is stored in a TIFF tag and therefore is loaded into memory.

For this file we see:

[Exif IFD0 - 0x0117] Strip Byte Counts = 5859030 bytes
[Exif IFD0 - 0x8649] Unknown tag (0x8649) = [6380 values]
[Exif IFD0 - 0x8773] Inter Color Profile = [3144 values]
[Exif IFD0 - 0x935c] Unknown tag (0x935c) = [4542144 values]

Exiftool handles this via:

Image Source Data : (Binary data 4542144 bytes, use -b option to extract)

Most use cases will not require this data to be loaded into memory. It'd be nice to allow controlling this scenario.

We could simply exclude byte[] with greater than a given number of elements from directories.

Alternatively, we introduce a blacklist for known non-metadata tags.

Either way we should probably store an object that represents the unread data such that writing TIFF data doesn't accidentally lose this information.
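A minimal sketch of the size-threshold idea, using hypothetical `UnreadTagValue` and `TagLoadPolicy` types (neither exists in the library today): the reader would store a small placeholder, recording where the skipped bytes live, instead of a `byte[]` whenever the value exceeds the threshold, so a future TIFF writer could still copy the data back out of the source file.

```java
// Sketch only: neither of these types exists in metadata-extractor.
// "UnreadTagValue" records where the skipped bytes live in the file so
// that a future TIFF writer could copy them without ever loading them.
final class UnreadTagValue {
    final long offset;     // byte offset of the value within the source file
    final long byteCount;  // number of bytes that were left unread

    UnreadTagValue(long offset, long byteCount) {
        this.offset = offset;
        this.byteCount = byteCount;
    }

    @Override
    public String toString() {
        return "(Binary data " + byteCount + " bytes, not loaded)";
    }
}

// A hypothetical size threshold; 65535 here is arbitrary.
final class TagLoadPolicy {
    static final long MAX_INLINE_BYTES = 65535;

    /** True when a value of this size is small enough to read into memory. */
    static boolean shouldLoad(long byteCount) {
        return byteCount <= MAX_INLINE_BYTES;
    }
}
```

For the 0x935c value above (4542144 bytes), `shouldLoad` would return false and the reader would keep a tiny placeholder instead of a ~4.5 MB array.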

Should this logic only apply to TIFF files (.tif and many raw formats)? Should it also apply to Exif data, which happens to be encoded using TIFF? For example, Exif data sometimes has large manufacturer sub-IFDs of unknown formatting which are not decoded and are currently stored as byte arrays. They're unlikely to be useful.

If we do filter tags, should that be the default behaviour? Where would this be specified in the API? There's no top-level context/options object. Ideally this would live on the stack rather than in a static/global. Would we have an all-encompassing options object passed into ImageMetadataReader, and specific ones for other readers? What about nested formats?
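One possible answer to the API question, sketched with a hypothetical `MetadataReadOptions` type (no such class exists in the library): an options object built on the stack per call, so there is no static/global state, and readers for nested formats can be handed the same instance.

```java
// Hypothetical API shape for the open question above; none of these
// names exist in metadata-extractor. The object is created per call,
// so no static/global state is needed.
final class MetadataReadOptions {
    private long maxInlineByteCount = Long.MAX_VALUE; // default: load everything

    MetadataReadOptions withMaxInlineByteCount(long maxBytes) {
        this.maxInlineByteCount = maxBytes;
        return this;
    }

    long getMaxInlineByteCount() {
        return maxInlineByteCount;
    }
}
```

A caller might then write something like `ImageMetadataReader.readMetadata(file, new MetadataReadOptions().withMaxInlineByteCount(65535))`, with nested readers receiving the same object; that call signature is an assumption, not the current API.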

Nadahar commented 7 years ago

I just tested with a local merge of #216 and #225 on my example image. Applying the filter:

new MetadataFilter() {

    @Override
    public boolean tagFilter(Directory directory, int tagType) {
        return !(directory instanceof ExifIFD0Directory) || tagType != 0x935c;
    }

    @Override
    public boolean directoryFilter(Class<? extends Directory> directory) {
        return true;
    }
}

makes the size of the serialized object drop to 23130 bytes, which is more in line with what you would expect. Although this might not hold in every case, tag 0x935c is definitely the problem with this one.

Nadahar commented 7 years ago

This has turned out to be a bigger problem than I initially thought. My "image refactoring" PR UniversalMediaServer/UniversalMediaServer#1061 is getting close to finished, and I'm currently testing it. I have a number of VMs I use for testing under different OSes. When I try to test this on WinXP 32-bit (though it would happen on any 32-bit system) I can't, because I keep running out of memory. I have nothing more to give the JVM because of 32-bit limitations.

The problem is here. The actual error is thrown in RandomAccessStreamReader.getBytes(), but the cause is the attempted storage of that huge byte array.

java.lang.OutOfMemoryError: Java heap space
    at com.drew.lang.RandomAccessStreamReader.getBytes(RandomAccessStreamReader.java:187)
    at com.drew.imaging.tiff.TiffReader.processTag(TiffReader.java:264)
    at com.drew.imaging.tiff.TiffReader.processIfd(TiffReader.java:224)
    at com.drew.imaging.tiff.TiffReader.processIfd(TiffReader.java:216)
    at com.drew.imaging.tiff.TiffReader.processTiff(TiffReader.java:78)
    at com.drew.imaging.tiff.TiffMetadataReader.readMetadata(TiffMetadataReader.java:75)
    at net.pms.image.ImagesUtil.getMetadata(ImagesUtil.java:1582)

I don't think I can release this without a way to avoid reading those huge chunks; I can't expect everyone to have 16 GB of RAM. Both of the attached files cause this issue in this test, but there's nothing special about them: many RAW files store the image data this way.

Credo Barcelona Credo 1017

kwhopper commented 7 years ago

I think this could work similarly to ExifTool. Specific tags per directory are pre-marked as "binary" and then not read into memory by default if above a certain count (512, 65535, etc.). An option could be passed into metadata-extractor that overrides this behaviour, which doesn't seem too out of line to me (for now).

ExifTool typically writes a specific string in place of the tag's binary value (like "Binary data {length} bytes" or some such) and looks for that string when displaying a byte tag. We could avoid that string inspection by adding some properties to Tag, if you want to do it that way, although that's a bigger change.

I don't think a PR would be too difficult if you want me to try it. It seems best to do it on a specific tag-by-tag basis in each directory as needed instead of some global default. Thoughts?

kwhopper commented 7 years ago

... a PR in the .NET project that is. I'm not as fast in the Java project.

Nadahar commented 7 years ago

I think @kwhopper's approach seems good. I'd choose to modify Tag with a flag and then store the size as the Tag value. That's cleaner and faster than using a String and inspection.
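The flag-on-Tag idea can be sketched with a hypothetical stand-in class (the real `Tag` class in the library differs): an explicit "binary skipped" flag plus the byte count, rather than a magic string that has to be parsed back out.

```java
// Sketch only: a stand-in for the proposed Tag change, not the real Tag
// class. The flag and byte count replace ExifTool's magic-string approach.
final class BinaryTagStub {
    private final int tagType;
    private final long byteCount;
    private final boolean binarySkipped;

    BinaryTagStub(int tagType, long byteCount, boolean binarySkipped) {
        this.tagType = tagType;
        this.byteCount = byteCount;
        this.binarySkipped = binarySkipped;
    }

    boolean isBinarySkipped() {
        return binarySkipped;
    }

    /** Human-readable description; no string parsing is needed elsewhere. */
    String description() {
        return binarySkipped
            ? String.format("0x%04x: binary data, %d bytes, not loaded", tagType, byteCount)
            : String.format("0x%04x: %d bytes loaded", tagType, byteCount);
    }
}
```

Display code would check `isBinarySkipped()` directly instead of matching a sentinel string, which is the "cleaner and faster" property described above.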

If @kwhopper makes a PR for .NET I might be able to "translate" it to Java and create a PR.