drewnoakes / metadata-extractor

Extracts Exif, IPTC, XMP, ICC and other metadata from image, video and audio files
Apache License 2.0

FileTypeDetector can not detect some jpegs. #449

Closed (rdonuk closed 4 years ago)

rdonuk commented 4 years ago

Hello, I am using the latest version. The code below throws an exception.

URL url = new URL("https://scontent-ort2-2.cdninstagram.com/v/t51.2885-15/e35/s1080x1080/65964416_289848805148001_2882766301193014775_n.jpg?_nc_ht=scontent-ort2-2.cdninstagram.com&_nc_cat=109&oh=b7a6f10c1c0a70531559af62e8c60580&oe=5E77ED20");
InputStream in = url.openStream();
Metadata metadata = ImageMetadataReader.readMetadata(in);

The exception is:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 3
    at com.drew.metadata.mp3.Mp3Reader.extract(Mp3Reader.java:105)
    at com.drew.imaging.mp3.Mp3MetadataReader.readMetadata(Mp3MetadataReader.java:58)
    at com.drew.imaging.ImageMetadataReader.readMetadata(ImageMetadataReader.java:180)
    at com.drew.imaging.ImageMetadataReader.readMetadata(ImageMetadataReader.java:125)
    at com.drew.imaging.ImageMetadataReader.readMetadata(ImageMetadataReader.java:104)
    at es.trendit.tools.exif.ImageMetadataExtractor.main(ImageMetadataExtractor.java:166)

So it seems FileTypeDetector thinks that the JPEG is an MP3. Any thoughts?

drewnoakes commented 4 years ago

JPEG detection currently looks for the two opening bytes FF D8.

MP3 detection currently looks for a single opening byte FF.

The image you provided is successfully detected as JPEG using the latest code on master. What version of the library are you using?

It seems that MP3 detection could be made more robust as well. Wikipedia suggests that MP3s start with FF FB or 49 44 33 (ID3 in ASCII).
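As a rough sketch of the signature checks described above, assuming only the byte patterns quoted in this comment: the `MagicSniffer` class and its `Kind` enum are hypothetical names for illustration and are not the library's FileTypeDetector. Other MP3 header variants exist and are not covered here.

```java
// Hypothetical helper for illustration only; not part of metadata-extractor.
final class MagicSniffer {
    enum Kind { JPEG, MP3, UNKNOWN }

    // True when data begins with the given unsigned byte values.
    static boolean startsWith(byte[] data, int... magic) {
        if (data.length < magic.length) {
            return false; // too short to contain the whole signature
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Checks the two-byte JPEG signature before the MP3 ones, so that a
    // leading FF alone is never enough to report MP3.
    static Kind sniff(byte[] head) {
        if (startsWith(head, 0xFF, 0xD8)) {
            return Kind.JPEG;
        }
        if (startsWith(head, 0xFF, 0xFB) || startsWith(head, 0x49, 0x44, 0x33)) {
            return Kind.MP3;
        }
        return Kind.UNKNOWN;
    }
}
```

With these rules a buffer that holds only the single byte FF is reported as UNKNOWN rather than MP3, which is the case that matters when fewer bytes arrive than expected.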

payton commented 4 years ago

@drewnoakes Agreed that MP3 detection could be more robust. I'm not sure what the best way to do that would be without some form of pre-processing before file type detection.

In this diagram from the wiki, you can see the fourth byte contains configuration information about the file and may vary based on a few factors. ID3 is also a metadata format that, although primarily used for MP3, may be used with other formats. Identifying a file based on ID3 may not be accurate unless we can confirm by first reading through the ID3 block.
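One way to "read through the ID3 block" might look like the sketch below. It assumes the standard 10-byte ID3v2 header (the ASCII bytes ID3, two version bytes, one flags byte, then a four-byte synchsafe size) and then looks for the 11-bit MPEG frame sync right after the tag. The `Id3Probe` class and its method names are hypothetical, the optional ID3v2.4 footer is ignored, and files with padding or junk between the tag and the first frame would be missed.

```java
// Hypothetical helper for illustration only; not part of metadata-extractor.
final class Id3Probe {
    // Returns the ID3v2 tag length in bytes (10-byte header included),
    // or -1 when the data does not start with a plausible ID3v2 header.
    static int id3v2TagLength(byte[] data) {
        if (data.length < 10 || data[0] != 'I' || data[1] != 'D' || data[2] != '3') {
            return -1;
        }
        int size = 0;
        for (int i = 6; i < 10; i++) {
            int b = data[i] & 0xFF;
            if ((b & 0x80) != 0) {
                return -1; // synchsafe bytes never have the top bit set
            }
            size = (size << 7) | b; // 7 significant bits per byte
        }
        return 10 + size;
    }

    // True when the 11 frame-sync bits (FF followed by 111xxxxx) sit at offset.
    static boolean hasFrameSyncAt(byte[] data, int offset) {
        return offset >= 0
                && offset + 1 < data.length
                && (data[offset] & 0xFF) == 0xFF
                && (data[offset + 1] & 0xE0) == 0xE0;
    }

    // Skips a leading ID3v2 tag if present, then requires a frame sync.
    static boolean looksLikeMp3(byte[] data) {
        int tagLength = id3v2TagLength(data);
        return hasFrameSyncAt(data, tagLength < 0 ? 0 : tagLength);
    }
}
```

Under this check a JPEG that starts with FF D8 is rejected, because D8 does not have its top three bits set. The cost is that the caller needs enough of the file buffered to reach the end of the tag, which is the pre-processing concern raised above.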

rdonuk commented 4 years ago

Hi @drewnoakes, I am using the latest release: 2.12. But I tried the same image with the newest code and I am still getting MP3 as the file type. Am I doing anything wrong?

String urlString = "https://scontent-ort2-2.cdninstagram.com/v/t51.2885-15/e35/s1080x1080/65964416_289848805148001_2882766301193014775_n.jpg?_nc_ht=scontent-ort2-2.cdninstagram.com&_nc_cat=109&oh=b7a6f10c1c0a70531559af62e8c60580&oe=5E77ED20";
URL url = new URL(urlString);
InputStream in = url.openStream();
System.out.println(FileTypeDetector.detectFileType(new BufferedInputStream(in)));

drewnoakes commented 4 years ago

@payton that diagram suggests using FF FB as well, not just FF. It seems like we can improve this.

@rdonuk it looks like there is a bug in FileTypeDetector.detectFileType. The call to inputStream.read can read fewer bytes than needed, yet processing continues. I'll take a look at this and see if I can fix both issues.
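To illustrate the short-read problem: a single call to InputStream.read(byte[], int, int) may return fewer bytes than requested, which is common on network streams. The sketch below loops until the buffer is full or the stream ends. It is not the code from the library or from any fix; `ReadFully`, `readUpTo` and `OneByteAtATime` are hypothetical names, and the latter exists only to simulate a slow stream.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not part of metadata-extractor.
final class ReadFully {
    // Reads until buffer is full or the stream ends.
    // Returns the number of bytes actually read, which may be less than
    // buffer.length only when the stream ended early.
    static int readUpTo(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n < 0) {
                break; // end of stream
            }
            total += n;
        }
        return total;
    }

    // Test double that hands out at most one byte per read call,
    // mimicking a stream that delivers data in small pieces.
    static final class OneByteAtATime extends FilterInputStream {
        OneByteAtATime(InputStream in) {
            super(in);
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, 1));
        }
    }
}
```

A caller would then compare the returned count with the number of bytes the detector needs and treat anything shorter as too little data to decide on, instead of matching against a partly filled buffer.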

drewnoakes commented 4 years ago

I've opened a PR that I think fixes this. Reviews appreciated.

rdonuk commented 4 years ago

Thank you @drewnoakes, it seems your change solves my problem. Do you have any plans for a new release?