Hint for using read_exif, read_xmp, read_iptc, read_comment

LeoHsiao1 / pyexiv2

Read and write image metadata, including EXIF, IPTC, XMP, ICC Profile.

GNU General Public License v3.0

Hint for using read_exif, read_xmp, read_iptc, read_comment #151

Closed RalfPeter closed 1 month ago

RalfPeter commented 1 month ago

Good morning. This is for anyone who has had difficulties using the functions read_exif, read_xmp, and read_iptc. From time to time, my script crashed with a runtime error and no further error message. I suspected the cause lay in how exiv2 handles encoding, so I experimented with different images. I discovered that some images contained metadata that was not UTF-8 encoded, but ISO-8859-1 encoded. So I wrote the following routines (with analogous ones for XMP, IPTC, and comments). Perhaps someone will find them helpful:

    # -------------------------------------------------------------------------------------------
    # ENCODING_UTF, ENCODING_ISO, log and ERROR are defined elsewhere in the script
    # (presumably ENCODING_UTF = 'utf-8' and ENCODING_ISO = 'iso-8859-1').
    def _log_unicode_decode_details(self, e: UnicodeDecodeError, verbose=False):
        # Collect the details carried by the exception
        error_details = [e.encoding, e.object, e.start, e.end, e.reason]
        # Pass them to the log function
        if verbose:
            log('EXIF unicode_decode', f"Error in photo {self.file.name}: type: {type(e).__name__}", *error_details, level=ERROR)

    # -------------------------------------------------------------------------------------------
    def _read_exif(self):
        # The default encoding is UTF-8, but the metadata could be ISO-8859-1
        try:
            data = self.image.read_exif(encoding=ENCODING_UTF)
        except UnicodeDecodeError as e:
            # Pass the error to the log function (silent unless verbose)
            self._log_unicode_decode_details(e)
            try:
                data = self.image.read_exif(encoding=ENCODING_ISO)
            except UnicodeDecodeError as e:
                self._log_unicode_decode_details(e, verbose=True)
                # Re-raise the caught exception. A bare `raise UnicodeDecodeError`
                # would fail, because the class requires five constructor arguments.
                raise

        return data
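
The same fallback idea can be sketched as a standalone helper. This is only an illustration, not part of pyexiv2: `read_with_fallback` and the stand-in reader `fake_read_exif` are hypothetical names, and the sketch assumes the reader is a callable that accepts an `encoding` keyword, as `read_exif` does in the routine above.

```python
from typing import Any, Callable, Sequence, Tuple


def read_with_fallback(
    reader: Callable[..., Any],
    encodings: Sequence[str] = ("utf-8", "iso-8859-1"),
) -> Tuple[Any, str]:
    """Call reader(encoding=...) with each encoding until one succeeds.

    Returns (data, encoding_used). Re-raises the last UnicodeDecodeError
    if every encoding fails.
    """
    last_error = None
    for encoding in encodings:
        try:
            return reader(encoding=encoding), encoding
        except UnicodeDecodeError as e:
            last_error = e
    if last_error is None:
        raise ValueError("no encodings given")
    raise last_error


# Hypothetical stand-in for image.read_exif: it decodes fixed ISO-8859-1 bytes.
def fake_read_exif(encoding: str = "utf-8") -> dict:
    raw = "Müller".encode("iso-8859-1")
    return {"Exif.Image.Artist": raw.decode(encoding)}


data, used = read_with_fallback(fake_read_exif)
print(used, data)
```

With a real image, the call would presumably look like `read_with_fallback(img.read_exif)`. Returning the encoding that succeeded makes it visible when the fallback was used.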

LeoHsiao1 commented 1 month ago

Thank you for sharing. ISO-8859-1 is a special case: it is a single-byte encoding in which every byte value in the range 0x00 to 0xFF is valid. Therefore, bytes in any encoding can be decoded as ISO-8859-1 without an error, but the original meaning may be lost. For example:

Suppose you encode a Chinese string:
```
>>> data = '你好'.encode('gbk')
```

It cannot be decoded as UTF-8:
```
>>> data.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
```

If you know that the data is GBK-encoded, you can decode it with the same encoding:
```
>>> data.decode('gbk')
'你好'
```
Decoding it as ISO-8859-1 raises no error, but you do not get the original Chinese characters back:
```
>>> data.decode('ISO-8859-1') 
'ÄãºÃ'
```
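
A short check in plain Python (no pyexiv2 involved) may make both points concrete: every byte value decodes under ISO-8859-1, and the mis-decoded text still carries the original bytes. The variable names are only illustrative.

```python
# Every possible byte value decodes under ISO-8859-1 ...
all_bytes = bytes(range(256))
text = all_bytes.decode("iso-8859-1")
# ... to the code point with the same number.
assert [ord(ch) for ch in text] == list(range(256))

# The garbled text can be turned back into the original bytes, so decoding
# again with the correct codec recovers the Chinese string.
garbled = "你好".encode("gbk").decode("iso-8859-1")
recovered = garbled.encode("iso-8859-1").decode("gbk")
print(garbled, recovered)
```

The recovery step only works when the true encoding (here GBK) is known.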

In conclusion, ISO-8859-1 is not a one-size-fits-all encoding. A decode that raises no error does not prove the text is correct, so users are advised to find out the original encoding of the data.
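
One way to start narrowing down the original encoding is to try a list of candidates on the raw bytes and compare the results by eye. The sketch below is a hedged illustration in plain Python: `decodable_with` is a hypothetical helper and the candidate list is an assumption you would adapt to where your images come from.

```python
def decodable_with(raw: bytes, candidates):
    """Return {encoding: decoded_text} for every candidate that decodes raw."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # This candidate rejects the bytes, so it cannot be the original.
            pass
    return results


raw = "你好".encode("gbk")
results = decodable_with(raw, ["utf-8", "gbk", "iso-8859-1"])
for name, text in results.items():
    print(name, repr(text))
```

Here UTF-8 is ruled out, but both GBK and ISO-8859-1 accept the bytes. Only knowledge of the source, or a human looking at the text, can tell which result is the intended one.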

LeoHsiao1 commented 1 month ago

I am closing this issue now. Hopefully others with the same problem will find it when they search.

RalfPeter commented 1 month ago

Thank you, Leo. My only aim was to give anybody with the same problem some advice on how to use the encoding parameter with your pyexiv2. Thank you for your additional comment.