properly strip trailing null characters (\0) from tag data

drewnoakes / metadata-extractor

Extracts Exif, IPTC, XMP, ICC and other metadata from image, video and audio files

Apache License 2.0

2.57k stars 481 forks source link

properly strip trailing null characters (\0) from tag data #636

Closed gtiwari333 closed 8 months ago

gtiwari333 commented 10 months ago

A change from https://github.com/drewnoakes/metadata-extractor/pull/634/files#diff-7df436f0258b5156fd15792db94b4cab6119180ad7d605533ffb53213f048ba2R400 got missed in earlier merge.

https://github.com/drewnoakes/metadata-extractor/pull/635

gtiwari333 commented 10 months ago

This PR will fix the extraction of picture control name that's coming up with trailing 0s.

drewnoakes commented 10 months ago

Thanks for following up here. I intentionally changed your last PR before merging to avoid the approach here though. Let's see if there's a different approach that'll work for you.

When pulling strings from bytes, we generally store StringValue rather than String. This allows the consumer to reinterpret the underlying data in different encodings. If we convert to string, that data is lost.

The problem you're seeing is that StringValue is not trimming trailing null bytes. I wonder if there's a better way to deal with that.

drewnoakes commented 10 months ago

I ported the fix to .NET as well here: https://github.com/drewnoakes/metadata-extractor-dotnet/pull/349

drewnoakes commented 10 months ago

And CI is fixed again. Sorry about that.

gtiwari333 commented 10 months ago

Instead of adding getNullTerminatedStringAndSkipToNextPosition, we can call getNullTerminatedStringValue

There's a problem with getNullTerminatedStringValue. After reading the requested number of bytes from input, it doesn't move the cursor to the end. My test testGetNullTerminatedStringCursorPositionTest explains what's wrong with getNullTerminatedStringValue method. Only way to get around it is to check the size of text returned and manually skip the remaining bytes.

For input AB\0\0\0CD\0ABCDEF

If you ask for getNullTerminatedString(5) it will return AB but keep the cursor position at index 3.

The new method getNullTerminatedStringAndSkipToNextPosition reads the AB and move the cursor at C.

If you compare testGetNullTerminatedStringCursorPositionTest and testGetNullTerminatedStringAndSkipToNextPosition you will see the improvement the new method makes.

gtiwari333 commented 10 months ago

Did few more testing and found an easy way to deal with it. Let me know what would you prefer?

The getStringValue reads the all 20 bytes and moves the cursor to the end. If we trim the value using String.trim(), it will remove the trailing null characters.

directory.setString(TAG_PICTURE_CONTROL_NAME, reader.getStringValue(20, Charsets.UTF_8).toString())

Change this to:

directory.setString(TAG_PICTURE_CONTROL_NAME, reader.getStringValue(20, Charsets.UTF_8).toString().trim())

drewnoakes commented 8 months ago

I ended up fixing this in #647 with an approach that mirrors what we did in the .NET version, so that they're as close as possible (which helps porting other changes in future). It also avoids adding toString calls to reduce allocations and allow consumers to reinterpret the StringValue in other encodings if needed.

drewnoakes commented 8 months ago

Oops. Thanks!

Nadahar commented 8 months ago

@drewnoakes I deleted my comment because I assumed you would edit yours - but ended up making it make no sense at all 😊