buzz / mediainfo.js

Extract media file metadata in the browser using WebAssembly.
https://mediainfo.js.org
BSD 2-Clause "Simplified" License

Character encoding issue #150

Closed icidasset closed 5 months ago

icidasset commented 7 months ago

Bug Description

The metadata of an MP3 file has character encoding issues. It lists the artist as �me instead of Âme.

Steps to Reproduce

const mediainfo = await MediaInfoFactory({
    coverData: covers,
    locateFile: () => {
      return "../../wasm/media-info.wasm";
    },
})

const result = await mediainfo.analyzeData(
    getSize(headUrl),
    readChunk(getUrl)
)

Where getSize and readChunk are defined here: https://github.com/icidasset/diffuse/blob/b11d5b0a8d204ed3db401c0eedb8da25e4076131/src/Javascript/processing.ts#L116-L155

Using the file: https://www.dropbox.com/scl/fi/u7qw9rgo7xjmcajsh1oex/211-me_-_rej.mp3?rlkey=tmwckk19p4n7r7gd5g8ngk7cz&dl=0

You can try it out in the app itself using this branch: https://github.com/icidasset/diffuse/tree/encoding-issue-mediainfo

You'll have to upload the file to a supported service in order to test it though. I can do that for you if you want, let me know.

Expected Behavior

Expected to see the artist Âme

Actual Behavior

Got the artist �me

Additional Information

I tested this with the mediainfo CLI tool installed via Homebrew (which uses MediaInfoLib v24.01), and it parsed the metadata correctly. Other metadata parsers show the info correctly too.

Thanks for this great project!

buzz commented 7 months ago

That seems to be an old file that uses latin-1 encoding.

```
$ mid3v2 --list-raw 211-me_-_rej.mp3
Raw IDv2 tag info for 211-me_-_rej.mp3
[...]
TPE1(encoding=<Encoding.LATIN1: 0>, text=['Âme'])
[...]
```

When I open the file in a tag editor and recreate the tags, mediainfo.js shows them properly:

Before:

```
$ mediainfo.js 211-me_-_rej.mp3 | grep Performer
Performer : �me
```

After:

```
$ mediainfo.js 211-me_-_rej.mp3 | grep Performer
Performer : Âme
```

Interestingly, the mediainfo.js web page shows them correctly. Modern browsers usually auto-detect the encoding.

![Screenshot_2024-02-12_04-33-35](https://github.com/buzz/mediainfo.js/assets/12035/68e170d3-87c5-460c-89ee-8647e03d6a05)
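As a rough illustration of how a replacement character like this can appear (not a claim about what mediainfo.js or MediaInfoLib does internally): the Latin-1 encoding of "Âme" starts with the byte 0xC2, which is not valid UTF-8 when it is followed by an ASCII letter. The sketch assumes a runtime with the standard `TextDecoder` and support for the `windows-1252` label (browsers, or Node.js built with full ICU).

```javascript
// Bytes of "Âme" in Latin-1 (ISO-8859-1): Â = 0xC2, m = 0x6D, e = 0x65.
const latin1Bytes = new Uint8Array([0xc2, 0x6d, 0x65]);

// Decoded as UTF-8: 0xC2 opens a two-byte sequence, but 0x6D is not a
// continuation byte, so the decoder emits U+FFFD (the "�" character).
const asUtf8 = new TextDecoder("utf-8").decode(latin1Bytes);

// Decoded with a Latin-1 compatible decoder. windows-1252 agrees with
// Latin-1 for the byte 0xC2.
const asLatin1 = new TextDecoder("windows-1252").decode(latin1Bytes);

console.log(JSON.stringify(asUtf8));   // "�me"
console.log(JSON.stringify(asLatin1)); // "Âme"
```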

I think the mediainfo CLI converts those strings to UTF-8 before printing them to the screen. A lot of flags are disabled for the emscripten build. You could test by building mediainfo.js with those flags enabled.

Another approach would be to solve this outside of mediainfo.js. There are libs that detect and convert between encodings. Or even make UTF-8 mandatory as the character encoding in your app.
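A minimal sketch of what such a conversion outside of mediainfo.js could look like, without a detection library. It only applies if you have the raw tag bytes yourself; the fields mediainfo.js returns are already strings. `decodeTagBytes` is a hypothetical helper name, and the "try UTF-8, else fall back to windows-1252" rule is an assumed heuristic, not something the thread prescribes.

```javascript
// Hypothetical helper (not part of mediainfo.js): decode raw tag bytes,
// preferring UTF-8 and falling back to windows-1252, which matches Latin-1
// everywhere except the 0x80-0x9F range.
function decodeTagBytes(bytes) {
  try {
    // fatal: true makes the decoder throw a TypeError on invalid UTF-8
    // instead of silently inserting U+FFFD.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    return new TextDecoder("windows-1252").decode(bytes);
  }
}

// "Âme" stored as Latin-1 (C2 6D 65) and as UTF-8 (C3 82 6D 65).
const fromLatin1 = decodeTagBytes(new Uint8Array([0xc2, 0x6d, 0x65]));
const fromUtf8 = decodeTagBytes(new Uint8Array([0xc3, 0x82, 0x6d, 0x65]));
console.log(fromLatin1, fromUtf8);
```

The heuristic can guess wrong: some Latin-1 byte sequences happen to be valid UTF-8, and it knows nothing about UTF-16 tags.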

icidasset commented 7 months ago

Thanks for investigating! 🙏

So weird that it does work in your demo. I guess the difference is that I am reading chunks using the fetch API (file is hosted on Amazon S3) and the Range header, whereas your demo reads the whole file using FileReader. And the demo is an older version of mediainfo. Maybe I need to set some encoding option/header somewhere?

> even make UTF-8 mandatory as character encoding in your app.

I have `<meta charset="utf-8" />` set in my HTML file, I assume that's all I need?

I also tried converting to UTF8 manually, but no luck. I'll take a look at the different mediainfo/emscripten flags.
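One possible reason a manual conversion does not help, assuming the string handed to the app already contains U+FFFD: the replacement is lossy, so the original byte is gone by then. A small sketch using the standard `TextDecoder` and `TextEncoder`:

```javascript
// Two different Latin-1 byte sequences: "Âme" (C2 6D 65) and "Ème" (C8 6D 65).
const a = new TextDecoder("utf-8").decode(new Uint8Array([0xc2, 0x6d, 0x65]));
const b = new TextDecoder("utf-8").decode(new Uint8Array([0xc8, 0x6d, 0x65]));

// Both collapse to the same string, so the original byte cannot be recovered.
console.log(a === b); // true

// Re-encoding yields the UTF-8 bytes of U+FFFD (EF BF BD), not 0xC2 or 0xC8.
const reencoded = Array.from(new TextEncoder().encode(a), (x) => x.toString(16));
console.log(reencoded.join(" ")); // ef bf bd 6d 65
```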

JeromeMartinez commented 7 months ago

> When I open the file in a tag editor and recreate the tags, mediainfo.js shows them properly

Could you share the resulting file? Because I don't get how it could be different. Maybe they change the encoding to UTF-8? But MediaInfoLib should manage the difference and show the same MediaInfo::Get() output. Could you debug the MediaInfo::Get() result and output a hex dump of it for the 2 files?

> TPE1(encoding=<Encoding.LATIN1: 0>, text=['Âme'])

MediaInfoLib parses it as Latin1 too and provides the string in Unicode or UTF-8, so no reason that mediainfo.js shows "�me" with this file on the command line.

> So weird that it does work in your demo. I guess the difference is that I am reading chunks using the fetch API (file is hosted on Amazon S3) and the Range header, whereas your demo reads the whole file using FileReader

Input kind should not change MediaInfoLib behavior.

> I think the mediainfo CLI converts those strings to UTF-8 before printing them to the screen.

It converts to the local code page. But if the OS is misconfigured (e.g. LC_CTYPE is not set to something like "en_US.UTF-8"), it may convert the internal Unicode to the C locale (so Latin1... but from a UTF-8 string, hence the issue). Maybe mediainfo.js has a similar issue?

> A lot of flags are disabled for the emscripten build.

Should not have any impact. There is a known issue with modern platforms (having UTF-8 encoding for the terminal) and such MP3 files with Latin1 encoding, but only if MediaInfoLib is compiled in non-Unicode mode (MEDIAINFO_UNICODE_NO).

If possible, a first debugging step would be to inspect the result of MediaInfo::Get() and see how it is encoded (Unicode chars? UTF-8?), so you know whether the issue comes from MediaInfoLib or from something else (mediainfo.js, or a platform config somewhere not handling UTF-8).
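On the JavaScript side, a comparable check could look like the sketch below. It inspects a string as it arrives in JS, which is a later stage than the C++ `MediaInfo::Get()` call itself. `dumpString` is a hypothetical helper, and the two sample strings stand in for real metadata fields.

```javascript
// Hypothetical debugging helper: list each code point of a string together
// with its UTF-8 bytes, to check what a metadata field actually contains.
function dumpString(value) {
  const encoder = new TextEncoder();
  // Spreading a string yields whole code points (surrogate pairs stay together).
  return [...value].map((ch) => {
    const codePoint = "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
    const bytes = Array.from(encoder.encode(ch), (b) => b.toString(16).padStart(2, "0"));
    return `${codePoint} [${bytes.join(" ")}]`;
  });
}

console.log(dumpString("\u00C2me")); // a correctly decoded "Âme"
console.log(dumpString("\uFFFDme")); // the string reported in this issue
```

If the first entry is `U+FFFD`, the information was already lost before the string reached application code.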

icidasset commented 7 months ago

Thanks for chiming in, @JeromeMartinez. I'd love to help out more with this, but I'm having a difficult time getting the project to build 🙈

Anyhow! I did figure out that it seems to be a problem with mediainfo.js v0.2. I've updated the demo from the gh-pages-src branch to v0.2.1 (from v0.1.9) and then the encoding problem showed up there too:

[Screenshot 2024-02-12 at 15 17 12]

So I guess we can exclude any app-specific code.

buzz commented 7 months ago

I created an MP3 file with tags in different encodings. Most media players and tag editors display the tags just fine (with exceptions, e.g. EasyTAG fails to display the UTF-16BE tag).

A test case and the mp3 test file can be found in the 150-id3-character-encodings branch.

Python script used to create the test file:

```python
"""
Create id3 tags in different character encodings.

$ mid3v2 --list-raw char_enc_tags.mp3
Raw IDv2 tag info for char_enc_tags.mp3
TIT2(encoding=<Encoding.UTF8: 3>, text=['utf-8 〃𐍈'])
TPE1(encoding=<Encoding.LATIN1: 0>, text=['latin-1 ãâ¬Æ'])
TALB(encoding=<Encoding.UTF16: 1>, text=['utf-16 〃𐍈'])
TCON(encoding=<Encoding.UTF16BE: 2>, text=['utf-16be 〃𐍈'])
"""

from mutagen.id3 import Encoding, ID3, TALB, TCON, TIT2, TPE1

tags = ID3()
# performer
tags.add(TPE1(encoding=Encoding.LATIN1, text=["latin-1 ãâ¬Æ"]))
# title
tags.add(TIT2(encoding=Encoding.UTF8, text=["utf-8 〃𐍈"]))
# album
tags.add(TALB(encoding=Encoding.UTF16, text=["utf-16 〃𐍈"]))
# genre
tags.add(TCON(encoding=Encoding.UTF16BE, text=["utf-16be 〃𐍈"]))
tags.save("char_enc_tags.mp3")
```

Findings for different flavors of mediainfo:

✅ https://mediainfo.js.org/ - mediainfo.js v0.1.9 (MediaInfoLib v22.09)

![mediainfo js org_v0 1 9](https://github.com/buzz/mediainfo.js/assets/12035/ec76d79b-03d8-4c75-bef0-a9f8761e9b73)

✅ mediainfo.js CLI v0.1.9 (MediaInfoLib v22.09)

```shell
$ node dist/cli.js --format JSON ../char_encoding_issue_150/char_enc_tags.mp3
```

```json
"Title": "utf-8 〃𐍈",
"Album": "utf-16 〃𐍈",
"Track": "utf-8 〃𐍈",
"Performer": "latin-1 ãâ¬Æ",
"Genre": "utf-16be 〃𐍈",
```

❌ mediainfo.js CLI v0.2.1 (MediaInfoLib v24.01)

```shell
$ pnpm exec node dist/cjs/cli.cjs --format JSON __tests__/fixtures/char_enc_tags.mp3
```

```json
"Title":"utf-8 〃𐍈",
"Album":{"@dt":"binary.base64","#value": "dXRmLTE2IMOj4hqsxhk="},
"Track":"utf-8 〃𐍈",
"Performer":"latin-1 ã��",
"Genre":{"@dt":"binary.base64","#value": "dXRmLTE2YmUgw6PiGqzGGQ=="},
```

❌ https://mediaarea.net/MediaInfoOnline - Mediainfo official WASM build (MediaInfoLib v24.01)

```json
"Title":"utf-8 〃𐍈",
"Album":{"@dt":"binary.base64","#value": "dXRmLTE2IMOj4hqsxhk="},
"Track":"utf-8 〃𐍈",
"Performer":"latin-1 ã��",
"Genre":{"@dt":"binary.base64","#value": "dXRmLTE2YmUgw6PiGqzGGQ=="},
```

❌ mediainfo CLI (MediaInfoLib v24.01)

Displays the Latin-1 tag correctly, but not the UTF-16/UTF-16BE ones...

```shell
$ mediainfo __tests__/fixtures/char_enc_tags.mp3
[...]
Album : utf-16 〃??
Track name : utf-8 〃𐍈
Performer : latin-1 ãâ¬Æ
Genre : utf-16be 〃??
[...]
```
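To see which bytes sit behind one of the `binary.base64` values above, the value can be decoded and printed as hex. This is only an inspection aid; it makes no claim about why MediaInfoLib emitted these bytes. It assumes a runtime with a global `atob` (browsers, Node.js 16+).

```javascript
// Value copied from the v0.2.1 "Album" field above.
const albumBase64 = "dXRmLTE2IMOj4hqsxhk=";

// atob returns a "binary string" in which each character holds one byte.
const bytes = Uint8Array.from(atob(albumBase64), (c) => c.charCodeAt(0));

const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
console.log(hex); // 75 74 66 2d 31 36 20 c3 a3 e2 1a ac c6 19
```

The first seven bytes are the ASCII text "utf-16 "; the remaining bytes are what replaced the non-ASCII part of the tag.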

These are just some preliminary findings. I'll have to see when I find the time to do some version bisecting on mediainfo.js.

icidasset commented 7 months ago

Thanks @buzz! I managed to make a build with a GitHub Action. I've removed the --disable-unicode flag from the ZenLib compilation, which I assume also put MediaInfoLib in non-Unicode mode, and added the LC_CTYPE=en_US.UTF-8 env var. But... no luck.

JeromeMartinez commented 7 months ago

> https://mediaarea.net/MediaInfoOnline - Mediainfo official WASM build (MediaInfoLib v24.01)
> ❌ mediainfo CLI (MediaInfoLib v24.01)

This is annoying... Weird, the Windows version is fine, so maybe there is some misconfiguration somewhere about the locale. Note that MediaInfo CLI v22.09 also has the issue, so it seems that nothing changed there on our side.

IMO 2 issues:

github-actions[bot] commented 6 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 30 days since being marked as stale.