icidasset closed this issue 5 months ago.
That seems to be an old file that uses latin-1 encoding.
I think the mediainfo CLI converts those strings to UTF-8 before printing them to the screen. A lot of flags are disabled for the emscripten build. You could test by building mediainfo.js with those flags enabled.
Another approach would be to solve this outside of mediainfo.js. There are libs that detect and convert between encodings. Or even make UTF-8 mandatory as character encoding in your app.
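As a rough illustration of handling this outside of mediainfo.js, here is a minimal sketch that assumes you can get at the raw tag bytes. The helper name `decodeTagBytes` is hypothetical and not part of mediainfo.js. It tries strict UTF-8 first and falls back to Latin-1 when the bytes are not valid UTF-8. This is a heuristic, not real encoding detection.

```typescript
// Hypothetical helper, not part of mediainfo.js: decode raw tag bytes,
// preferring UTF-8 and falling back to Latin-1 (ISO-8859-1).
function decodeTagBytes(bytes: Uint8Array): string {
  try {
    // fatal: true makes the decoder throw on invalid UTF-8 instead of
    // silently inserting U+FFFD (the replacement character seen in this issue).
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    // Latin-1 maps every byte 0x00-0xFF to the code point of the same value.
    let out = "";
    for (let i = 0; i < bytes.length; i++) {
      out += String.fromCharCode(bytes[i]);
    }
    return out;
  }
}

// "Âme" stored as Latin-1. The byte 0xC2 followed by "m" is not valid UTF-8.
const latin1Bytes = new Uint8Array([0xc2, 0x6d, 0x65]);

// A lenient UTF-8 decode replaces the invalid byte with U+FFFD.
const lenient = new TextDecoder("utf-8").decode(latin1Bytes);

// The fallback decode recovers the intended text.
const recovered = decodeTagBytes(latin1Bytes);
```

The lenient decode reproduces the symptom from this issue (a replacement character followed by "me"), while the fallback yields "Âme". Short Latin-1 strings that happen to be valid UTF-8 would be misread by this approach, so a dedicated detection library is the safer choice for real data.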
Thanks for investigating! 🙏
So weird that it does work in your demo. I guess the difference is that I am reading chunks using the fetch API (file is hosted on Amazon S3) and the Range header, whereas your demo reads the whole file using FileReader. And the demo is an older version of mediainfo. Maybe I need to set some encoding option/header somewhere?
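For reference, chunked reading over HTTP can be sketched as below. This is not the code from the linked app. The names `FetchLike`, `rangeHeaderValue` and `makeRangeReader` are made up for illustration, and the sketch assumes the read callback receives a chunk size and an offset. The relevant point for encoding is that the body is read as raw bytes, never as text.

```typescript
// Minimal fetch-like signature so the sketch does not depend on DOM typings.
type FetchLike = (
  url: string,
  init: { headers: Record<string, string> }
) => Promise<{ arrayBuffer(): Promise<ArrayBuffer> }>;

// Build the HTTP Range header value for `chunkSize` bytes starting at `offset`.
// Range ends are inclusive, hence the "- 1".
function rangeHeaderValue(offset: number, chunkSize: number): string {
  return "bytes=" + offset + "-" + (offset + chunkSize - 1);
}

// Hypothetical factory for a readChunk(chunkSize, offset) callback.
function makeRangeReader(url: string, fetchImpl: FetchLike) {
  return function readChunk(chunkSize: number, offset: number): Promise<Uint8Array> {
    if (chunkSize === 0) {
      return Promise.resolve(new Uint8Array(0));
    }
    return fetchImpl(url, { headers: { Range: rangeHeaderValue(offset, chunkSize) } })
      // Read raw bytes. Decoding the body as text here would interpret it as
      // UTF-8 and corrupt Latin-1 tag bytes before the parser ever sees them.
      .then((response) => response.arrayBuffer())
      .then((buffer) => new Uint8Array(buffer));
  };
}
```

In a browser this would be used as `makeRangeReader(url, fetch)`. As long as the bytes arrive unmodified, reading by range or reading the whole file should hand the parser the same data.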
even make UTF-8 mandatory as character encoding in your app.
I have <meta charset="utf-8" /> set in my HTML file, I assume that's all I need?
I also tried converting to UTF-8 manually, but no luck. I'll take a look at the different mediainfo/emscripten flags.
When I open the file in a tag editor and recreate the tags, mediainfo.js shows them properly
Could you share the resulting file?
Because I don't get how it could be different. Maybe they change the encoding to UTF-8? But MediaInfoLib should manage the difference and show the same MediaInfo::Get() output.
Could you debug the MediaInfo::Get() result and output a hex dump of the output from the 2 files?
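On the JavaScript side, a hex dump of a returned string could look like the sketch below. The helper names `hexCodeUnits` and `hexUtf8` are hypothetical, and the string passed in is assumed to be whatever value mediainfo.js hands back for the field in question.

```typescript
// Pad a hex string on the left with zeros up to `width` characters.
function padHex(hex: string, width: number): string {
  let out = hex;
  while (out.length < width) {
    out = "0" + out;
  }
  return out;
}

// Hex of each UTF-16 code unit, i.e. the string as JavaScript stores it.
function hexCodeUnits(s: string): string {
  const parts: string[] = [];
  for (let i = 0; i < s.length; i++) {
    parts.push(padHex(s.charCodeAt(i).toString(16), 4));
  }
  return parts.join(" ");
}

// Hex of the string re-encoded as UTF-8 bytes.
function hexUtf8(s: string): string {
  const bytes = new TextEncoder().encode(s);
  const parts: string[] = [];
  for (let i = 0; i < bytes.length; i++) {
    parts.push(padHex(bytes[i].toString(16), 2));
  }
  return parts.join(" ");
}

// Correct "Âme": code units 00c2 006d 0065, UTF-8 bytes c3 82 6d 65.
// Broken value from this issue: code units fffd 006d 0065.
```

If the dump of the broken value starts with `fffd`, the original byte was already lost before the string reached the application, so no conversion on the app side can recover it.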
TPE1(encoding=<Encoding.LATIN1: 0>, text=['Âme'])
MediaInfoLib parses it as Latin1 too and provides the string in Unicode or UTF-8, so no reason that mediainfo.js shows "�me" with this file on the command line.
So weird that it does work in your demo. I guess the difference is that I am reading chunks using the fetch API (file is hosted on Amazon S3) and the Range header, whereas your demo reads the whole file using FileReader
Input kind should not change MediaInfoLib behavior.
I think the mediainfo CLI converts those strings to UTF-8 before printing them to the screen.
It converts to the local code page. But if the OS is misconfigured, e.g. LC_CTYPE is not set to something like "en_US.UTF-8", it may convert the internal Unicode to the C locale (so Latin1... but from a UTF-8 string, hence the issue). Maybe a similar issue with mediainfo.js?
A lot of flags are disabled for the emscripten build.
Should not have any impact. There is a known issue with modern platforms (having UTF-8 encoding for the terminal) and such MP3 files with Latin1 encoding only if MediaInfoLib is compiled in non Unicode mode (MEDIAINFO_UNICODE_NO).
If possible, a first debugging step would be to look at the result of MediaInfo::Get() and see how it is encoded (Unicode chars? UTF-8?), so you know whether the issue comes from MediaInfoLib or from something else (mediainfo.js, or a platform config somewhere not handling UTF-8).
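A quick way to triage a returned string along those lines is sketched below. The `diagnose` function and its labels are hypothetical, and the check is a heuristic that only looks for two-byte UTF-8 sequences.

```typescript
type Diagnosis = "replacement-char" | "utf8-read-as-latin1" | "looks-fine";

// Heuristic triage of a decoded string (hypothetical helper).
function diagnose(value: string): Diagnosis {
  // U+FFFD means some decoder upstream already hit an invalid byte and
  // replaced it. The original byte cannot be recovered from the string.
  if (value.indexOf("\uFFFD") !== -1) {
    return "replacement-char";
  }
  // A char in the UTF-8 lead-byte range (0xC2-0xDF) followed by a char in
  // the continuation range (0x80-0xBF) suggests UTF-8 bytes were
  // interpreted one byte per character, i.e. as Latin-1.
  if (/[\u00C2-\u00DF][\u0080-\u00BF]/.test(value)) {
    return "utf8-read-as-latin1";
  }
  return "looks-fine";
}
```

The first outcome points at a decoding step that treated Latin-1 bytes as UTF-8, the second at the reverse mix-up. Legitimate text can occasionally match the second pattern, and strings decoded as windows-1252 rather than strict Latin-1 may slip past it, so treat the result as a hint only.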
Thanks for chiming in, @JeromeMartinez! I'd love to help out more with this, but I'm having a difficult time getting the project to build 🙈
Anyhow! I did figure out that it seems to be a problem with mediainfo.js v0.2.
I've updated the demo from the gh-pages-src branch to v0.2.1 (from v0.1.9) and then the encoding problem showed up there too.
So I guess we can exclude any app specific code.
I created an MP3 file with tags in different encodings. Most media players and tag editors display the tags just fine (one exception: EasyTAG fails to display the UTF-16BE tag).
A test case and the mp3 test file can be found in the 150-id3-character-encodings branch.
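For context on what "different encodings" means here: in ID3v2.4 the body of a text frame starts with one encoding byte (0 = ISO-8859-1, 1 = UTF-16 with BOM, 2 = UTF-16BE, 3 = UTF-8), followed by the text. The sketch below decodes such a frame body. It is an illustration with hypothetical helper names, not the test case from the branch and not how MediaInfoLib implements it.

```typescript
// Decode UTF-16 bytes by hand so the sketch does not rely on
// TextDecoder supporting "utf-16be" in every runtime.
function decodeUtf16(data: Uint8Array, littleEndian: boolean): string {
  let out = "";
  for (let i = 0; i + 1 < data.length; i += 2) {
    const unit = littleEndian
      ? data[i] | (data[i + 1] << 8)
      : (data[i] << 8) | data[i + 1];
    out += String.fromCharCode(unit);
  }
  return out;
}

// Latin-1 maps each byte directly to the code point of the same value.
function decodeLatin1(data: Uint8Array): string {
  let out = "";
  for (let i = 0; i < data.length; i++) {
    out += String.fromCharCode(data[i]);
  }
  return out;
}

// Hypothetical helper: decode the body of an ID3v2.4 text frame
// (encoding byte followed by the text).
function decodeId3TextFrame(body: Uint8Array): string {
  const encoding = body[0];
  const data = body.subarray(1);
  let text: string;
  switch (encoding) {
    case 0: // ISO-8859-1
      text = decodeLatin1(data);
      break;
    case 1: { // UTF-16 with byte order mark
      const le = data[0] === 0xff && data[1] === 0xfe;
      const be = data[0] === 0xfe && data[1] === 0xff;
      // Assumption: without a BOM, fall back to big-endian.
      text = decodeUtf16(le || be ? data.subarray(2) : data, le);
      break;
    }
    case 2: // UTF-16BE without byte order mark
      text = decodeUtf16(data, false);
      break;
    case 3: // UTF-8
      text = new TextDecoder("utf-8").decode(data);
      break;
    default:
      throw new Error("Unknown ID3v2 text encoding: " + encoding);
  }
  // Strip trailing null terminators, if any.
  return text.replace(/\0+$/, "");
}
```

All four encodings of "Âme" should decode to the same JavaScript string. A parser that skips the encoding byte and treats every frame as UTF-8 would produce the replacement character for the Latin-1 case.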
Findings for different flavors of mediainfo:
These are just some preliminary findings. I'll have to see when I find the time to do some version bisecting on mediainfo.js.
Thanks @buzz
I managed to make a build with a GitHub Action.
I've removed the --disable-unicode flag from the zenlib compilation, which I assume also put mediainfolib in non-unicode mode, and added the LC_CTYPE=en_US.UTF-8 env var.
But... no luck.
❌ https://mediaarea.net/MediaInfoOnline - Mediainfo official WASM build (MediaInfoLib v24.01)
❌ mediainfo CLI (MediaInfoLib v24.01)
This is annoying... Weird, the Windows version is fine, so maybe there is some misconfiguration somewhere about the locale. Note that MediaInfo CLI v22.09 also has the issue, so it seems that nothing changed there on our side.
IMO 2 issues:
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 30 days since being marked as stale.
Checklist
Bug Description
The metadata of an MP3 file has character encoding issues. It's listing the artist as �me instead of Âme.
Steps to Reproduce
Where getSize and readChunk are: https://github.com/icidasset/diffuse/blob/b11d5b0a8d204ed3db401c0eedb8da25e4076131/src/Javascript/processing.ts#L116-L155
Using the file: https://www.dropbox.com/scl/fi/u7qw9rgo7xjmcajsh1oex/211-me_-_rej.mp3?rlkey=tmwckk19p4n7r7gd5g8ngk7cz&dl=0
You can try it out in the app itself using this branch: https://github.com/icidasset/diffuse/tree/encoding-issue-mediainfo
You'll have to upload the file to a supported service in order to test it though. I can do that for you if you want, let me know.
Expected Behavior
Expected to see the artist Âme
Actual Behavior
Got the artist �me
Environment
Additional Information
I tested this with the mediainfo CLI tool installed via Homebrew (it uses MediaInfoLib v24.01), and it did parse the metadata correctly. Other metadata parsers show the info correctly too.
Thanks for this great project!