Chinese encoding exception

Zeugma440 / atldotnet

Fully managed, portable and easy-to-use C# library to read and edit audio data and metadata (tags) from various audio formats, playlists and CUE sheets

MIT License


Chinese encoding exception #253

Closed. j4587698 closed this issue 7 months ago

j4587698 commented 7 months ago

The problem

I have an MP3 file, which displays the title and artist correctly on Windows, but it shows up as garbled text in the library.

Environment

ATL version (or git revision) that exhibits the issue: Latest
Last ATL version that did not exhibit the issue (if applicable):
OS/version used to run ATL: Windows 11

Details

This screenshot is from viewing it on Windows, where 1 is the title, and 2 is the artist.

This screenshot is from the library, where the artist has turned into garbled text.

My MP3 file is attached: 4249925507.zip

Could this be the same issue as #147?

Zeugma440 commented 7 months ago

Hello there,

First of all, #147 is about a CUE file, not about an audio file. It's not the same issue.

The problem you have is as follows:

Your file is tagged with ID3v1.
The ID3v1 specification does not set any encoding for strings. Most implementations use ANSI or Latin-1 (ISO-8859-1), but others simply use the device's default locale. That's probably what happened with your file, which seems to have been tagged on a Chinese device.
Windows Explorer uses your default locale to decode ID3v1 strings, which is why it is able to read them properly.
ATL uses hardcoded Latin-1 to decode and encode ID3v1 strings.
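
The mismatch described above can be reproduced in a few lines. This is a minimal Python sketch, not ATL code. It assumes the tag was written with the Chinese ANSI code page (GBK/GB2312), which the thread suggests but does not confirm, and the sample artist name is made up for illustration.

```python
# Hypothetical artist name standing in for the real tag content.
original = "周杰伦"

# Assumption: the tagging software wrote the field with the Chinese ANSI code page.
raw = original.encode("gbk")

# Decoding those bytes as Latin-1 never fails, because every byte maps to
# exactly one character, but the result is unrelated accented characters.
as_latin1 = raw.decode("latin-1")

# Decoding with the code page that was actually used restores the text.
as_gbk = raw.decode("gbk")

print(repr(as_latin1))  # garbled: one Latin-1 character per byte
print(as_gbk)           # the original text
```

The same bytes are valid under both encodings, which is why neither decoder can report an error on its own.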

One possible solution would be to make ATL decode and encode ID3v1 strings with Settings.DefaultTextEncoding, which you could set to Encoding.Default to be able to handle Chinese strings... But then you'd have issues handling Western ID3v1 tags that actually use Latin-1 encoding 😅

Is that worth trying?

j4587698 commented 7 months ago

I tried using Default and GB2312, but the content is still garbled.

j4587698 commented 7 months ago

Could we try to perform encoding detection on id3v1? I remember that encoding detection can distinguish between ANSI and Latin-1, and since id3v1 is not large, only 128 bytes, the detection speed should be acceptable.

Zeugma440 commented 7 months ago

> Could we try to perform encoding detection on id3v1? I remember that encoding detection can distinguish between ANSI and Latin-1, and since id3v1 is not large, only 128 bytes, the detection speed should be acceptable.

I actually tried that yesterday, trying both UDE and UTF-Unknown. Unfortunately, there's not enough data for the encoding detection to succeed, especially on the file you submitted where there are 10 valid bytes to play with.
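
To make the "not enough data" point concrete, here is a small Python sketch (not ATL code) that pulls the raw text fields out of a trailing ID3v1 tag and counts the non-ASCII bytes a detector would have to work with. The field layout is the standard 128-byte ID3v1 layout; the function names and the sample values are hypothetical.

```python
import struct

# ID3v1 layout: "TAG" + title(30) + artist(30) + album(30) + year(4)
# + comment(30) + genre(1) = 128 bytes at the very end of the file.
ID3V1 = struct.Struct("<3s30s30s30s4s30sB")

def read_id3v1_raw(data: bytes):
    """Return the raw (undecoded) text fields of a trailing ID3v1 tag, or None."""
    if len(data) < ID3V1.size:
        return None
    marker, title, artist, album, year, comment, _genre = ID3V1.unpack(data[-ID3V1.size:])
    if marker != b"TAG":
        return None
    fields = {"title": title, "artist": artist, "album": album,
              "year": year, "comment": comment}
    # Fields are padded with NUL bytes (sometimes spaces); keep only the payload.
    return {name: value.split(b"\x00", 1)[0].rstrip(b" ")
            for name, value in fields.items()}

def non_ascii_bytes(fields) -> int:
    """Count the bytes >= 0x80, the only ones that carry encoding information."""
    return sum(1 for value in fields.values() for b in value if b >= 0x80)

# Made-up tag: ASCII title, three-character GBK artist, everything else empty.
sample = ID3V1.pack(b"TAG", b"Title", "周杰伦".encode("gbk"), b"", b"2000", b"", 255)
fields = read_id3v1_raw(b"...audio data..." + sample)
print(fields["artist"], non_ascii_bytes(fields))
```

With only a handful of non-ASCII bytes per tag, a statistical detector has very little signal to separate one legacy code page from another.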

> I tried using Default and GB2312, but the content is still garbled

What I wrote was a suggestion, to see if you'd agree. The latest version of ATL doesn't use Settings.DefaultTextEncoding to decode ID3v1, so changing that setting has no effect yet. Does that mean I can implement that?

j4587698 commented 7 months ago

Yes, you are right. However, since decoding as Latin-1 is a lossless mapping (every byte becomes exactly one character), I might be able to handle this issue on my own: I can convert the string back into bytes and then check whether those bytes look like Chinese text. Detecting Chinese characters is straightforward, because GB2312 has regular byte patterns, as do BIG5, Shift-JIS, EUC-KR, and others. This kind of pattern matching might misclassify some Western strings, but when the data is known to be biased toward Chinese content, that should be acceptable.
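
The workaround described here can be sketched as follows. This is an illustrative Python version, not ATL code and not necessarily what was eventually implemented. It assumes the library handed back a string decoded as true ISO-8859-1 and that the original bytes were GB2312 (EUC-CN); the function names are hypothetical.

```python
def looks_like_gb2312(raw: bytes) -> bool:
    """Heuristic: every non-ASCII byte must sit in a GB2312 lead/trail pair
    (lead 0xA1-0xF7, trail 0xA1-0xFE), and at least one pair must exist."""
    i, pairs = 0, 0
    while i < len(raw):
        if raw[i] < 0x80:          # plain ASCII, skip
            i += 1
            continue
        if i + 1 >= len(raw):      # dangling high byte, cannot be a pair
            return False
        if not (0xA1 <= raw[i] <= 0xF7 and 0xA1 <= raw[i + 1] <= 0xFE):
            return False
        pairs += 1
        i += 2
    return pairs > 0

def fix_id3v1_text(garbled: str, encoding: str = "gb2312") -> str:
    """Undo a Latin-1 decode and re-decode as `encoding` when the bytes fit."""
    try:
        raw = garbled.encode("latin-1")   # lossless inverse of the Latin-1 decode
    except UnicodeEncodeError:
        return garbled                    # contains characters Latin-1 cannot produce
    if not looks_like_gb2312(raw):
        return garbled
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return garbled                    # in range, but not an assigned code point

garbled = "周杰伦".encode("gb2312").decode("latin-1")
print(fix_id3v1_text(garbled))
print(fix_id3v1_text("Café"))
```

The Western false positives mentioned above show up with strings such as "Ãé": two adjacent accented Latin-1 characters also pass the byte-range check, so the heuristic only makes sense for a collection that is mostly Chinese.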

Zeugma440 commented 7 months ago

Sounds feasible too. Let me know if you need anything 👍

I'll close the issue in 30 days if nothing happens.