APE file unidentifiable Chinese code

melinyi commented 2 years ago

Hi, I found an APE file whose Chinese tag information is garbled:

/file/d/1D_MSa40Y55o2N0KLsqZuyTPBnt3Rr8-Y/view?usp=sharing

melinyi commented 2 years ago

{ "Path": "C:\Users\Line\Desktop\赵媛 - 东方美\男孩不哭 1989 APE\03.嘿!你写日记吗.ape", "Title": "03��!��д�ռ��?", "Artist": "С��", "Composer": "", "Comment": "Exact Audio Copy", "Genre": "Pop", "Album": "�к��", "OriginalAlbum": "", "OriginalArtist": "", "Copyright": "", "Description": "", "Publisher": "", "PublishingDate": "0001-01-01T00:00:00", "AlbumArtist": "", "Conductor": "", "ProductId": "", "Date": "1989-01-01T00:00:00", "Year": 1989, "TrackNumber": 3, "TrackTotal": 0, "DiscNumber": 0, "DiscTotal": 0, "Popularity": 0.0, "PictureTokens": [ ], "ChaptersTableDescription": "", "Chapters": [ ], "Lyrics": { "ContentType": 1, "Description": "", "LanguageCode": "", "UnsynchronizedLyrics": "", "SynchronizedLyrics": [ ] }, "AdditionalFields": { }, "MetadataFormats": [ [ "ape" ] ], "Bitrate": 705, "SampleRate": 44100.0, "IsVBR": false, "CodecFamily": 1, "AudioFormat": [ ".ape" ], "Duration": 272, "DurationMs": 271666.6666666667, "ChannelsArrangement": { "Description": "Stereo (2/0.0)", "NbChannels": 2 }, "TechnicalInformation": { "AudioDataOffset": 0, "AudioDataSize": 23954860 }, "EmbeddedPictures": [ ] }

melinyi commented 2 years ago

I have uploaded the file. It may have the same problem as the previous CUE file.

Zeugma440 commented 2 years ago

As opposed to CUE files whose encoding is completely free, the APE tag specs recommend using UTF-8 (see https://wiki.hydrogenaud.io/index.php?title=APE_key), which is what ATL is using.

My guess is some taggers may have chosen to use the default encoding of the OS they're running on. This works fine as long as you're reading these files on devices with that same default encoding. But it's a bad idea for interoperability.
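A minimal sketch of that failure mode, using only System.Text and not ATL itself: text is written with a legacy code page and then read back as UTF-8, as a spec-compliant APE reader would do. The class and method names are hypothetical, and GB18030 is only an assumed example of an "OS default" encoding.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Encodes text with a legacy code page, then decodes the same bytes as UTF-8.
    public static string WriteLegacyReadUtf8(string text, string legacyName)
    {
        // Legacy code pages such as GB18030 need this provider on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        byte[] raw = Encoding.GetEncoding(legacyName).GetBytes(text);

        // Encoding.UTF8 replaces every invalid byte sequence with U+FFFD.
        return Encoding.UTF8.GetString(raw);
    }

    public static void Main()
    {
        string garbled = WriteLegacyReadUtf8("男孩不哭", "GB18030");

        // True: the GB18030 bytes are not valid UTF-8, so replacement characters appear.
        Console.WriteLine(garbled.IndexOf('\uFFFD') >= 0);
    }
}
```

Pure ASCII values such as "Pop" survive this round trip unchanged, which is why only the Chinese fields look damaged.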

On ATL's side, when specs formally recommend the use of a given encoding, we can't ignore that and assume every file is non-compliant. Plus, trying to guess the encoding of every value in every file would strongly degrade performance when using the library for mass operations.

=> The only solution I can think of is to allow you to manually set the encoding you'd like to decode/encode APE tags with, by adding a new field to ATL.Settings. Would that work for you?

melinyi commented 2 years ago

These files are sent to me by users. I only know that they come from Taiwan, China, and I don't know how to deal with them.

They usually arrive as album folders.

Zeugma440 commented 2 years ago

If you don't have control over these files, I could set up an optional "encoding discovery mode" that might do the trick in that case. I'll experiment with that and get back to you.

melinyi commented 2 years ago

I couldn't read it in MP3TAG either, because MP3TAG simply filters out the garbled characters.

melinyi commented 2 years ago

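If you don't have control over these files, I could set up an optional "encoding discovery mode" that might do the trick in that case. I'll experiment with that and get back to you.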

Thank you

melinyi commented 2 years ago

using System.Text;

// GB18030 is not built into .NET Core / .NET 5+, so register the code pages provider first.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
ATL.Settings.DefaultTextEncoding = Encoding.GetEncoding("GB18030");
ATL.Track track = new ATL.Track(@"C:\Users\Line\Desktop\赵媛 - 东方美\男孩不哭  1989 APE\03.嘿!你写日记吗.ape");
System.Diagnostics.Debug.WriteLine(track.Title);

@Zeugma440 Setting ATL.Settings.DefaultTextEncoding doesn't seem to work.

Zeugma440 commented 2 years ago

I haven't done anything yet for this issue. Please wait until I confirm the fix and explain how to use the new settings.

Zeugma440 commented 2 years ago

@melinyi I've got bad news... The library I successfully used to guess the encoding of CUE sheets (see #138) fails to guess the encoding of the fields from the file you sent. I suppose there's too little data for it to work properly.

That leaves us with the option of using ATL.Settings.DefaultTextEncoding exactly as you just tried. I can make it work with APE files, but it will decode every file with the encoding you set, even files that actually use UTF-8.

That solution would only be good if the vast majority of the APE files you're manipulating have their tags encoded with something other than UTF-8 (e.g. GB18030).

Let me know if that's okay for you.
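For callers who do hold the raw tag bytes, one common way to soften that trade-off is to try strict UTF-8 first and fall back to the legacy encoding only when the bytes are not valid UTF-8. The sketch below is not ATL's implementation: `TagTextDecoder` and `Decode` are hypothetical names, and the heuristic can misfire on short legacy strings that happen to form valid UTF-8.

```csharp
using System;
using System.Text;

public static class TagTextDecoder
{
    // throwOnInvalidBytes: true makes GetString throw instead of emitting U+FFFD.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns the UTF-8 reading when the bytes are valid UTF-8, else the fallback reading.
    public static string Decode(byte[] raw, Encoding fallback)
    {
        try
        {
            return StrictUtf8.GetString(raw);
        }
        catch (DecoderFallbackException)
        {
            return fallback.GetString(raw);
        }
    }

    public static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding gb18030 = Encoding.GetEncoding("GB18030");

        byte[] legacy = gb18030.GetBytes("男孩不哭");           // non-compliant tag value
        byte[] compliant = Encoding.UTF8.GetBytes("男孩不哭");  // spec-compliant tag value

        // Both calls return the original string.
        Console.WriteLine(Decode(legacy, gb18030) == "男孩不哭");
        Console.WriteLine(Decode(compliant, gb18030) == "男孩不哭");
    }
}
```

This keeps spec-compliant files readable while still accepting one known legacy encoding, at the cost of an extra validation pass per value.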

melinyi commented 2 years ago

https://tieba.baidu.com/p/5410932979

I found an article that said something like this:

Causes of garbled text

Garbled text mainly results from a mismatch between the encoding used to store the tag text and the encoding used to display it. Each tag format supports the following encodings:

ID3v1: ISO-8859-1 only

ID3v2.3: ISO-8859-1, UTF-16

ID3v2.4: ISO-8859-1, UTF-16, UTF-8

APEv2: UTF-8

The average MP3 player reads tag content according to the encodings defined by the ID3 standard (for example, Linux/FreeBSD systems rely on the libid3tag library). As the list above shows, whatever the tag standard (ID3v1, ID3v2, APEv2), the tag displays correctly as long as its content is stored in a Unicode encoding. Strictly speaking, ID3v1's ISO-8859-1 does not support Chinese, but that does not prevent Chinese bytes from being stored in it. If the content is Chinese text encoded in GBK, GB18030, BIG5 or similar, the player still reads it as ISO-8859-1, and garbled text is inevitable.

In practice, most strings stored in ID3v1 use the system's ANSI code page, so the encoding depends on the language of the system: a Simplified Chinese system uses GB2312, a Traditional Chinese system uses BIG5, and a Japanese system uses a JIS encoding. So if an MP3 only carries ID3v1 information, it will display as garbled text in players running on an operating system set to a different language.

An ID3v2.3 tag is stored at the beginning of the MP3 file. It consists of a tag header and several frames. According to the ID3v2.3 standard, in text frames, the first byte after the frame header (frame ID, size and flags) indicates whether the text is ISO-8859-1 or Unicode: 0 for ISO-8859-1, 1 for Unicode. A Unicode string begins with a Unicode BOM (0xFF 0xFE) and ends with two zero bytes (0x00 0x00).
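The encoding byte described above can be read as follows. This is a minimal sketch for ID3v2.3 text frames only, assuming the caller has already stripped the frame header and passes just the frame body; `Id3v23TextFrame` and `DecodeBody` are hypothetical names, not part of any library.

```csharp
using System;
using System.Text;

public static class Id3v23TextFrame
{
    // body = frame content that follows the frame header (ID, size, flags).
    public static string DecodeBody(byte[] body)
    {
        if (body.Length == 0) return string.Empty;

        string text;
        if (body[0] == 0x00)
        {
            // Encoding byte 0: ISO-8859-1
            text = Encoding.GetEncoding("ISO-8859-1").GetString(body, 1, body.Length - 1);
        }
        else if (body[0] == 0x01)
        {
            // Encoding byte 1: UTF-16 with BOM (FF FE = little-endian, FE FF = big-endian)
            bool bigEndian = body.Length >= 3 && body[1] == 0xFE && body[2] == 0xFF;
            bool littleEndian = body.Length >= 3 && body[1] == 0xFF && body[2] == 0xFE;
            Encoding utf16 = bigEndian ? Encoding.BigEndianUnicode : Encoding.Unicode;
            int start = (bigEndian || littleEndian) ? 3 : 1;
            text = utf16.GetString(body, start, body.Length - start);
        }
        else
        {
            throw new NotSupportedException("Unknown ID3v2.3 text encoding byte: " + body[0]);
        }

        // Drop the optional null terminator.
        return text.TrimEnd('\0');
    }
}
```

A player that ignores this byte, or a tagger that writes GBK bytes while declaring encoding 0, produces exactly the garbled output discussed in this thread.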

So if you want to store Chinese information, it is best to store Unicode strings in ID3v2.3. Such an MP3 will play back on any system without garbled text.

Fixing garbled text

We just need to convert the Chinese content stored in the MP3 tag as GBK, GB18030, BIG5 or similar into a Unicode encoding; basically all players will then read the MP3 tag correctly. Since ID3v1 does not support Chinese in principle and its fields are too short, ID3v2 tags should generally be used for Chinese information (most players support ID3v2.3). In addition, you may need to consider whether the local settings are UTF-8 or UTF-16. Generally, change the settings to UTF-16. If you use UTF-8, you can convert the MP3 tags to ID3v2.4.
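At the byte level, that conversion amounts to decoding with the source code page and re-encoding as Unicode. A minimal sketch using `Encoding.Convert`, assuming the source encoding is already known (which is the hard part); `TagTranscoder` and `ToUtf8` are hypothetical names, and writing the result back into a tag is left to the tagging tool.

```csharp
using System;
using System.Text;

public static class TagTranscoder
{
    // Re-encodes raw tag bytes from a known legacy code page to UTF-8.
    public static byte[] ToUtf8(byte[] raw, string sourceEncodingName)
    {
        // Needed for GBK, GB18030, BIG5 and similar on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding source = Encoding.GetEncoding(sourceEncodingName);
        return Encoding.Convert(source, Encoding.UTF8, raw);
    }
}
```

If the wrong source encoding is passed in, the call still succeeds but yields different garbled text, so the result should be checked before overwriting the original tag.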

= = = = = = = = =

The reprint above explains the principle. What follows focuses on using AIMP:

In Windows, you can modify tags with the tag editor that ships with AIMP or with a third-party tag editing tool (such as MP3Tag).

For whole-album files that use a CUE sheet, if garbled characters appear (they also appear after editing with the default tag settings), you need to convert the CUE file's encoding to UTF-8-BOM.

Similarly, if external .LRC lyrics files are garbled, convert the lyrics file's encoding to UTF-8-BOM as well.

On Android, you can change the default ANSI tag encoding from windows1251 to GBK under Settings - Playlist, and then pull down to refresh the playlist.

melinyi commented 2 years ago

Is the following method feasible?

We just need to convert the Chinese content stored in the tags as GBK, GB18030, BIG5 or similar into a Unicode encoding; basically all players will then read the tags correctly.

melinyi commented 2 years ago

I agree with your plan. I read your source code this afternoon and tried to detect the encoding of the corresponding bytes, but I couldn't determine it. At present, the best solution is to apply ATL.Settings.DefaultTextEncoding in Apetag.cs.

Zeugma440 commented 2 years ago

https://tieba.baidu.com/p/5410932979

All that is written there is correct. And if we could convert all these tags to a "spec-compliant" encoding, that could indeed fix our problems. However:
1/ You told me you have no control over these audio files, as they are uploaded by your clients.
2/ The primary problem is that we have no reliable method to find out which encoding is used for a given APE file.

I agree with your plan. I read your source code this afternoon and tried to detect the encoding of the corresponding bytes, but I couldn't determine it. At present, the best solution is to apply ATL.Settings.DefaultTextEncoding in Apetag.cs.

Alright, thanks for your feedback. I've just made it possible. This will ship in the next release.

Zeugma440 commented 2 years ago

Available in today's v4.06.

melinyi commented 2 years ago

Thank you. I just read some Chinese encoding detection code in which the constructor throws an exception if the input length is less than 100. I think that's the problem.

Zeugma440 commented 2 years ago

Indeed. If we just have a title and an author, there's too little data to work with.

Zeugma440 / atldotnet

APE file unidentifiable Chinese code #143