drunkenQCat commented 1 year ago

The problem

When I wrote some Chinese metadata to a WAV file, the metadata came out as garbled text. I tried to decipher the garbled text and found that it had been encoded as ISO-8859-1 but decoded as UTF-8. Besides that, all the Chinese metadata written to my bext chunk turned into question marks (byte 0x3F). I am wondering why. Is there any way to avoid writing garbled text?
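For reference, the deciphering step can be sketched roughly as follows. This assumes the garbled text is what you get when UTF-8 bytes are displayed as ISO-8859-1; the class and method names are made up for illustration and are not part of the library.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only (not part of atldotnet).
public static class MojibakeDemo
{
    // Simulates a reader that wrongly interprets UTF-8 bytes as ISO-8859-1.
    public static string Garble(string text) =>
        Encoding.Latin1.GetString(Encoding.UTF8.GetBytes(text));

    // Reverses the mistake: re-encode the garbled text as ISO-8859-1
    // to get the raw bytes back, then decode those bytes as UTF-8.
    public static string Recover(string garbled) =>
        Encoding.UTF8.GetString(Encoding.Latin1.GetBytes(garbled));
}
```

The round trip is lossless because ISO-8859-1 maps each of the 256 byte values to exactly one character, so no byte is lost when the garbled text is re-encoded.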

Environment

Tested in a codespace and on Windows, with .NET 7.

Details

Zeugma440 commented 1 year ago

Hi and thanks for your feedback.

Could you please tell me which field you're saving to? WAV has many chunks that follow different specifications.

drunkenQCat commented 1 year ago

    public void WriteMetaData()
    {
        foreach (var item in LogList)
        {
            foreach (var bwf in item.bwfList)
            {
                Track tr = new(bwf.FullName);
                // iXML fields are written through AdditionalFields
                WriteAdditional(tr, "ixml.SCENE", item.scn + "-" + item.sht);
                WriteAdditional(tr, "ixml.TAKE", item.tk.ToString());
                WriteAdditional(tr, "ixml.NOTE", item.scnNote + "," + item.shtNote);
                WriteAdditional(tr, "ixml.CIRCLED", (item.okTk == TkStatus.ok) ? "TRUE" : "FALSE");
                WriteAdditional(tr, "ixml.TAKE_TYPE", (item.okTk == TkStatus.bad) ? "NO_GOOD" : "DEFAULT");
                WriteAdditional(tr, "ixml.WILD_TRACK", (item.tkNote.Contains("wild")) ? "TRUE" : "FALSE");
                // Description and Title are the two fields that ended up as question marks
                tr.Description = item.tkNote;
                tr.Title = item.shtNote;
                tr.Save();
            }
        }
    }

    // Sets an additional field, adding it first if it does not exist yet
    void WriteAdditional(Track tr, string tag, string content)
    {
        if (tr.AdditionalFields.ContainsKey(tag)) tr.AdditionalFields[tag] = content;
        else tr.AdditionalFields.Add(tag, content);
    }

The garbled text appeared in ixml.NOTE, and the question marks in the description and title.

drunkenQCat commented 1 year ago

I tried modifying the source so that it can write the UTF-8 information I need.

#203

That fixed it. The picture I showed in Details is a problem with Wave Agent. The UTF-8 information is displayed correctly in metadata management software. Here is an example in REAPER:

The title is still garbled in File Explorer because my system's default encoding is GB2312.

That's the problem. I read CharsetDetector/UTF-unknown#143 and gathered that it may be the cause. So is it because Settings.DefaultTextEncoding does not cover the other fields?

Zeugma440 commented 1 year ago

> I tried to decipher these garbled codes and found that they were encoded by ISO-8859-1 but decoded by utf8

The places where you found garbled text are read and written using ISO-8859-1, which cannot represent East Asian characters.

I did that because of what the specifications say:

BEXT (used for the description field): The specification says string fields should be written using ASCII. However, since ASCII is a subset of UTF-8, we can switch to UTF-8 without any issue 👍
LIST INFO (used for the title field): The specification says string fields should be written using ASCII. However, since ASCII is a subset of UTF-8, we can switch to UTF-8 without any issue 👍
Other fields you've written use the iXML structure, which is already UTF-8-encoded 😄
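The subset argument can be checked with a small sketch like the one below. It relies only on the standard System.Text encodings; the helper name is invented for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper, for illustration only (not part of atldotnet).
public static class AsciiSubsetDemo
{
    // True when the text produces identical bytes in ASCII, ISO-8859-1 and UTF-8,
    // which is the case for pure ASCII text.
    public static bool SameBytesInAllThree(string text)
    {
        byte[] ascii = Encoding.ASCII.GetBytes(text);
        byte[] latin1 = Encoding.Latin1.GetBytes(text);
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        return ascii.SequenceEqual(latin1) && ascii.SequenceEqual(utf8);
    }
}
```

Plain ASCII text such as "Take 1" yields the same bytes in all three encodings, so existing ASCII-only files stay readable after a switch to UTF-8. Text containing Chinese characters does not, which is where the choice of encoding starts to matter.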

> the title is still random code in File Explorer because the default encoder of my system is GB2312.

Precisely. Western versions of Windows use ISO-8859-1 as their default encoding. They assume WAV metadata is encoded using ISO-8859-1, which works because WAV metadata is usually encoded using ASCII, which is a subset of ISO-8859-1.

Your version of Windows might be expecting GB2312, which is not compatible with UTF-8, hence the garbled characters displayed in Explorer.

=> Another way of fixing that issue and making Windows happy would be to use Settings.DefaultTextEncoding instead of UTF-8 in the library code, and to set Settings.DefaultTextEncoding to System.Text.Encoding.GetEncoding("GB2312") in your application code. That would fix the issue on your Windows, but would completely deviate from the BEXT and LIST INFO specifications and make the text you save unreadable on a western computer. That's why I'd rather hardcode UTF-8 as suggested above.

Do you agree with me on that one?
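To illustrate why GB2312 and UTF-8 do not mix, here is a rough sketch using only standard .NET calls. One assumption worth stating: on .NET Core / .NET 5+ the GB2312 code page is not available by default, so the code pages provider has to be registered before calling GetEncoding. The class and method names are invented for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only (not part of atldotnet).
public static class Gb2312Demo
{
    public static Encoding GetGb2312()
    {
        // Without this registration, GetEncoding("GB2312") throws on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("GB2312");
    }

    // Simulates a GB2312 reader opening text that was saved as UTF-8.
    public static bool Utf8ReadsCorrectlyAsGb2312(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return GetGb2312().GetString(utf8Bytes) == text;
    }
}
```

ASCII-only text survives the mismatch, but Chinese text does not: the same characters are stored as different byte sequences in the two encodings, so a reader expecting one cannot display the other.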

> I read https://github.com/CharsetDetector/UTF-unknown/issues/143 and learn that it maybe the problem caused by this.

This has nothing to do with WAV files. UTF-unknown is only used by the library to detect the encoding of CUE sheets.

drunkenQCat commented 1 year ago

Thanks for your detailed explanation, it answered a lot of my questions. I have to apologize for my ambiguous description. I totally agree with your answer: the garbled text in Windows Explorer doesn't really matter in sound production, and I have felt the benefit of UTF-8, especially when I cooperate with others whose OS is macOS.

Besides, I finally found that the most important bug:

> all the Chinese metadata written in my bext turned into question marks, which in binary 3F

is actually caused by

WavHelper.writeFixedTextValue(description, 256, w);

which uses Latin1Encoding as the encoder for the UTF-8 text. I inferred that Latin1Encoding.GetBytes(utf8Text) may return 3F (question mark) for characters that are out of range.

I verified the problem: it is actually caused by GetBytes.
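A minimal way to reproduce this outside the library, using only standard .NET calls (the class and method names are invented for illustration):

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only (not part of atldotnet).
public static class Latin1FallbackDemo
{
    // Same encoding, but configured to throw instead of substituting '?'.
    private static readonly Encoding StrictLatin1 = Encoding.GetEncoding(
        "iso-8859-1", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

    // Default Latin-1 encoder: characters it cannot represent are silently
    // replaced by '?' (byte 0x3F).
    public static byte[] EncodeLossy(string text) => Encoding.Latin1.GetBytes(text);

    // Reports whether the text can be encoded as ISO-8859-1 without loss.
    public static bool FitsInLatin1(string text)
    {
        try
        {
            StrictLatin1.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }
}
```

With the default encoder, each Chinese character comes out as a single 0x3F byte, which matches the question marks seen in the bext chunk. The strict variant is one possible way to detect the loss before writing instead of discovering it afterwards.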

Zeugma440 commented 1 year ago

Perfect, thanks for confirming 👍

I'm gonna publish a fixed version in the coming days. Stay tuned~

Zeugma440 commented 1 year ago

The fix is available in today's v4.34

Zeugma440 / atldotnet

The Problem about ISO-8859-1 #202
