JamesHeinrich / getID3

http://www.getid3.org/

WAV file problem combining non-latin RIFF and ID3v2 tags #338

Closed paulijar closed 3 years ago

paulijar commented 3 years ago

I ran into the following problem when testing WAV files tagged with the Mp3tag application. It seemed to happen only when the tags contained mixed Latin and non-Latin scripts, although I didn't test this very extensively.

When tagging with non-Latin characters, Mp3tag writes the UTF-8-encoded data to the ID3v2.3 tags and a "substitute string" to the RIFF header. In the substitute string, all non-Latin characters are replaced with ? characters. Now, when getID3 combines the different kinds of tags with CopyTagsToComments, it cannot merge these RIFF tags and ID3v2.3 tags properly. Instead, the [comments] section of the result contains both versions of the tags and, what's worse, the RIFF tag with all those ? characters comes first.

Meanwhile, if the same tag contents are saved to an mp3 file, the strategy used by Mp3tag is pretty much the same: the UTF-8 data goes to ID3v2.3 and the corresponding substitute string goes to ID3v1. But in this case, getID3 is smart enough to merge the tags so that the [comments] field contains only the UTF-8-encoded data.
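The substitute-string relationship described above can be sketched in a few lines of Python. This is only an illustration of the idea, not getID3's actual code (getID3 is written in PHP): the function name `is_substitute` is hypothetical, and the assumption that each unencodable character maps to exactly one `?` is a guess at Mp3tag's behaviour.

```python
def is_substitute(lossy: str, full: str) -> bool:
    """Return True if `lossy` looks like `full` with every character
    outside ISO-8859-1 replaced by a single '?'.

    Assumption: one '?' per replaced code point. A tagger that works on
    UTF-16 code units could emit two '?' for a single astral character.
    """
    if len(lossy) != len(full):
        return False
    for lossy_ch, full_ch in zip(lossy, full):
        if lossy_ch == full_ch:
            continue
        # A mismatch is tolerated only where '?' stands in for a
        # character that ISO-8859-1 cannot represent.
        if lossy_ch == "?" and ord(full_ch) > 0xFF:
            continue
        return False
    return True


print(is_substitute("??? Night", "永夜抄 Night"))  # True
print(is_substitute("Other title", "永夜抄 Night"))  # False
```

A merge step could use such a check to drop the lossy value when a fuller value for the same field is already present.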

I have uploaded a pair of sample files here, one wav and one mp3, both defining the same title/album/artist tags: https://drive.google.com/drive/folders/1qevkYHRrmPvN5lFYaaxgJVOfF9WK4e5h?usp=sharing

Here are the corresponding analyze results after CopyTagsToComments: analyze_results_mp3.txt analyze_results_wav.txt

This was detected on getID3 version 1.9.20-202107131440.

JamesHeinrich commented 3 years ago

Should be fixed in https://github.com/JamesHeinrich/getID3/commit/5f6d2ace45d90d3c2a1ed7e0ed9e4bb3e9857bbc Thanks for the sample files.

paulijar commented 3 years ago

Thanks for the quick response and action. However, I can still find some residual cases. One of them is a real-life music file from one of my users, which has the title 永夜抄 ~ Eastern Night. Here, the problematic part seems to be the full-width space following the Japanese kanji characters as well as the full-width tilde next to it. When Mp3tag transliterates these to an ISO-8859-1-compatible format, it converts them into a normal space and a normal tilde, and apparently getID3 only looks for characters replaced with ?. I'm sure many other punctuation characters also have full-width variants that are used in Chinese and Japanese.
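To make the full-width case concrete, here is a rough Python sketch of a transliteration that folds full-width characters to their ASCII counterparts before falling back to `?`. It uses Unicode NFKC compatibility folding from the standard `unicodedata` module as a stand-in. Whether Mp3tag does anything like this internally is an assumption, and `to_latin1_substitute` is a hypothetical name. The sample title is written with explicit escapes for U+3000 (ideographic space) and U+FF5E (full-width tilde), which is also an assumption about which code points the real file contains.

```python
import unicodedata


def to_latin1_substitute(text: str) -> str:
    """Approximate a lossy ISO-8859-1 rendering of `text`.

    Characters already in ISO-8859-1 are kept. Other characters are
    replaced by their NFKC compatibility form when that form fits in
    ISO-8859-1 (e.g. full-width punctuation), otherwise by '?'.
    """
    out = []
    for ch in text:
        if ord(ch) <= 0xFF:
            out.append(ch)
            continue
        folded = unicodedata.normalize("NFKC", ch)
        if all(ord(c) <= 0xFF for c in folded):
            out.append(folded)
        else:
            out.append("?")
    return "".join(out)


title = "永夜抄\u3000\uff5e Eastern Night"
print(to_latin1_substitute(title))  # ??? ~ Eastern Night
```

Comparing a RIFF or ID3v1 value against `to_latin1_substitute()` of the ID3v2 value would catch this title, where a check that only looks for `?` replacements would not.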

I have uploaded new files non-latin2.wav and non-latin2.mp3 to the previous share to demonstrate this full-width character issue. As can be seen, the problem is present in both wav and mp3. However, with mp3 it is not really a problem for my app, because the ID3v1 tags are placed last and I'm only reading the first tag of each kind. Would it be possible to move the RIFF tags to be handled after ID3v2 in a similar manner?

JamesHeinrich commented 3 years ago

Good idea. ID3v1 and RIFF tags are now both processed after any other tag types (if present) so the first entry in comments is more likely to be accurate. https://github.com/JamesHeinrich/getID3/commit/4e02ed09c081a606c734c6b27d1b504fecfe402f
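The ordering idea can be illustrated with a small Python sketch. It is not the code from the commit above: `LOSSY_FORMATS` and `merge_comments` are hypothetical names, and the input layout (tag format, then field name, then list of values) is a simplified assumption about the parsed tag data.

```python
# Tag formats that may only hold a lossy rendering of the text.
LOSSY_FORMATS = ("id3v1", "riff")


def merge_comments(tags):
    """Merge per-format tag values into one dict of lists, visiting
    the lossy formats last so that richer values come first."""
    ordered = [f for f in tags if f not in LOSSY_FORMATS]
    ordered += [f for f in tags if f in LOSSY_FORMATS]

    comments = {}
    for fmt in ordered:
        for key, values in tags[fmt].items():
            bucket = comments.setdefault(key, [])
            for value in values:
                if value not in bucket:  # skip exact duplicates
                    bucket.append(value)
    return comments


tags = {
    "riff": {"title": ["??? ~ Eastern Night"]},
    "id3v2": {"title": ["永夜抄\u3000\uff5e Eastern Night"]},
}
print(merge_comments(tags)["title"][0] == tags["id3v2"]["title"][0])  # True
```

With this ordering, a caller that reads only the first value of each field gets the ID3v2 text even when the lossy duplicate cannot be detected and removed.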

paulijar commented 3 years ago

Thanks, works fine for my use cases now.