Mach5 / supersonic

Open-source web-based media streamer and jukebox fork of Subsonic. Supports MP3, OGG, AAC and other streamable audio and video formats. Runs on Windows, GNU/Linux and Mac using Java.
243 stars 60 forks source link

Wrong display for non-ASCII tags #54

Open sciurius opened 12 years ago

sciurius commented 12 years ago

In my collection, I have an album from Sinéad O'Connor.

In the recently added (and recently played, ...) views, this displays as Sinéad O'Connor. When the album itself is selected, the artist name displays okay.

timoreimann commented 12 years ago

Are you referring to the Android app, the web frontend, or both?

porkcharsui commented 12 years ago

I am seeing something very similar in my collection with Sigur Rós which appears as Sigur Rós in both the web and Android clients. The song result itself does not show this error, just the directory name and artist search results, leading me to believe this is isolated to portions of the backend's use of character encoding.

timoreimann commented 12 years ago

portcharsui's assumption seems correct: I tested things myself by creating an album which contains a ó character in both album and artist file and tag names, respectively. Result: Everything looks fine on both the web and Android client.

My Linux box is using UTF-8 for all locale settings, including LANG and LC_ALL. Could anyone affected by the issue please check his locale and see if switching to UTF-8 (possibly by using one of these methods) helps? (Don't forget to re-encode the ID3 portion too. I use eyeD3 for that.)

timoreimann commented 12 years ago

Closing the issue as it seems fixed. Please report back if the problem persists in a reproducible way independent of a correctly configured locale environment.

sciurius commented 12 years ago

http://www.squirrel.nl/pub/xfer/uploads/3CAcxyTsu3l9UgXeWkLkIoyQ.png

This is the current (as of yesterday) git version.

ID3 data for this track:

id3v1 tag info for 01_4th_And_Vine.mp3: Title : 4th And Vine Artist: Sinéad O'Connor
Album : How About I Be Me, And You Be Year: , Genre: Unknown (255) Comment: Track: 1 id3v2 tag info for 01_4th_And_Vine.mp3: TALB (Album/Movie/Show title): How About I Be Me, And You Be You TCON (Content type): (255) (255) TIT2 (Title/songname/content description): 4th And Vine TPE1 (Lead performer(s)/Soloist(s)): Sinéad O'Connor TRCK (Track number/Position in set): 1

timoreimann commented 12 years ago

@sciurius, what OS are you running? If Linux, can you provide the output of the locale command?

sciurius commented 12 years ago

I think Fedora17 qualifies as Linux ;)

% cat /etc/redhat-release Fedora release 17 (Beefy Miracle) % uname -a Linux phoenix.squirrel.nl 3.5.0-2.fc17.x86_64 #1 SMP Mon Jul 30 14:48:59 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux % locale LANG=en_US.utf-8 LC_CTYPE="en_US.utf-8" LC_NUMERIC="en_US.utf-8" LC_TIME=en_GB LC_COLLATE="en_US.utf-8" LC_MONETARY="en_US.utf-8" LC_MESSAGES="en_US.utf-8" LC_PAPER=nl_NL LC_NAME="en_US.utf-8" LC_ADDRESS="en_US.utf-8" LC_TELEPHONE="en_US.utf-8" LC_MEASUREMENT="en_US.utf-8" LC_IDENTIFICATION="en_US.utf-8" LC_ALL=

timoreimann commented 12 years ago

Adding one such affected MP3 file to my Supersonic instance, I can confirm that the artist's name looks borked. However, when I edit the MP3 and re-set the artist's name (by means of copying it off Wikipedia and pasting it into the artist field using an MP3 tag editor on my Mac), it looks perfectly right afterwards in Supersonic: All non-ASCII characters are rendered properly.

So, my conclusion is that it is not a Supersonic issue. Apparently, an invalid tag encoding has been applied in the first place.

sciurius, can you try to reproduce what I did and see if it confirms my findings?

sciurius commented 12 years ago

Adding one such affected MP3 file to my Supersonic instance, I can confirm that the artist's name looks borked. However, when I edit the MP3 and re-set the artist's name (by means of copying it off Wikipedia and pasting it into the artist field using an MP3 tag editor on my Mac), it looks perfectly right afterwards in Supersonic: All non-ASCII characters are rendered properly.

Yes, it is relatively easy to avoid this from happening. But this does not make the problem (= a bug in the Su*Sonic code) go away. Only the symptoms disappear.

So, my conclusion is that it is not a Supersonic issue. Apparently, an invalid tag encoding has been applied in the first place.

Even if this were the case, the same tag is rendered differently in two places. That indicates a coding error somewhere.

E.g., when I search for "o'connor" I get the following results:

Play Add Download Sinéad O'Connor Play Add Download Tony O'Connor Play Add Download Sinéad O'Connor Play Add Download Yo-Yo Ma, Edgar Meyer, Mark O'Connor

So there are two distinct but visually identical results for Sinéad O'Connor. The first result has 6 CD's, the second result has 3 CD's.

The same tracks (files) are processed by the Logitech Media Server (a.k.a. SqueezeBoxServer) and this server has no problems with the tags.

timoreimann commented 12 years ago

The same tag isn't rendered differently in Supersonic. Instead, Supersonic uses the directory name for the album name in the breadcrumb navigation, and the tag information for the playlist content. That's why the same label is shown properly one time and improperly a second time. So it's not a bug in Supersonic, just different approaches to achieve the same thing. (Admittedly, one can argue that the implementation should be more consistent.)

I'm not too familiar with Squeeze Box so I cannot tell why (and how) it does better. Maybe it reads the artist's name from another well-encoded field (like album artist) and prefers that one over the broken one. Or it uses the file name. Or, if encoding schemas are stored within MP3 tags along the payload (which I need to figure out), it may decode the content accordingly. I'll need to dig somewhat into the MP3 specs to tell what the best approach for Supersonic would be.

sciurius commented 12 years ago

Timo Reimann notifications@github.com writes:

The same tag isn't rendered differently in Supersonic. Instead, Supersonic uses the directory name for the album name in the breadcrumb navigation, and the tag information for the playlist content.

Ah! This explains some of the things I see ...

Personally, I'd advice against using directory names for this purpose, since file/directory names are essentially binary blobs without specified encoding. Interpretation depends on how the file system was mounted, which is hard to take into account.

I'm not too familiar with Squeeze Box so I cannot tell why (and how) it does better. Maybe it reads the artist's name from another well-encoded field (like album artist) and prefers that one over the broken one. Or it uses the file name.

I moved the offending album to another directory and requested a scan from both SuperSonic and SqueezeboxServer.

For SqueezeboxServer it doesn't make any difference. Apparently it uses the MP3 tags only.

For Su*Sonic, the album does, indeed, appear under a new (artist) name.

Still, while SqueezeboxServer and other tools(!) derive the correct information from the album tracks, Su*Sonic is the only one that has a problem.

Also still, it doesn't explain why Sinéad O'Connor is mentioned twice, with the albums distributed over both entries.

BTW: The current Su*Sonic approach also means that collection albums with different artists do not appear with the artist names. SqueezeboxServer does this right.

-- Johan

timoreimann commented 12 years ago

I agree with what you say regarding not to use directory names for information display. To my defense, I didn't implement it this way, and wouldn't do so. :) I'll open a separate issue later to deal with this. In general, Su*sonic lacks capability to fully leverage tag data which is a shortcoming IMHO.

I took just a rough look at the MP3 specification: Apparently, it does include encoding information, so it should be possible to decode tag data appropriately (unless, of course, for those cases where the encoding is completely broken). We'll use this issue to indicate progress on the problem.

timoreimann commented 12 years ago

I dug a little deeper into the Supersonic code: Apparently, it uses the Jaudiotagger library (in a fairly recent version) to decode audio tag data. One interesting bug report (JAUDIOTAGGER-179) deals with the request to allow proper decoding of MP3 tags in local encodings (like code pages). The Jaudiotagger author clearly states that anything but ISO-8859-1, variants of UTF-16, and UTF-8 (depending on the exact MP3v2 version being used) is illegal and won't be supported.

It looks like auto-detecting a local encoding is not possible (at least not in a standard-compliant fashion) whereas the supported encodings identify themselves by means of a marking encoding byte (see Wikipedia). I haven't looked into the Jaudiotagger code but strongly assume that decoding standard-compliant encodings works, so the observed behavior is most likely due to one such illegal local encoding. While the Jaudiotagger author announced to add an option to specify the (local) encoding used when decoding an MP3 tag, it wouldn't be a great solution IMHO because it'd only work for people willing and capable of specifying the employed encoding. Re-writing tag data in a supported encoding seems to be the way to go.

So much to what I figured about Jaudiotagger. Next, I'll try to find out what SqueezeBox Server does.

sciurius commented 12 years ago

Timo Reimann notifications@github.com writes:

One interesting bug report deals with the request to allow proper decoding of MP3 tags in local encodings (like code pages). The Jaudiotagger author clearly states that anything but ISO-8859-1, variants of UTF-16, and UTF-8 (depending on the exact MP3v2 version being used) is illegal and won't be supported.

One of the questions that keeps bugging me is what is wrong with the offending mp3 files. All tools that I've used so far return consistent and correct (i.e., as desired) information. AFAIK all tags are UTF8 encoded and no local encodings are used.

Does Jaudiotagger have a tool to display the 'raw' tag data?

-- Johan

timoreimann commented 12 years ago

Jaudiotagger does not seem to provide any tools. However, putting together a piece of code to read out the raw tag data should be a matter of a couple lines only; the Jaudiotagger code examples may be helpful with that.

Note that UTF-8 is allowed in ID3v2.4 only. It's probably safer to use UTF-16 with ID3v2.3.

timoreimann commented 12 years ago

In order to take a closer look at the ID3 data, I build a simple program using Jaudiotagger 2.0.4. It looks like this:

import org.jaudiotagger.audio.AudioFileIO;
import org.jaudiotagger.audio.mp3.MP3File;
import org.jaudiotagger.tag.FieldKey;
import org.jaudiotagger.tag.Tag;
import org.jaudiotagger.tag.id3.*;

import java.io.File;
import java.nio.charset.Charset;

public class ParseMp3 {
    private static String mp3File = "01_4th_And_Vine.mp3";

    public static void main(String[] args) {
        File file = new File(ParseMp3.mp3File);
        MP3File mp3File = null;
        try {
            mp3File = (MP3File)AudioFileIO.read(file);
        } catch (Exception e) {
            e.printStackTrace();
        }

        Tag tag = mp3File.getTag();
        System.out.println("artist: " + tag.getFirst(FieldKey.ARTIST));

        ID3v1Tag id3v1Tag = mp3File.getID3v1Tag();
        System.out.println("ID3v1 artist: " + id3v1Tag.getFirst(FieldKey.ARTIST));

        AbstractID3v2Tag id3v23Tag = mp3File.getID3v2Tag();
        System.out.println("ID3v2.3 artist: " + id3v23Tag.getFirst(ID3v23Frames.FRAME_ID_V3_ARTIST));
        AbstractID3v2Frame id3v23Frame = id3v23Tag.getFirstField(ID3v23Frames.FRAME_ID_V3_ARTIST);
        System.out.println("ID3v2.3 encoding: " + id3v23Frame.getEncoding());

        ID3v24Tag id3v24Tag = mp3File.getID3v2TagAsv24();
        System.out.println("ID3v2.4 artist: " + id3v24Tag.getFirst(ID3v24Frames.FRAME_ID_ARTIST));
        AbstractID3v2Frame id3v24Frame = id3v24Tag.getFirstField(ID3v24Frames.FRAME_ID_ARTIST);
        System.out.println("ID3v2.4 encoding: " + id3v24Frame.getEncoding());

        byte[] rawData = id3v23Frame.getRawContent();
        System.out.println("analyzing " + rawData.length + " bytes of raw ID3v2.3 artist data:");
        String isoText = new String(rawData, Charset.forName("ISO-8859-1"));
        String utfText = new String(rawData, Charset.forName("UTF-8"));
        for (int i = 0; i < rawData.length; ++i) {
            char singleByte = (char)rawData[i];
            char utfChar = '\0';
            try {
                utfChar = utfText.charAt(i);
            } catch (StringIndexOutOfBoundsException exception) {
            }
            System.out.println("ID3v2 artist raw byte #" + String.format("%02d", i+1) + ": " +
                    String.format("%5d", (int)singleByte) + "\t\thex: " +
                    String.format("%8s", Integer.toHexString(singleByte)) + "\t\tISO: " +
                    isoText.charAt(i) + "\t\tUTF: " + utfChar);
        }
    }
}

The program's output is this:

artist: Sinéad O'Connor
ID3v1 artist: Sinéad O'Connor
ID3v2.3 artist: Sinéad O'Connor
ID3v2.3 encoding: ISO-8859-1
ID3v2.4 artist: Sinéad O'Connor
ID3v2.4 encoding: ISO-8859-1
analyzing 27 bytes of raw ID3v2.3 artist data:
ID3v2 artist raw byte #01:    84        hex:       54       ISO: T      UTF: T
ID3v2 artist raw byte #02:    80        hex:       50       ISO: P      UTF: P
ID3v2 artist raw byte #03:    69        hex:       45       ISO: E      UTF: E
ID3v2 artist raw byte #04:    49        hex:       31       ISO: 1      UTF: 1
ID3v2 artist raw byte #05:     0        hex:        0       ISO:        UTF: 
ID3v2 artist raw byte #06:     0        hex:        0       ISO:        UTF: 
ID3v2 artist raw byte #07:     0        hex:        0       ISO:        UTF: 
ID3v2 artist raw byte #08:    17        hex:       11       ISO:       UTF: 
ID3v2 artist raw byte #09:     0        hex:        0       ISO:        UTF: 
ID3v2 artist raw byte #10:     0        hex:        0       ISO:        UTF: 
ID3v2 artist raw byte #11:     0        hex:        0       ISO:        UTF: 
ID3v2 artist raw byte #12:    83        hex:       53       ISO: S      UTF: S
ID3v2 artist raw byte #13:   105        hex:       69       ISO: i      UTF: i
ID3v2 artist raw byte #14:   110        hex:       6e       ISO: n      UTF: n
ID3v2 artist raw byte #15: 65475        hex:     ffc3       ISO: Ã     UTF: é
ID3v2 artist raw byte #16: 65449        hex:     ffa9       ISO: ©     UTF: a
ID3v2 artist raw byte #17:    97        hex:       61       ISO: a      UTF: d
ID3v2 artist raw byte #18:   100        hex:       64       ISO: d      UTF:  
ID3v2 artist raw byte #19:    32        hex:       20       ISO:        UTF: O
ID3v2 artist raw byte #20:    79        hex:       4f       ISO: O      UTF: '
ID3v2 artist raw byte #21:    39        hex:       27       ISO: '      UTF: C
ID3v2 artist raw byte #22:    67        hex:       43       ISO: C      UTF: o
ID3v2 artist raw byte #23:   111        hex:       6f       ISO: o      UTF: n
ID3v2 artist raw byte #24:   110        hex:       6e       ISO: n      UTF: n
ID3v2 artist raw byte #25:   110        hex:       6e       ISO: n      UTF: o
ID3v2 artist raw byte #26:   111        hex:       6f       ISO: o      UTF: r
ID3v2 artist raw byte #27:   114        hex:       72       ISO: r      UTF: 

Process finished with exit code 0

(I parsed ID3v2.3 data in the example but the results are exactly the same with 2.4.)

According to the ID3v2 frame specification, the bytes outlined above carry the following semantics:

As you can see, the artist name was supposedly stored in ISO-8859-1 (byte 00 after the frame size). Taking a closer look at bytes 15 and 16, however, one can observe that in fact UTF-8 was written to the ID3 field: The é character takes up two bytes in UTF-8 which, if mistakenly interpreted as ISO-8859-1, leads to two distinct characters à and ©. (This seems to be a common symptom for UTF-8/ISO-8859-1-related issues.) Another indication for the lack of proper UTF-8 encoding is the fact that the BOM (either FFFE or FEFF) is missing.

So Jaudiotagger seems to be doing everything right. I can't tell for sure whether those other ID3-reading applications you are using are simply lucky to, say, choose your machine's locale setting for decoding, or sophisticated enough to try some kind of encoding detection. While the latter still seems desirable to have in Supersonic for certain cases, I believe that in this one the ID3 encoding is indeed broken.

HTH,

--Timo

sciurius commented 12 years ago

Timo Reimann notifications@github.com writes:

As you can see, the artist name was supposedly stored in ISO-8859-1 (byte 00 after the frame size). Taking a closer look at bytes 15 and 16, however, one can observe that in fact UTF-8 was written to the ID3 field:

Ha! Now I finally know exactly what is wrong. I already made a lot of guesses and assumptions but your data make it all clear.

Most tools that I used apparently assume it's utf8 unless it contains erroneous characters, and then fall back to Latin-1.

Thanks! Now I can start working on a goot solution (which may involve re-tagging all my mp3 files...).

Thanks again.

-- Johan

timoreimann commented 12 years ago

I was curious and gave juniversalchardet a try, a Java port of Mozilla's Universal Charset Detection implementation. Although the project looks a little bit neglected it managed to detect UTF-8 in the malformed files correctly. Since there are so many illegal music files out there, I'd like to incorporate the library in a Supersonic branch and see how it works.

For the sake of completeness, I also tried out another Java alternative called jChardet. It failed to detect UTF-8, however.