ID3v2 text frames with encoding type 1 but no BOM

wader commented 8 years ago

Hi, I have some MP3 files (BBC radio podcasts) that have USLT frames with encoding type 1 but no BOM. The string data is UTF-16 little-endian.

00000100  79 6e 65 73 2e 55 53 4c  54 00 00 01 84 00 00 01  |ynes.USLT.......|
                                                        ^- encoding 1
00000110  65 6e 67 44 00 65 00 73  00 63 00 72 00 69 00 70  |engD.e.s.c.r.i.p|
                   ^- no BOM :(
00000120  00 74 00 69 00 6f 00 6e  00 00 00 41 00 6d 00 65  |.t.i.o.n...A.m.e|
00000130  00 72 00 69 00 63 00 61  00 6e 00 20 00 73 00 61  |.r.i.c.a.n. .s.a|
00000140  00 74 00 69 00 72 00 69  00 73 00 74 00 20 00 4a  |.t.i.r.i.s.t. .J|
00000150  00 6f 00 65 00 20 00 51  00 75 00 65 00 65 00 6e  |.o.e. .Q.u.e.e.n|
00000160  00 61 00 6e 00 20 00 63  00 68 00 61 00 72 00 74  |.a.n. .c.h.a.r.t|
00000170  00 73 00 20 00 74 00 68  00 65 00 20 00 72 00 69  |.s. .t.h.e. .r.i|
00000180  00 73 00 65 00 20 00 61  00 6e 00 64 00 20 00 66  |.s.e. .a.n.d. .f|
00000190  00 61 00 6c 00 6c 00 20  00 6f 00 66 00 20 00 74  |.a.l.l. .o.f. .t|
000001a0  00 68 00 65 00 20 00 6e  00 75 00 64 00 67 00 65  |.h.e. .n.u.d.g.e|
000001b0  00 20 00 6e 00 75 00 64  00 67 00 65 00 20 00 77  |. .n.u.d.g.e. .w|
000001c0  00 69 00 6e 00 6b 00 20  00 77 00 69 00 6e 00 6b  |.i.n.k. .w.i.n.k|
000001d0  00 20 00 65 00 70 00 69  00 64 00 65 00 6d 00 69  |. .e.p.i.d.e.m.i|
000001e0  00 63 00 2c 00 20 00 77  00 69 00 74 00 68 00 20  |.c.,. .w.i.t.h. |
000001f0  00 68 00 65 00 6c 00 70  00 20 00 66 00 72 00 6f  |.h.e.l.p. .f.r.o|
00000200  00 6d 00 20 00 49 00 61  00 6e 00 20 00 48 00 69  |.m. .I.a.n. .H.i|
00000210  00 73 00 6c 00 6f 00 70  00 2c 00 20 00 4a 00 6f  |.s.l.o.p.,. .J.o|
00000220  00 68 00 6e 00 20 00 53  00 65 00 72 00 67 00 65  |.h.n. .S.e.r.g.e|
00000230  00 61 00 6e 00 74 00 2c  00 20 00 4b 00 61 00 74  |.a.n.t.,. .K.a.t|
00000240  00 68 00 79 00 20 00 4c  00 65 00 74 00 74 00 65  |.h.y. .L.e.t.t.e|
00000250  00 2c 00 20 00 42 00 61  00 72 00 72 00 79 00 20  |.,. .B.a.r.r.y. |
00000260  00 43 00 72 00 79 00 65  00 72 00 20 00 61 00 6e  |.C.r.y.e.r. .a.n|
00000270  00 64 00 20 00 4e 00 61  00 74 00 61 00 6c 00 69  |.d. .N.a.t.a.l.i|
00000280  00 65 00 20 00 48 00 61  00 79 00 6e 00 65 00 73  |.e. .H.a.y.n.e.s|
00000290  00 2e 00 54 43 4f 4e 00  00 00 08 00 00 00 50 6f  |...TCON.......Po|

Do you think it's worth even trying to do some kind of fallback?
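For reference, the condition in the dump above can be checked mechanically. This is a minimal sketch, not code from the tag library: `splitUSLT` and `bomKind` are hypothetical helper names, and the sample bytes are the first few bytes of the frame body shown in the dump (encoding byte, language, then the start of the text).

```go
package main

import (
	"errors"
	"fmt"
)

// bomKind reports which UTF-16 byte order mark, if any, starts b.
// (Hypothetical helper for illustration only.)
func bomKind(b []byte) string {
	if len(b) >= 2 {
		switch {
		case b[0] == 0xFF && b[1] == 0xFE:
			return "little-endian"
		case b[0] == 0xFE && b[1] == 0xFF:
			return "big-endian"
		}
	}
	return "none"
}

// splitUSLT splits a USLT frame body into its encoding byte, its
// three-byte language code, and the remaining (still encoded) text.
func splitUSLT(body []byte) (enc byte, lang string, text []byte, err error) {
	if len(body) < 4 {
		return 0, "", nil, errors.New("USLT body too short")
	}
	return body[0], string(body[1:4]), body[4:], nil
}

func main() {
	// Encoding 0x01, "eng", then 44 00 65 00 ("De" in little-endian UTF-16).
	body := []byte{0x01, 'e', 'n', 'g', 0x44, 0x00, 0x65, 0x00}
	enc, lang, text, err := splitUSLT(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("encoding=%d lang=%s bom=%s\n", enc, lang, bomKind(text))
}
```

With the bytes from the dump, the encoding byte is 1 but the text starts with `44 00` rather than `ff fe` or `fe ff`, so no BOM is found.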

dhowden commented 8 years ago

Ah yes, that's annoying. Although the standards are remarkably vague, it's very frustrating that some implementations chose to ignore the more well-defined parts!

My only hesitation here would be that we should really have some "config" to decide the default byte order when no mark is found. I do have an idea for a patch, so I'll put it in a separate branch and mention it here when done.

wader commented 8 years ago

Yeah hmm :( my guess is little-endian is most common... probably some UTF-16 buffer written out as-is on an x86 machine. Let me know when you push the branch.

Header from BBC podcast: r4choice_20131004-1530a.mp3.header.zip

dhowden commented 8 years ago

Thanks for the header data, made it much easier to test.

I've put my changes straight into master. Not hugely elegant, but it will fix your problem and is much better than just returning an error.

wader commented 8 years ago

Thanks, works great!

dhowden / tag

ID3v2 text frames with encoding type 1 but no BOM #18