NCrusher74 / SwiftTaggerID3

A Swift library for reading and editing ID3 tags
Apache License 2.0
5 stars 2 forks source link

Description/content strings not being written properly #6

Closed NCrusher74 closed 4 years ago

NCrusher74 commented 4 years ago

I'm having a hard time figuring out why the comment and lyrics description strings aren't being encoded properly.

But the description strings for userDefinedText and userDefinedURL frames, which use the same frame type and code, are.

Here's the encodeContents function.

    // encode the contents of the frame to add to an ID3 tag
    func encodeContents(version: Version) throws -> Data {
        var frameData = Data()
        // append encoding byte
        frameData.append(StringEncoding.preferred.rawValue)

        if self.layout == .known(.comments) ||
            self.layout == .known(.unsynchronizedLyrics) {
            // encode and append language string
            frameData.append(self.languageString?.encoded(withNullTermination: false) ?? "und".encoded(withNullTermination: false))
        }

        // encode and append description string
        frameData.append(self.descriptionString?.encoded(withNullTermination: true) ?? "".encoded(withNullTermination: true))

        // encode and append contents string
        frameData.append(self.contentString.encoded(withNullTermination: false))
        return frameData
    }

Here is what I see when the encoding is utf16WithBOM:

Screen Shot 2020-05-12 at 3 03 05 PM Screen Shot 2020-05-12 at 3 00 57 PM

The one with the red bars is from Kid3, and as you can see, it's not reading anything for those fields. However, the column to the left, where is says "Comment" repeatedly? That should contain the descriptionString for each individual comment frame. It should only say "Comment" there if the frame doesn't have a descriptionString.

For comparison, here is what it looks like for both the boilerplate and customized UserDefinedText frames:

Screen Shot 2020-05-12 at 3 08 47 PM

All those values in the left column are the descriptionString for that frame.

The one with the Han characters is MediaInfo.

If I change the encoding to isoLatin1, this is what I see instead:

Screen Shot 2020-05-12 at 3 01 54 PM Screen Shot 2020-05-12 at 3 02 22 PM

What you see there in the right column of the Kid3 image are the descriptionString, rather than the contentString. Same with MediaInfo. It's displaying the descriptionString but not the contentString.

Meanwhile, here is what Yate is showing:

Screen Shot 2020-05-12 at 3 12 12 PM Screen Shot 2020-05-12 at 3 12 35 PM

Again, this is encoded utf16WithBOM. When it's in isoLatin1 it's basically the same thing, just not in gibberish.

Screen Shot 2020-05-12 at 3 15 09 PM Screen Shot 2020-05-12 at 3 15 22 PM

What is happening here is that it's got nothing in the descriptionString field, and then it's putting both the descriptionString AND the contentString in the contentString field.

Compare to userDefinedText:

Screen Shot 2020-05-12 at 3 16 43 PM

It's got to be something either in the encodedContents function above, or in the set function, which is different for comments/lyrics frames than it is for the userDefinedText/URL frames, due to the presence or absence of a language string:

    internal mutating func set(_ layout: FrameLayoutIdentifier,
                      _ frameKey: FrameKey,
                      in language: String,
                      to description: String?,
                      with content: String) {
        let frame = LocalizedFrame(layout,
                                   languageString: language,
                                   descriptionString: description,
                                   contentString: content)
        self.frames[frameKey] = .localizedFrame(frame)
    }

    internal mutating func set(_ layout: FrameLayoutIdentifier,
                      _ frameKey: FrameKey,
                      to description: String?,
                      with content: String) {
        let frame = LocalizedFrame(
            layout, languageString: nil,
            descriptionString: description ?? "",
            contentString: content)
        self.frames[frameKey] = .localizedFrame(frame)
    }

I've can't spot where the difference is that is causing this to happen to comments/lyrics and not to userDefinedText/userDefinedWebpage.

SDGGiesbrecht commented 4 years ago

Again, I haven’t looked closely yet, but Han characters are often an indication that there is one byte too many or too few at the start of the string, or that the BOM is backwards (or being ignored). In UTF‐16, each character is two bytes. If you add or remove one from the front, then then it fissures the entire sequence of pairs. Many Latin letters turn into Han ideographs when you shift them like that.

NCrusher74 commented 4 years ago

I've still got the preferred encoding set to isoLatin1 for the moment, just because it makes it easier to see what I'm looking at when utf16WithBOM is coming out as gibberish or not at all in other apps, but here are the bytes for the Comment frame:

0 65 6e 67 0 -- eng
43 6f 6d 6d 65 6e 74 0 -- Comment
43 6f 6d 6d 65 6e 74 20 43 6f 6e 74 65 6e 74 0 -- Comment Content

So it looks like every string is getting a null terminator, when only the descriptionString should be terminated.

But they shouldn't be?

    // encode the contents of the frame to add to an ID3 tag
    func encodeContents(version: Version) throws -> Data {
        var frameData = Data()
        // append encoding byte
        frameData.append(StringEncoding.preferred.rawValue)

        if self.layout == .known(.comments) ||
            self.layout == .known(.unsynchronizedLyrics) {
            // encode and append language string
            frameData.append(self.languageString?.encoded(withNullTermination: false) ?? "und".encoded(withNullTermination: false))
        }

        // encode and append description string
        if let encodedDescription = self.descriptionString?.encoded(withNullTermination: true) {
            frameData.append(encodedDescription)
        }
        // encode and append contents string
        frameData.append(self.contentString.encoded(withNullTermination: false))
//        print(frameData.hexadecimal())
        return frameData
    }

Hrm.

NCrusher74 commented 4 years ago

Okay, I finally figured it out. The language string needed to not be encoded.

SDGGiesbrecht commented 4 years ago

Okay, I have time again. It sounds like you’ve already solved the two issues still sitting it my inbox though? One was the language string having a terminator it shouldn’t that caused characters to sheer in two and become Chinese. The other was the tag length being recorded with or without synchsafe backwards to what was needed, which caused the tag to be truncated early. Is there anything else you are still waiting for an answer about?

NCrusher74 commented 4 years ago

Yeah, that's all that is outstanding for the moment. After I figure out why the Genre tag test isn't passing for versions 2.2/2.3, even though it looks perfect when looking at it in other apps, I'm going to merge this branch and start a new one for tackling the Date frame, which is just going to be a blast, because I still get turned around on how the date formatting works, and it's different enough this time around from the last time we addressed date formatting that I'm not getting very far using what we did before as an example.

Thank you!

NCrusher74 commented 4 years ago

Okay, I think I may have bumped into another slight problem with extractPrefixAsStringUntilNullTermination. Or maybe it's confusing over how encoding for this frame should be handled, I'm not sure which.

There is an encoding byte, so I don't believe we're dealing with a situation where the strings are supposed to be ASCII.

What is happening is that, for versions 2.2 and 2.3, the second element in the string array (the freeform string, as opposed to the preset string) is being lost.

        // standard boilerplate, it's a text frame (of sorts) so strings are encoded
        var parsing = contents
//        print(parsing.hexadecimal()) -- See screenshot below
        let encoding = try PresetOptionsFrame.extractEncoding(data: &parsing, version: version)
        // initialize an empty array to store the parsed strings in
        var parsedArray: [String] = []

        // versions 2.2 and 2.3 handle strings differently than version 2.4.
        // 2.2 and 2.3 relies on parentheses, while 2.4 just uses null termination
        switch version {
            // v2.2 and v2.3 use parentheses to denote codes
            case .v2_2, .v2_3:
                // extract the full string
                let unparsedString = parsing.extractPrefixAsStringUntilNullTermination(encoding) ?? ""
//                print(unparsedString) - the "Genre Type" string is gone
                // parse out the parentheses and return the array of parsed strings
                parsedArray = PresetOptionsFrame.parseParentheticalString(unparsedString: unparsedString)
Screen Shot 2020-05-13 at 9 53 07 PM

So it looks like the string that says "Genre Type" is going missing after extractPrefixAsStringUntilNullTermination.

Looking at the screenshot, it looks like there may be a null terminator in there. But I don't understand why.

                for item in genreMediaOrFileInfo {
                    if item != "" {
                        if let itemEncoded = item?.encoded(withNullTermination: false) {
                            frameData.append(itemEncoded)
                        }
                    }
            }

Why is it getting a null terminator when it explicitly says it shouldn't?

Unfortunately, unlike my issue earlier with the language string in the LocalizedFrame I can't fix this by removing the encoding, because I know it's supposed to be encoded. Or at least, I assume so. The spec says:

(when discussing encoding):

Strings dependent on encoding is represented as

And then later when laying out the scheme for text frames:

<Header for 'Text information frame', ID: "T000" - "TZZZ",
 excluding "TXXX" described in 4.2.2.>
 Text encoding                $xx
 Information                  <text string according to encoding>

The genre frame isn't explicit about encoding, or lack thereof, but the description does refer to it as a number string:

The 'Content type', which previously was stored as a one byte numeric value only, is now a numeric string.

And the spec explicitly says that numeric strings should be ISO-8859-1:

All numeric strings and URLs [URL] are always encoded as ISO-8859-1.

Looking at our string encoding function, I don't see where the bool for null termination has any effect, unless I'm missing something?

    func encoded(withNullTermination: Bool) -> Data {
        let encoding = StringEncoding.preferred
        guard var result = data(using: encoding.standardLibraryEncoding) else {
            // This will never happen unless “preferred” is changed to something besides Unicode.
            fatalError("\(encoding) cannot encode “\(self)”.")
        }
        let null = Data(repeating: 0x00, count: encoding.sizeOfTermination)
        result.append(contentsOf: null)
        return result
    }

I've changed it to this:

    func encoded(withNullTermination: Bool) -> Data {
        let encoding = StringEncoding.preferred
        guard var result = data(using: encoding.standardLibraryEncoding) else {
            // This will never happen unless “preferred” is changed to something besides Unicode.
            fatalError("\(encoding) cannot encode “\(self)”.")
        }
        if withNullTermination == true {
            let null = Data(repeating: 0x00, count: encoding.sizeOfTermination)
            result.append(contentsOf: null)
        }
        return result
    }

But that is breaking things in a different way, I think, because now instead of the first string in the array being fine and the second string getting lost, instead they're both being returned as a single string. headdesk Which is probably because I removing the parentheses on the writing side, since no other apps seem to be willing to bother extrapolating from the genre codes to an actual genre.

SDGGiesbrecht commented 4 years ago

I don't see where the bool for null termination has any effect, unless I'm missing something?

Both your observation and the fix are correct with regards to that method.

because now instead of the first string in the array being fine and the second string getting lost, instead they're both being returned as a single string

That may not have been the problem. Maybe the null is supposed to be there between the strings. I say that because there is also a leading null in the little image you posted. Unless you copied and pasted too much, that image has three strings: "", null, "Audiobook", null, "Genre Type", null. Is it possible the first two strings were being read (one into the wrong field) and the third dropped? Then by removing the trailing null from the end of "Audiobook", you would have had two strings: "", null, "AudiobookGenre Type"? If so, maybe the extra leading null is landing there from whatever comes before it, and that is the culprit instead?

NCrusher74 commented 4 years ago

I'm pretty sure that leading null is the encoding byte, since I'm still using isoLatin1 as preferred at the moment.

Looking at the place where I printed that out, it's before I extracted the encoding byte:

        var parsing = contents
//        print(parsing.hexadecimal()) -- See screenshot below
        let encoding = try PresetOptionsFrame.extractEncoding(data: &parsing, version: version)

So, disregarding that first null, what we actually had was: Audiobook, null, Genre Type, null. Which was fine for version 2.4, since the strings are supposed to be null terminated there, but not for 2.2 and 2.3.

So, what is happening now is that "Audiobook,Genre Type, null" is being sent to my function to parse out the parentheses, which starts out by separating the components of a string based on the presence of an open parenthesis, but there's no parentheses to be parsed out so it's just treating it all as a single string.

NCrusher74 commented 4 years ago

Well, it took me the entirety of another night of effort, and a teeny bit more willful disregard for the specs, but I got the PresetOptionsFrame working on both the reading and writing (and reading back what I've written) side.

It's working. If I ever even think about tinkering with it again, remind me in none-too-gentle language that that way lies utter lunacy.

(In case you're wondering how I disregarded the specs, technically the FileType frame isn't supposed to be delimited by parentheses the way Genre and MediaType are, but without the parens, there's no way to tell what's a freeform string and what's one of the official types/refinements, so I put them in there anyway.)

Tomorrow: the date frame. That will be fun.

NCrusher74 commented 4 years ago

Merging and closing this branch!