RobloxAPI / spec

Specifications related to Roblox.
Creative Commons Attribution Share Alike 4.0 International

How would I convert sharedstrings from rbxl to rbxlx sharedstrings... #16

Open realrunnow opened 1 year ago

realrunnow commented 1 year ago

How would I convert sharedstrings from rbxl to rbxlx?

00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

As was stated, the hash is never used and is just filled with zeros. But is the SharedString system itself ever used, or is it just a compatibility thing? The string doesn't even contain anything.

From what I have gathered, these strings look exactly the same in any rbxlx. It now has a hash, and an empty string again?

<SharedStrings>
    <SharedString md5="yuZpQdnvvUBOTYh1jqZ2cA=="></SharedString>
</SharedStrings>

I have never seen the SharedString system in use. Knowing how it works would be very helpful when making an rbxl to rbxlx converter.

Anaminus commented 1 year ago

It's only the hash itself that isn't used. SharedStrings are used quite often. For example, by UnionOperations that haven't been uploaded yet.

For the binary format, the 0-based index of the SharedString in the SSTR chunk is used to select the string (i.e. a SharedString property with index 2 selects the third SharedString in the chunk).
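
To make the index rule concrete, here is a minimal Python sketch that reads the values out of an already-decompressed SSTR payload. The layout it assumes (little-endian uint32 version, uint32 count, then per entry a 16-byte hash followed by a uint32 length and that many bytes) is inferred from the dump in the first comment, so treat it as an assumption. `parse_sstr` and `build_entry` are hypothetical names.

```python
import struct

def parse_sstr(payload: bytes) -> list[bytes]:
    """Return SharedString values in chunk order, so list position == binary index."""
    # Assumed header: uint32 version, uint32 count, both little-endian.
    _version, count = struct.unpack_from("<II", payload, 0)
    offset = 8
    values = []
    for _ in range(count):
        offset += 16  # skip the 16-byte hash, which is not used for lookup
        (length,) = struct.unpack_from("<I", payload, offset)
        offset += 4
        values.append(payload[offset:offset + length])
        offset += length
    return values

def build_entry(value: bytes) -> bytes:
    """Build one entry with a zeroed hash, for the demonstration below."""
    return bytes(16) + struct.pack("<I", len(value)) + value

payload = struct.pack("<II", 0, 3) + b"".join(
    build_entry(v) for v in (b"a", b"bb", b"ccc")
)
strings = parse_sstr(payload)
selected = strings[2]  # a property with index 2 selects the third entry
```

With this hand-built payload, `selected` is the third value, `b"ccc"`.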

For the XML format, the hash is used similarly to instance references; it's just some arbitrary string that will map to a SharedString uniquely per file. Roblox chooses to use a BLAKE2B digest of the string's content, but it could just as easily be a numeric index. As long as it's encoded in Base64, any value should work.

To convert between two formats, you could store all the decoded SharedStrings in a list. Then you have a map that maps each string's decoded index to an index of this list. Then, as you decode SharedString properties, you use the index map to convert from the decoded-index to the list-index. Finally, as you encode SharedString properties, you convert the list-index as needed to fit the encoding format.

Here's a toy example:

local SharedStrings = {}
local SharedStringIndexes = {}

function DecodeSharedStringChunk(chunk)
    for _, sharedString in chunk do
        -- Add SharedStrings to list.
        table.insert(SharedStrings, sharedString.Value)
        local decodedIndex = sharedString.Index
        local listIndex = #SharedStrings
        -- Map decoded-index to list-index.
        SharedStringIndexes[decodedIndex] = listIndex
    end
end

function DecodeSharedStringProperty(property)
    -- Convert decoded-index to list-index.
    return {
        Value = SharedStringIndexes[property.Value],
    }
end

function EncodeSharedStringIndex(index)
    -- Let's say this format expects a hexadecimal index.
    return string.format("%X", index)
end

function EncodeSharedStringChunk()
    local chunk = {}
    for index, value in SharedStrings do
        table.insert(chunk, {
            Index = EncodeSharedStringIndex(index),
            Value = value,
        })
    end
    return chunk
end

function EncodeSharedStringProperty(property)
    return {
        Value = EncodeSharedStringIndex(property.Value),
    }
end
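
For the rbxl to rbxlx direction specifically, the same idea can be sketched in Python. This is only an illustration under the assumptions stated above: the binary side identifies a SharedString by its 0-based position, and the XML side accepts any Base64 string as the key. `xml_key` and `convert_shared_strings` are hypothetical names, and the Base64-of-the-index key is just one arbitrary choice.

```python
import base64

def xml_key(list_index: int) -> str:
    """Derive a per-file unique key from the list index, encoded in Base64."""
    return base64.b64encode(str(list_index).encode("ascii")).decode("ascii")

def convert_shared_strings(binary_values, property_indexes):
    """Map binary SharedStrings and property indexes to XML-style keys.

    binary_values: raw SharedString values in SSTR chunk order.
    property_indexes: the 0-based indexes read from SharedString properties.
    """
    # One (key, Base64 value) pair per SharedString element to be written.
    entries = [
        (xml_key(i), base64.b64encode(value).decode("ascii"))
        for i, value in enumerate(binary_values)
    ]
    # Each property stores the key of the string it refers to.
    property_keys = [xml_key(i) for i in property_indexes]
    return entries, property_keys

entries, property_keys = convert_shared_strings(
    [b"Hello, world!", b"\x00\x01"],
    [1, 0, 1],
)
```

Only the pairing between a property and its value has to survive the conversion; the keys themselves can change freely.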

realrunnow commented 1 year ago

Could you give an example of what an SSTR table looks like with strings in it, and how they are used as properties in the rbxlx format? Did I understand correctly that you need to keep track of how the strings are ordered in both formats?

Anaminus commented 1 year ago

You only need to keep track of the order locally to each format. What matters most is that SharedString properties are mapped to the correct SharedString value.

Here are some sample files containing UnionOperations. Both files represent the same data.

unions.zip

realrunnow commented 1 year ago

Hello, I finally understand how they work, but now I have another issue: what format is the SharedString's value in? It doesn't seem to be UTF-8 like in the META chunk, as I am getting an error when trying to read the string in that format.

Anaminus commented 1 year ago

From either format's point of view, it's just raw bytes that don't require any additional conversion.

realrunnow commented 1 year ago

But then how is it in the form of a string in rbxlx?

realrunnow commented 1 year ago

Well, the problem I am having is that this data is represented by a string in the rbxlx format, and I am unable to convert to it from the rbxl file format.

Anaminus commented 1 year ago

As it says in the spec, RBXLX encodes the raw bytes of a SharedString value in Base64.

For example, the encoded form of the raw bytes Hello, world! with index 0 would look like this:

<SharedStrings>
    <SharedString md5="MA==">SGVsbG8sIHdvcmxkIQ==</SharedString>
</SharedStrings>
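
A small Python sketch that produces an element like the one above from raw bytes, using only the standard library. The key is the Base64 of the decimal index, matching the example; `shared_strings_element` is a hypothetical helper name, and a real writer would also need to emit the rest of the file.

```python
import base64
import xml.etree.ElementTree as ET

def shared_strings_element(values):
    """Build a SharedStrings element from a list of raw byte strings."""
    root = ET.Element("SharedStrings")
    for index, value in enumerate(values):
        # Arbitrary per-file key: Base64 of the decimal index.
        key = base64.b64encode(str(index).encode("ascii")).decode("ascii")
        item = ET.SubElement(root, "SharedString", {"md5": key})
        # The value is the raw bytes encoded in Base64, with no text decoding.
        item.text = base64.b64encode(value).decode("ascii")
    return root

xml_text = ET.tostring(
    shared_strings_element([b"Hello, world!"]), encoding="unicode"
)
```

Because the bytes go straight into `base64.b64encode`, no UTF-8 decoding step is involved at any point.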

realrunnow commented 1 year ago

Hi! I finally managed to get the converter to produce (almost) the same result as seen in the XML format. The only difference is that the newlines are in different locations than in the XML format's Base64-encoded strings. Does that matter? It probably doesn't, right?

Anaminus commented 1 year ago

I believe the Base64 decoder ignores any whitespace, so it should be fine.
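
As a quick sanity check outside of Roblox, the following Python sketch wraps the same Base64 text at two different widths and decodes both. It only demonstrates Python's `base64.b64decode`, which in its default non-validating mode discards characters outside the Base64 alphabet; it says nothing definitive about Roblox's own decoder. `rewrap` is a hypothetical helper.

```python
import base64

def rewrap(encoded: str, width: int) -> str:
    """Remove existing whitespace, then insert a newline every `width` characters."""
    compact = "".join(encoded.split())
    return "\n".join(
        compact[i:i + width] for i in range(0, len(compact), width)
    )

raw = bytes(range(256))
encoded = base64.b64encode(raw).decode("ascii")

wrapped_a = rewrap(encoded, 72)
wrapped_b = rewrap(encoded, 60)

# The two texts differ, but both decode to the original bytes.
same_bytes = base64.b64decode(wrapped_a) == base64.b64decode(wrapped_b) == raw
```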

realrunnow commented 1 year ago

oh, okay, that's probably it. But now I am facing a new problem: interleaving in the references table for the INST chunk. I just cannot understand it overall, nor how it maps to the XML format (the ids).

Anaminus commented 1 year ago

I have a reference implementation of interleaving:

https://github.com/RobloxAPI/rbxfile/blob/master/rbxl/arrays.go#L45-L96

Interleaving is like interpreting an array of bytes as a matrix, then transposing it. Because it operates on bytes, it is the first step applied when decoding, and the last step applied when encoding.
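
The transpose view can be sketched in a few lines of Python. This is an illustration written for this thread, not a port of the linked Go code: each value is one row of `size` bytes, and the interleaved output is read column by column. `interleave` and `deinterleave` are hypothetical names.

```python
def interleave(data: bytes, size: int) -> bytes:
    """Treat data as rows of `size` bytes and emit it column by column."""
    if len(data) % size != 0:
        raise ValueError("length must be a multiple of the value size")
    count = len(data) // size  # number of values (rows)
    return bytes(
        data[row * size + col] for col in range(size) for row in range(count)
    )

def deinterleave(data: bytes, size: int) -> bytes:
    """Inverse of interleave: gather the bytes of each value back together."""
    if len(data) % size != 0:
        raise ValueError("length must be a multiple of the value size")
    count = len(data) // size
    return bytes(
        data[col * count + row] for row in range(count) for col in range(size)
    )

# Three 4-byte values: A0 A1 A2 A3, B0 B1 B2 B3, C0 C1 C2 C3.
values = bytes.fromhex("a0a1a2a3b0b1b2b3c0c1c2c3")
mixed = interleave(values, 4)
restored = deinterleave(mixed, 4)
```

Here `mixed` holds all first bytes (A0 B0 C0), then all second bytes, and so on, and `restored` equals `values` again.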


It's a lot easier to handle arrays in RBXL by breaking their encoding into steps, where each step applies a transformation. For example, the spec defines the References type as the following:

[]zint32b~4

[]             array of
  z            zigzag-encoded
   int         integers
      32       32 bits in size
        b      in big-endian
         ~4    interleaved at 4 bytes

additionally, the array is difference-encoded

To encode an array of references to an array of bytes, transformations can be applied sequentially, in order. Each transformation may change the type of the values.

  1. Start with an array of references ([]reference).
  2. Difference encode the array ([]reference).
  3. Convert each value to an int32 ([]int32).
  4. Encode each value with zigzag encoding ([]uint32).
  5. Write each value to bytes in big-endian ([]byte).
  6. Interleave the array ([]byte).
  7. End with an array of bytes ([]byte).
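
The encoding steps above can be sketched in Python as follows. It is a minimal illustration that assumes every reference and every difference fits in a signed 32-bit integer; `encode_references` is a hypothetical name.

```python
import struct

def encode_references(refs: list[int]) -> bytes:
    """Encode references as difference-encoded, zigzag, big-endian, interleaved bytes."""
    # Step 2: difference encode (the first value is kept as-is).
    diffs = [refs[i] - (refs[i - 1] if i else 0) for i in range(len(refs))]
    # Steps 3-4: treat each difference as an int32 and zigzag-encode it,
    # which moves the sign into the lowest bit.
    zigzag = [((d << 1) ^ (d >> 31)) & 0xFFFFFFFF for d in diffs]
    # Step 5: write each value as 4 big-endian bytes.
    flat = b"".join(struct.pack(">I", z) for z in zigzag)
    # Step 6: interleave at 4 bytes (first bytes of all values, then second, ...).
    count = len(zigzag)
    return bytes(flat[v * 4 + b] for b in range(4) for v in range(count))

sequential = encode_references([0, 1, 2, 3])
mixed_sign = encode_references([5, 3])
```

For sequential references the differences are all small, so after interleaving most of the output is runs of zero bytes, which presumably compress well.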

To decode, these steps are applied in reverse.

  1. Start with an array of bytes ([]byte).
  2. Deinterleave the array ([]byte).
  3. Read each value from big-endian bytes ([]uint32).
    • 32 bits is 4 bytes, so each group of 4 bytes makes one value.
  4. Decode each value with zigzag encoding ([]int32).
  5. Convert each value to a reference ([]reference).
  6. Difference decode the array ([]reference).
  7. End with an array of references ([]reference).
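
The decoding steps above can be sketched the same way in Python. Again this is only an illustration, with `decode_references` as a hypothetical name, and it assumes the input length is a multiple of 4.

```python
import struct

def decode_references(data: bytes) -> list[int]:
    """Decode interleaved, big-endian, zigzag, difference-encoded references."""
    if len(data) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    count = len(data) // 4
    # Step 2: deinterleave, so each group of 4 bytes is one value again.
    flat = bytes(data[b * count + v] for v in range(count) for b in range(4))
    # Step 3: read each group of 4 bytes as a big-endian uint32.
    zigzag = [struct.unpack_from(">I", flat, v * 4)[0] for v in range(count)]
    # Step 4: zigzag-decode back to signed integers.
    diffs = [(z >> 1) ^ -(z & 1) for z in zigzag]
    # Steps 5-6: difference decode by keeping a running total.
    refs = []
    total = 0
    for d in diffs:
        total += d
        refs.append(total)
    return refs

sequential = decode_references(bytes.fromhex("00" * 12 + "00020202"))
mixed_sign = decode_references(bytes.fromhex("0000000000000a03"))
```

If the decoded values look wrong in a regular way, for example all multiples of some number, it is worth checking whether the deinterleave or zigzag step was skipped or applied in the wrong order.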

As for the references themselves, the situation is similar to SharedStrings. Instead of the SharedString.md5 attribute, it concerns the Item.referent attribute, which is also an arbitrary string. Roblox chooses to use a randomly generated GUID, but it could easily be a numeric index.
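
As a rough Python sketch of that mapping, a converter could derive each referent string directly from the decoded integer. The `RBX` prefix is an arbitrary, hypothetical choice, and the handling of negative values as a `null` reference is an assumption of this sketch rather than something stated in this thread.

```python
def referent_for(ref: int) -> str:
    """Turn a decoded binary reference into an arbitrary XML referent string."""
    # Assumption: a negative reference means "no instance", written as "null".
    if ref < 0:
        return "null"
    return f"RBX{ref}"

def build_referent_map(refs: list[int]) -> dict[int, str]:
    """Map every decoded reference to the referent string used in the XML output."""
    return {ref: referent_for(ref) for ref in refs}

referents = build_referent_map([0, 1, 2])
parent_of_child = referent_for(0)   # a property pointing at instance 0
missing_parent = referent_for(-1)   # a property pointing at nothing
```

As with SharedString keys, what matters is that every property pointing at an instance uses the same string as that instance's Item.referent attribute.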


In general, if you need reference implementations to look to, there are several:

I don't recall exactly, but I think there are also implementations in JavaScript and Lua somewhere.

realrunnow commented 1 year ago

oh, so the referent in the rbxlx can be anything?

realrunnow commented 1 year ago

btw, is it normal that the references range from 0 to 300?

realrunnow commented 1 year ago

oh, and also, the references are always divisible by 4... I guess that might be a bug in my code, or it may actually help, idk

realrunnow commented 1 year ago

oh and one more thing: there seems to be a new value type, SecurityCapabilities (0x21, i.e. 33).

It seems to only be 0 in the XML format, so I suppose it is the same for the binary format, but I haven't gotten to reading the values yet.