Implement functions for working with binary keyvalues

rossengeorgiev commented 8 years ago

Those are used in richpresence and other places, would be useful for the steam module eventually.

More info: https://github.com/SteamRE/SteamKit/issues/108

[x] implement binary dump/load functions
[x] verify against real data from steam
[x] document usage and type hinting in README

SleepProgger commented 8 years ago

Also used in the PICSProductInfoResponse for the package data.
I'll port the version pysteamkit uses and submit a pull request in a bit.

SleepProgger commented 8 years ago

I finished the first version of the parse_binary function.
826a5d2234c2a691ef2a7d9ea6b545f12fa3a6e9 It works so far for every package data i tested it with. Some test data (base64 encoded) at http://pastebin.com/mf16sHQh . Let me know what you think.

Some notes:

node-steam.js uses an int32 for the Color and Pointer types. I guess we should do the same. https://github.com/seishun/node-steam/blob/master/lib/VDF.js At the moment i raise an exception on those types.
There seem to be an Widestring type but no implementation i saw used it, so i guess we can assume it isn't used widely in steam.
pysteamkit (and it seems node-steam.js, too) is using the native byte order. I am not sure if that is correct and steam knows about the byte order of the client(?). I used explicit little endian, that looks correct, but we should check if it is.
_read_till_seperator is a bit inefficient and we could improve the performance by caching all separator indices we find when reading a chunk, but i am not sure the increased complexity is really worth the performance benefit.

Additionally, i am not sure if creating the load_binary and loads_binary functions or just supplying an optional binary parameter to load/s would be better..

Edit: Added experimental version of dump_binary. b1f96ab700e7486b91a6c0dbe142ecf11ed5d238 It needs the scalars in the form (data_type, value). IMHO, that is a bit strange, but we NEED to know as which type to save the data (esp numeric values). Additionally there is en/decoding missing, plus the same types as in the parse_binary function are missing (BIN_POINTER, BIN_WIDESTRING, BIN_COLOR).

rossengeorgiev commented 8 years ago

Let's just deal with bytes strings. Then we can use indexing and slicing to parse it. I suggest find method for slicing out bin_string.

>>> x = b'bbb\x00aaaa\x00'
>>> x[0: x.find(b'\x00', 0)]
b'bbb'
>>> x[4: x.find(b'\x00', 4)]
b'aaaa'

If we want to make load/dump symmetric then the only way is to use objects for point, widestring, and color.

rossengeorgiev commented 8 years ago

I think that covers everything. I haven't tested it on any real data yet.

SleepProgger commented 8 years ago

I am almost finished with my version, too. IMHO it is a bit cleaner. https://github.com/SleepProgger/vdf/blob/parse_binary_impl/vdf/vdf_binary.py

Changes:

I am using some type wrapper now to contain the types (int, uint, ...). This allows us to reserialize data pretty nicely.
There was a bug in _dump_gen_binary that is now fixed.
I changed Pointer and Color to an UINT32, because that is the most reasonable format for them i guess.
Everything works now with python 2.7 and 3 (at least the two versions i tested)
- I moved all the stuff to another file, to not clutter the vdf namespace

So far i tested it with all the packages i could get from steam, and deserializeing them and serializing them again returned the correct results. The problem is, so far all the data i tested was of the types String and int32. I am looking for other testdata (especially with widestrings in it).

If you want i could merge your changes with mine ?

rossengeorgiev commented 8 years ago

Did you see my implementation? It's feature complete and fully tested.

Here is quick perf comparison.

In [1]: import vdf, vdf_binary
In [2]: raw = vdf.binary_dumps(dict(map(lambda x: (str(x), x),
range(5000))))
In [3]: %timeit vdf.binary_loads(raw)
100 loops, best of 3: 15.1 ms per loop
In [4]: %timeit vdf_binary.parse(raw)
10 loops, best of 3: 54.1 ms per loop
In [5]: raw = vdf.binary_dumps(dict(map(lambda x: (str(x), str(x)*100),
range(5000))))
In [6]: %timeit vdf.binary_loads(raw)
100 loops, best of 3: 15 ms per loop
In [7]: %timeit vdf_binary.parse(raw)
10 loops, best of 3: 103 ms per loop

SleepProgger commented 8 years ago

Well, your parsing version doesn't support streams (and therefore is ofc faster with memory data ;)), but i can live with that. After rereading your code, they are pretty similar in most cases.

Some stuff i noticed:

You declared the struct instances, but aren't using them in _binary_dump_gen (~ int32.pack(value))
I guess POINTER and COLOR should be unsigned ints (at least for pointer that sounds pretty right)
- IMHO it would be wise to conserve if the value was a WIDESTRING, too just in case.

SleepProgger commented 8 years ago

Good news everybody, I tested your implementation against all current existing productInfo i could get (about 3000 objects), and everything went well, and all data is correctly parsed and dumped ( except the minor issue mentioned in #5 ). All the test data can be found here. I will update that repo with other files (from richpresence and other sources which may exist)

Every single binary vdf blob i tested only contained int32 and string types btw.

ValvePython / vdf

Implement functions for working with binary keyvalues #3