CosmicHorrorDev / vdf-rs

VDF parsing and (de)serialization for Rust
Apache License 2.0

The Great Parsing #27

Open CosmicHorrorDev opened 3 years ago

CosmicHorrorDev commented 3 years ago

After working out how to extract the contents of .vpk files in #26, we now have over 60k VDF files to test parsing with, just from the contents of a few Valve games.

The full corpus is much too large, and probably a no-no, to include in here, but I'll hack together a program that tries to parse each file and dumps any that fail to a separate location. Once I get that running I'll post any failures here.
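A rough sketch of what such a corpus runner could look like, using only the standard library. The names `collect_files` and `dump_failures` are made up for illustration, and the `parse` closure is a stand-in for whatever the real parser entry point ends up being. It assumes `failed_dir` lives outside the corpus directory.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collects every file under `dir` into `out`.
fn collect_files(dir: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            collect_files(&path, out)?;
        } else {
            out.push(path);
        }
    }
    Ok(())
}

/// Runs `parse` on every file under `corpus` and copies each file that
/// cannot be read as UTF-8 text or fails to parse into `failed_dir`.
/// `parse` is a hypothetical stand-in for the real parser.
fn dump_failures<F>(corpus: &Path, failed_dir: &Path, parse: F) -> io::Result<Vec<PathBuf>>
where
    F: Fn(&str) -> Result<(), String>,
{
    let mut files = Vec::new();
    collect_files(corpus, &mut files)?;
    files.sort();
    fs::create_dir_all(failed_dir)?;

    let mut failures = Vec::new();
    for (idx, path) in files.iter().enumerate() {
        // `read_to_string` errors on non-UTF-8 input, which counts as a failure here.
        let ok = match fs::read_to_string(path) {
            Ok(text) => parse(&text).is_ok(),
            Err(_) => false,
        };
        if !ok {
            // Prefix the copy with the file's index so identical names don't collide.
            let name = path
                .file_name()
                .map(|n| n.to_string_lossy().into_owned())
                .unwrap_or_default();
            fs::copy(path, failed_dir.join(format!("{}-{}", idx, name)))?;
            failures.push(path.clone());
        }
    }
    Ok(failures)
}
```

The returned list makes it easy to print a summary count once the run finishes.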

CosmicHorrorDev commented 3 years ago

Exactly two files (with the exact same contents) use some weird platform tag identifier thing like so:

"Foo"
{
    "Bar" [$WIN32]
    {
    }
}

Handling this would probably be a pain, especially since I have no clue what the possible values are, and I also don't know all the ways it can be applied. I'm assuming the above would make "Bar" and its value Windows 32-bit exclusive. Can it also be applied to a value that is a string? Where else could it be used?

CosmicHorrorDev commented 3 years ago

It seems common to use \ as a plain path separator rather than as the start of an escape sequence. I suppose the easiest way to handle this would be to not parse escape sequences by default and add an option to parse them, since they seem incredibly rare.
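A minimal sketch of that opt-in behavior. The function name `process_string`, the `parse_escapes` flag, and the exact set of recognized escapes (`\n`, `\t`, `\\`, `\"`) are all assumptions for illustration, not the crate's actual API. It operates on the already-extracted body of a quoted string.

```rust
/// Returns the body of a quoted string, interpreting escape sequences
/// only when `parse_escapes` is set. Both the name and the flag are
/// hypothetical.
fn process_string(raw: &str, parse_escapes: bool) -> String {
    if !parse_escapes {
        // Default: backslashes are ordinary characters (e.g. path separators).
        return raw.to_string();
    }
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some('\\') => out.push('\\'),
            Some('"') => out.push('"'),
            // Unrecognized escape: keep the backslash and the character as-is.
            Some(other) => {
                out.push('\\');
                out.push(other);
            }
            // Trailing lone backslash: keep it.
            None => out.push('\\'),
        }
    }
    out
}
```

With the flag off, a value like `materials\models\foo` passes through untouched, which matches what most of the corpus seems to expect.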

CosmicHorrorDev commented 3 years ago

It seems somewhat common to include a null byte at the end of the file. Not sure if this is specific to packed files and just isn't handled right during extraction, or if this is present normally (hopefully it's just the former, for consistency).
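If the parser ends up tolerating this, one simple option is to drop a single trailing NUL before parsing. A small sketch, where `strip_trailing_nul` is a hypothetical helper name:

```rust
/// Removes one trailing NUL character, if present, and leaves the
/// input untouched otherwise.
fn strip_trailing_nul(input: &str) -> &str {
    input.strip_suffix('\0').unwrap_or(input)
}
```

Only the final character is touched, so a NUL anywhere else in the file would still surface as a parse error.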

CosmicHorrorDev commented 3 years ago

Some files failed to read because they're not UTF-8 encoded. I need to dig into the different encodings used. It may be reasonable to expect users to handle the encoding themselves and convert the text to UTF-8 before handing it to us.
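For illustration, here is what caller-side conversion could look like. Which encodings actually show up in the corpus is still unknown, so treating non-UTF-8 input as UTF-16 little-endian with a byte order mark is purely an assumption of this sketch, and `decode_to_utf8` is a made-up name.

```rust
/// Tries to turn raw file bytes into a `String`.
/// Assumption: anything that is not valid UTF-8 is only handled if it
/// starts with a UTF-16LE byte order mark (FF FE).
fn decode_to_utf8(bytes: &[u8]) -> Option<String> {
    if let Ok(text) = std::str::from_utf8(bytes) {
        return Some(text.to_string());
    }
    if bytes.starts_with(&[0xFF, 0xFE]) {
        let rest = &bytes[2..];
        if rest.len() % 2 != 0 {
            // A dangling byte cannot form a UTF-16 code unit.
            return None;
        }
        let units: Vec<u16> = rest
            .chunks_exact(2)
            .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
            .collect();
        return String::from_utf16(&units).ok();
    }
    // Unknown encoding: leave it to the caller.
    None
}
```

Keeping this outside the parser means the parser itself only ever has to deal with `&str`.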

CosmicHorrorDev commented 3 years ago

It looks like the platform-specific tags may be more common than I first thought, and they do seem to indicate the platform that a value is used for. Here's a snippet from another file:

"xpos"  "r223" [$WIN32]
"xpos"  "r223" [$X360HIDEF]
"xpos"  "r220" [$X360LODEF]

This also shows that it can be used on values that are strings as well. The tags that I've seen so far include WIN32, WIN32WIDE, X360, X360HIDEF, X360LODEF, X360WIDE, DEMO, ENGLISH, JAPANESE, and KOREAN. Beyond that there looks to be some conditional logic that can be used as well, like [$WIN32 && $ENGLISH] or [$WIN32 && !$ENGLISH].

The parsing position is a bit awkward as well, since the tag appears at the end of the pair for Key-String, but between the two tokens for Key-Obj. With how many possible values there are, it doesn't seem worth trying to parse the specifics; we could just return the string for what's inside the brackets.

Of the 16,353 failures, 345 include this.
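As a rough illustration of "just return the string", here is a line-based sketch that splits a trailing `[...]` off a line and hands back its raw contents. A real implementation would live in the grammar rather than work on lines, `split_conditional` is a hypothetical name, and escaped quotes are deliberately ignored here. It covers both positions, since for Key-Obj the tag still ends the line that holds the key.

```rust
/// Splits a trailing `[...]` conditional off a single line.
/// Returns the line without the conditional and, if one was found,
/// the raw text between the brackets.
fn split_conditional(line: &str) -> (&str, Option<&str>) {
    let trimmed = line.trim_end();
    if let Some(without_close) = trimmed.strip_suffix(']') {
        if let Some(open) = without_close.rfind('[') {
            let before = &without_close[..open];
            // An even number of quotes before `[` means the bracket sits
            // outside any quoted string (escaped quotes are not handled).
            if before.matches('"').count() % 2 == 0 {
                return (before.trim_end(), Some(&without_close[open + 1..]));
            }
        }
    }
    (trimmed, None)
}
```

Expressions such as `$WIN32 && !$ENGLISH` come back untouched, leaving any evaluation to the caller.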

CosmicHorrorDev commented 3 years ago

292 of the 16,353 failures use #base.

In those files #base always appears at the top level. I'll have to dig in more to see if #base is ever used to pull in a file that also has a #base of its own.

CosmicHorrorDev commented 3 years ago

Finally, 15,079 files use \ without trying to represent an escaped character, which makes it a very prevalent issue.