ebarnard / rust-plist

A rusty plist parser.
MIT License

binary plist writer #35

Closed by iostat 3 years ago

iostat commented 5 years ago

Hey!

I thought I'd take a stab at #28 as a means of getting more familiar with Rust (and because I need it for another project I'm working on). I opted not to make it a streaming writer like the XML writer is, as you need to know the length of any root object (i.e., array or dictionary) anyway and would end up re-creating much of the logic of Value::from_events. There are some inefficiencies that probably need resolving, and any feedback would be appreciated. Additionally, I think some of the writer mutation could be better encapsulated, but it's a start.

Included are two tests for roundtripping the binary plists in data/, and they seem to work fine.

ebarnard commented 5 years ago

Hi.

First off, thanks for looking into and implementing this. I've got a few thoughts about the implementation that I'd like to talk about but I'm feeling a little ill at the moment so I might not get back to you for a couple of days.

iostat commented 5 years ago

Hey! No rush at all, feel better!

ebarnard commented 5 years ago

Sorry this took so long.

I see two approaches a binary plist writer could take.

Apple's and Python's plist implementations both take the first approach. The cost of holding the plist in memory seems reasonable - on my laptop there are ten or so plists in the tens of megabytes, but the vast majority are around 1MB or smaller - so we should also take this approach.

There are also a couple of requirements the writer should have:

Unfortunately the current implementation suffers from major memory blowup. Writing a 1MB Data nested in 10 arrays will allocate (I think) 11MB, as every element of every array is cloned and stored separately for deduplication. This is fine in Python as it stores references to every element instead of performing deep copies. I can't see a way to have the writer operate on Values, deduplicate, and also not suffer from memory blowup. Have you got any ideas?

I've got a concept for an alternative implementation which avoids memory blowup while still retaining deduplication. It has the added bonus of avoiding Builder, which I have an irrational dislike of.

iostat commented 5 years ago

Yeah, the reason I didn't go for a streaming writer directly is that you don't have any way of knowing how large a ref size you'd need. For example, if I have a dictionary with 2 KV pairs, it might seem sufficient to use a u8 as the ref size, but if one of those values is an array with 250 values (or other dictionaries, nested arrays, etc.) then a u8 is insufficient. Since you can't "unwrite" to go back and change the ref size, you're effectively forced to store the root value in memory until it's done and you know exactly how big the ref sizes have to be.
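For illustration only (this helper and its name are mine, not from the PR): the binary plist trailer records a single fixed object-ref width chosen so that every object index fits, so it depends on the total object count, which a purely streaming writer can't know up front.

```rust
/// Minimum number of bytes needed to encode any object ref,
/// given the total number of objects in the plist.
fn ref_size(num_objects: u64) -> u8 {
    match num_objects {
        0..=0xff => 1,
        0x100..=0xffff => 2,
        0x1_0000..=0xffff_ffff => 4,
        _ => 8,
    }
}
```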

Due to the way flatten works, it shouldn't be duplicating values (in fact, quite the opposite). I think much of the blowup is due to all the clones in the write logic when iterating over the ref table and object list. The reason those are there is that older Rust versions wouldn't let me refer to those fields through an immutable borrow, as I'd then be making a mutable borrow to call self.write_x inside the loop. There is possibly some unnecessary cloning in flatten_inner for dictionaries, mostly because I wasn't sure whether there's any benefit to putting references to all keys in order followed by all values, other than keeping the reader from seeking around too much (and that's how it's done in Apple's and Python's implementations).

I think most of those can be mitigated with better encapsulation of the mutation within the writer. Perhaps some sort of FlattenedValue type could be created which can be built off a stream of events and converted into a map of (ref -> value), and the writer itself could use that instead of the current VecWriter -> Value intermediary. This way you get the "streaming" benefit and better encapsulation of mutation, while still hopefully keeping memory usage proportional to the size of the input (i.e. a constant factor, but not too large a one).
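A rough sketch of the shape being proposed here (names and representation are guesses, not code from the PR; a real version would want a proper lookup table rather than a linear scan, which is exactly where the hashing question below comes in):

```rust
use plist::Value;

// Hypothetical `FlattenedValue`: a flat object list where each
// entry's index is its object ref, built up as events arrive.
struct FlattenedValue {
    objects: Vec<Value>,
}

impl FlattenedValue {
    /// Returns the ref for `value`, inserting it if unseen.
    /// Linear search via `PartialEq` purely for illustration.
    fn upsert(&mut self, value: Value) -> usize {
        if let Some(i) = self.objects.iter().position(|v| *v == value) {
            return i;
        }
        self.objects.push(value);
        self.objects.len() - 1
    }
}
```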

ebarnard commented 5 years ago

> Due to the way flatten works, it shouldn't be duplicating values (in fact, quite the opposite).

In flatten_inner we call upsert_to_object_list recursively for every value in the plist. If the value is not already in object_ref_table, upsert_to_object_list clones it twice, inserting one copy into object_list and the other into object_ref_table. This is what causes the memory blowup.
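For illustration only - a reconstruction of that pattern from the description above, not the PR's actual code (the map type and bounds here are stand-ins):

```rust
use std::collections::BTreeMap;

// Stand-in writer, generic over the value type so the sketch is
// self-contained (the real writer stores plist `Value`s).
struct BinaryWriter<V: Clone + Ord> {
    object_list: Vec<V>,
    object_ref_table: BTreeMap<V, usize>,
}

impl<V: Clone + Ord> BinaryWriter<V> {
    fn upsert_to_object_list(&mut self, value: &V) -> usize {
        if let Some(&r) = self.object_ref_table.get(value) {
            return r;
        }
        let r = self.object_list.len();
        // Two deep copies per previously-unseen value: this is the
        // source of the blowup for large nested values.
        self.object_list.push(value.clone());
        self.object_ref_table.insert(value.clone(), r);
        r
    }
}
```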

Imagine a plist with the structure below:

```
value 1  |  Array [
value 2  |      Array [
value 3  |          Array [
value 4  |              Data(1MB of data)
         |          ]
         |      ]
         |  ]
```

value 1 is the root value and is stored in the value field of a BinaryWriter. This uses 1MB of memory.

flatten is called, which in turn calls flatten_inner for value 1. As value 1 has not been seen before, upsert_to_object_list clones it twice and stores the copies in object_list and object_ref_table. clone performs a deep copy, so we have now used approximately 3MB of memory.

flatten_inner is now called for value 2. As with value 1, it has not been seen before, so it is cloned and stored twice. We are now at 5MB.

The same happens for values 3 and 4, at which point we have used 9MB of memory.

This is pretty much unavoidable if we want to both use Values as the internal storage type and deduplicate arrays and dictionaries instead of just leaf values. Note that Apple's implementation only deduplicates leaf values:

```
// Do not unique dictionaries or arrays, because: they
// are slow to compare, and have poor hash codes.
// Uniquing bools is unnecessary.
```

While we may have better hashing, the comparisons will still be slow, especially for large Values.

I've ended up implementing my idea for an event based writer in #36. It's a little hard to follow at the moment as there's a lot of added complexity due to the requirement to write dictionaries as key, key, ..., value, value, ... instead of interleaved.
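For anyone unfamiliar with the format, a minimal sketch of that layout (helper names are mine, not from #36): a binary plist dictionary stores all of its key refs first, then all of its value refs, rather than interleaving them.

```rust
use std::io::{self, Write};

/// Writes a dictionary's object refs in bplist order: every key
/// ref first, then every value ref.
fn write_dict_refs<W: Write>(
    w: &mut W,
    entries: &[(u64, u64)], // (key ref, value ref) pairs
    ref_size: u8,           // fixed ref width from the trailer
) -> io::Result<()> {
    for &(key_ref, _) in entries {
        write_ref(w, key_ref, ref_size)?;
    }
    for &(_, value_ref) in entries {
        write_ref(w, value_ref, ref_size)?;
    }
    Ok(())
}

/// Refs are big-endian integers truncated to `ref_size` bytes.
fn write_ref<W: Write>(w: &mut W, r: u64, ref_size: u8) -> io::Result<()> {
    let bytes = r.to_be_bytes();
    w.write_all(&bytes[8 - ref_size as usize..])
}
```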