Weebly / Cereal

Swift object serialization
BSD 3-Clause "New" or "Revised" License
369 stars · 13 forks

Extremely slow encoding/decoding on a large number of items #16

Closed Sega-Zero closed 7 years ago

Sega-Zero commented 8 years ago

I wrote an extension to serialize/deserialize dictionaries of cereal objects easily:

extension Dictionary where Key: CerealRepresentable, Value: CerealType {
    func serialize(to filePath: String) throws {
        var encoder = CerealEncoder()
        try encoder.encode(self, forKey: "dictionary")
        let data = encoder.toData()
        try data.writeToFile(filePath, options: [.AtomicWrite])
    }

    mutating func deserialize(from filePath: String) throws {
        guard let data = NSData(contentsOfFile: filePath) else { throw SerializationError.FileNotExists }

        let decoder = try CerealDecoder(data: data)
        guard let result: [Key: Value] = try decoder.decodeCereal("dictionary") else { throw SerializationError.IncorrectData }
        self = result
    }
}

And I ran into a very frustrating fact: when the number of items to be serialized/deserialized runs into the thousands, Cereal consumes a lot of CPU. For example, serializing 8k records takes about 5 seconds on my 6+ device (screenshot: instruments3 2016-04-13 00-19-38).

When I replaced all string[index1..<index2] with string.substringWithRange(index1..<index2) and pre-set the dictionary's capacity (a very rough estimate: assuming the number of commas equals the number of subitems):

private static func parseEncodedCerealDictionaryString<DecodedKeyType: protocol<Hashable, CerealRepresentable>, DecodedValueType: CerealType>(encodedString: String) throws -> [DecodedKeyType: DecodedValueType] {
    // First pass: count the comma-separated chunks to estimate how many
    // entries the dictionary will hold.
    let scanner = NSScanner(string: encodedString)
    scanner.charactersToBeSkipped = nil
    var scanResult: NSString?
    var approximateCapacity = 0
    while !scanner.atEnd {
        scanner.scanUpToString(",", intoString: &scanResult)
        approximateCapacity += 1
        scanner.scanString(",", intoString: nil)
    }
    // Reserving capacity up front avoids repeated rehashing/reallocation
    // while the entries are inserted below.
    var decodedItems = Dictionary<DecodedKeyType, DecodedValueType>(minimumCapacity: approximateCapacity)
    try encodedString.iterateEncodedValuesWithInstantationHandler { keyType, keyValue, type, value in
        let decodedKey: DecodedKeyType = try CerealDecoder.instantiate(keyValue, ofType: keyType)
        decodedItems[decodedKey] = try CerealDecoder.instantiateCereal(value, ofType: type) as DecodedValueType
    }

    return decodedItems
}

the picture changed a little, revealing the real reason for the high load (screenshot: instruments3 2016-04-13 00-29-54).

There are too many CerealDecoder instantiations. Maybe Cereal should move to another implementation, perhaps one based on NSScanner, with very few string reallocations and decoder reuse instead of reallocation.

I couldn't come up with a way to do that while keeping backward compatibility with the existing format; maybe you can point me in the right direction?

Right now I'm thinking of writing a transformer to NSKeyedArchiver and from NSKeyedUnarchiver, since the items in CerealEncoder are stored as strings.

ketzusaka commented 8 years ago

This is something I'm aware of and have done some experimentation on, but I haven't found a suitable solution just yet. Cereal wasn't designed to hold a massive amount of data from the get-go; I figured it'd be small chunks. I just haven't had the time to dig into a solution yet.

When I ran my own profiling on a large data set, it seemed to be related to retain/release traffic on an internal NSString, so this is interesting.

I'm not sure NSScanner would be great for the variable-length inputs the data is currently stored as. I'm open to a 2.0 that breaks compatibility to resolve this.

My next crack at resolving this was going to use a UTF-16 view. The Strings chapter of the recently released "Advanced Swift" book discusses scan time, and in their example the UTF-16 view was 18 times faster than the plain character view.
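For reference, the UTF-16 view idea looks roughly like this (a sketch of the technique from the book in modern Swift, not Cereal code):

```swift
// Sketch: count separators by walking the UTF-16 view directly.
// Comparing raw UInt16 code units skips the grapheme-cluster work
// that makes iterating the Character view comparatively slow.
func countSeparators(in s: String) -> Int {
    let comma: UInt16 = 44  // UTF-16 code unit for ","
    return s.utf16.reduce(0) { $1 == comma ? $0 + 1 : $0 }
}

print(countSeparators(in: "a,b,c"))  // prints 2
```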

Sega-Zero commented 8 years ago

I think this is not a parse-time issue (though that's a good thing to keep in mind too). From my experience writing TLV parsers, it is almost always a string-reallocation issue, best solved with a single linear data buffer, like StringBuilder in Java.

Also, TLV parsing is easier and much faster when the data comes in binary form: a UInt type, a UInt length, and a [UInt32] value. All strings can be transformed into an array of UInt32 character codes and vice versa, so there's no need for special control characters; we can take the UTF-16 view and grab values from it, or initialize a string from an array of UTF-16 scalars.
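The TLV layout described above can be sketched like this (hypothetical tags and field widths for illustration, not Cereal's actual wire format):

```swift
// Minimal TLV sketch: each record is
//   [type: UInt8][length: UInt64, little-endian][value bytes].
enum TLVType: UInt8 {
    case string = 1
    case int = 2
}

func encodeTLV(type: TLVType, value: [UInt8]) -> [UInt8] {
    var out: [UInt8] = [type.rawValue]
    // A fixed-width binary length replaces control-character delimiters.
    withUnsafeBytes(of: UInt64(value.count).littleEndian) { out.append(contentsOf: $0) }
    return out + value
}

// Strings become raw UTF-8 bytes, so no escaping is ever needed.
let record = encodeTLV(type: .string, value: Array("hi".utf8))
print(record.count)  // 1 tag byte + 8 length bytes + 2 value bytes = 11
```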

I'll try to implement it the way I see it over the weekend, but I'm not sure I can refactor all the code =)

My first thought when I saw Cereal was "what a great alternative to NSCoder". So if it's not yet as awesome as NSCoder, it certainly should be :)

ketzusaka commented 8 years ago

I attempted to move everything to a single NSString, with NSRanges referencing the necessary chunks to avoid the reallocation overhead, but that didn't solve the problem for me. It didn't even reduce the time in the profiler =(

Thanks for digging into it!

Sega-Zero commented 8 years ago

I've begun implementing my vision of the speed improvements and have implemented a few TLV structs. The next step is to keep an internal dictionary of TLV structs in the encoder, plus a fast, lightweight initializer for CerealDecoder that just copies a portion of the raw bytes. We'll see if this turns out faster. I may have to work with a pointer to the root TLV struct. At least it's now easier to get the data out of NSData.
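The "lightweight initializer" idea could look something like this (illustrative names and layout, not the branch's actual API):

```swift
// Sketch: a decoder init that only keeps a slice of the raw bytes and
// reads TLV fields lazily on demand, instead of eagerly building a tree
// of parsed values (the expensive part in the current implementation).
struct LightDecoder {
    let bytes: ArraySlice<UInt8>

    init(bytes: ArraySlice<UInt8>) {
        // Cheap: no string conversion, no eager parsing.
        self.bytes = bytes
    }

    // Read a little-endian UInt64 length field at a given offset.
    func length(at offset: Int) -> UInt64 {
        var value: UInt64 = 0
        for i in 0..<8 {
            value |= UInt64(bytes[bytes.startIndex + offset + i]) << (8 * i)
        }
        return value
    }
}

// A fake record: type 1, length 3, value "hey".
let payload: [UInt8] = [1, 3, 0, 0, 0, 0, 0, 0, 0, 104, 101, 121]
let decoder = LightDecoder(bytes: payload[...])
print(decoder.length(at: 1))  // prints 3
```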

What do you think of all this? I'll continue digging next weekend.

ketzusaka commented 8 years ago

Just wanted to let you know I do have this on my TODO list. Hopefully before the end of the week. Sorry about the delay :(

Sega-Zero commented 8 years ago

I've changed things a little, but haven't pushed to my branch yet. I haven't touched the decoder itself; I'm trying to speed up the encoder first. I'll push my experiments ASAP; maybe you could take a look and propose a better way. So far I've only managed to speed up encoding from 2165 ms down to 433 ms, but there is probably more room to optimize.

Sega-Zero commented 8 years ago

But the good news is that the NSCoding experiment with the same data (NSObjects instead of structs) shows about 1100 ms, so my implementation is now twice as fast =)

Sega-Zero commented 8 years ago

I've pushed some commits to my branch. I'll try to refactor the decoder now, since I'm out of ideas for further encoding improvements. If you have any, please share them with me.

Also, all the tests will need to be rewritten once the refactoring is done.

As an illustration of the improvement, I'm attaching profiler screenshots: the new encoder (instruments2 2016-04-28 02-05-09) and the old encoder (instruments2 2016-04-28 02-06-25).

ketzusaka commented 8 years ago

@Sega-Zero Finally dug through it. Looks really awesome!

Some general code review thoughts:

I'm also curious how it would behave with stringMap not being inout. It doesn't seem to be written back to, and Swift's Dictionary/Array are copy-on-write, so there shouldn't be any performance benefit to passing them inout.
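The copy-on-write behavior referred to here can be seen in isolation (a generic sketch, unrelated to Cereal's internals):

```swift
// Copy-on-write sketch: reading a Dictionary passed by value is cheap,
// because the underlying storage buffer is shared between copies until
// one of them is mutated.
func lookup(_ key: String, in map: [String: [UInt8]]) -> Int {
    // `map` shares storage with the caller here; nothing is deep-copied
    // because this function never mutates it.
    return map[key]?.count ?? 0
}

let cache: [String: [UInt8]] = ["name": [1, 2, 3]]
print(lookup("name", in: cache))  // prints 3
```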

Sega-Zero commented 8 years ago

Yeah, that's not the final state of the sources; it's more of a proof of concept. As for the review, I mostly agree with everything, with some additions:

There is a benefit to inout, since strings are written back to it in a String extension; see the writeToBuffer(_:stringMap:) implementation. The slowest serialization operation is string encoding, so stringMap is used to cache the byte array for duplicate strings. Using this dictionary actually gives us an additional 20-30 ms speedup. Without inout, things slow down on copy operations.
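The duplicate-string cache can be sketched like this (illustrative names, not the branch's actual writeToBuffer(_:stringMap:)):

```swift
// Sketch of the stringMap idea: cache the UTF-8 bytes of each distinct
// string so duplicates are converted only once. The map must be inout
// so entries written here survive across calls.
func append(_ string: String, to buffer: inout [UInt8], stringMap: inout [String: [UInt8]]) {
    if let cached = stringMap[string] {
        buffer.append(contentsOf: cached)
        return
    }
    let bytes = Array(string.utf8)  // the expensive conversion
    stringMap[string] = bytes       // written back through inout
    buffer.append(contentsOf: bytes)
}

var buffer: [UInt8] = []
var stringMap: [String: [UInt8]] = [:]
append("mail@mail.com", to: &buffer, stringMap: &stringMap)
append("mail@mail.com", to: &buffer, stringMap: &stringMap)  // cache hit
print(buffer.count)     // 26 (13 bytes, twice)
print(stringMap.count)  // 1
```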

Right now I'm not satisfied with the tree-to-bytes conversion time, but I couldn't come up with anything to speed it up. I hope there is a way to get a string's bytes much faster than Array(string.utf8), or maybe even to dump the whole tree as-is, but my question on SO is being ignored. I'll try to reach out to Mike Ash for advice =)

Sega-Zero commented 8 years ago

OK, I've finally refactored the decoder too. The profiling results are not as satisfying as I wanted, but still pretty impressive (screenshots: encoding, decoding).

These are the results of encoding/decoding 10k structs. See this gist for the code I use to profile everything.

We can improve the speed by shortening the decoding keys to one letter: encode: 372 ms, decode: 493 ms.

We can also win more time by serializing the employee name as two strings instead of a sub-structure: encode: 323 ms, decode: 484 ms.

So the winning strategy is to avoid subtypes as much as possible in favor of flat data, and to reduce the length of the decoding keys.

Now I need your help. First, I need a code review =) Second, all the tests need to be rewritten, and I'm not sure of the best way to do that with binary data: should I prepare test files, or should I include byte arrays in the test bodies?

Sega-Zero commented 8 years ago

After some more searching on the web, I found a solution that saves even more decoding time! Instead of 509 ms, I'm now at 432 ms, or 411 ms with the decoding improvements mentioned above! (screenshot: instruments4 2016-05-02 02-24-02)

I find this satisfying enough, so I only need to rewrite the tests and I'm ready to send a pull request.

Sega-Zero commented 8 years ago

I'm not sure I'm writing the tests correctly. Does this kind of test look fine?

func testtoBytes_withEmptyString() {
    var subject = CerealEncoder()
    do {
        try subject.encode("", forKey: "string")
        let result = subject.toBytes()
        XCTAssertEqual(result, [11,25,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,9,1,6,0,0,0,0,0,0,0,115,116,114,105,110,103,1,0,0,0,0,0,0,0,0])
    } catch let error {
        XCTFail("Encoding failed due to error: \(error)")
    }
}
ketzusaka commented 8 years ago

Heyo!

Few responses from the past messages you left:

Yeah, the test looks fine. The only thing I'd change is the function name, to "testToBytes_withEmptyString"; I know the capitalized To doesn't match the method being tested, but it's a bit more readable and doesn't need to match 100%.

Thanks again for doing all this! It'll be a huge win!

Sega-Zero commented 8 years ago

I'll finish the test refactoring tomorrow and will be ready for the pull request review =)

Sega-Zero commented 8 years ago

OK, I've finally refactored all the tests! Whoa, that was not as easy as I thought! I've created pull request #19, ready for code review.

Also, I've prepared a new performance test:

import XCTest
@testable import Cereal

private let contact = TestEmployee(id: "12",
                                   name: NameData(givenName: "first", familyName: "last"),
                                   company: "company",
                                   data: [TestEmployee.ExtraInfo(data: "00000", kind: .Phone)],
                                   emails: [TestEmployee.EmailInfo(email: "mail@mail.com", kind: .Home), TestEmployee.EmailInfo(email: "mail@mail.com", kind: .Work)])

private let employees: [TestEmployee] = Array(count: 10_000, repeatedValue: contact)

class CerealPerformanceTests: XCTestCase {

    func testEncodeDecodePerformance() {
        measureBlock() {
            var encoder = CerealEncoder()
            let _ = try? encoder.encode(employees, forKey: "test")
            let data = encoder.toData()
            print("encode result: \(data.length) bytes")

            guard let decoder = try? CerealDecoder(data: data) else {
                XCTFail("unable to decode from encoded data")
                return
            }

            do {
                let decodedResult: [TestEmployee]? = try decoder.decodeCereal("test")
                print("decode result: \(decodedResult?.count)")
            } catch let error {
                XCTFail("unable to decode from encoded data with \(error)")
            }
        }
    }
}

but when I run it, it uses the debug build, which is a lot heavier than release. I couldn't find a way to run this test in release mode, so I didn't include it in the pull request.

leolobato commented 8 years ago

Quick heads-up: I was using Cereal 1.3 in an app and was looking for a way to improve persistence performance, so I updated to 2.0.

I haven't done any real profiling yet, but 2.0 appears significantly slower than 1.3, at least on my data (a large tree of classes and structs), so it might be a good idea to investigate the performance impact of this change a bit more. On 1.3, my data took 461 ms to persist; on 2.0, it takes 837 ms.

Also, it would be good to add a heads-up to the README, since 2.0 is not backwards compatible with 1.3 and people might update without noticing and lose data.

Sega-Zero commented 8 years ago

Could you provide some samples so the regression can be profiled? Also, a large number of leaves in the serialized tree will affect speed on both 1.3 and 2.0, so it's always better to keep the tree as flat as possible.

ketzusaka commented 8 years ago

Aye, profiling would be nice. I'll update the README.

leolobato commented 8 years ago

I ended up using regular NSCoding with some custom encoding for the enums and structs, and it was 10x faster than Cereal 1.3 on a tree of ~100 objects, so there's a lot of room for improvement here. :)

I'll try to come up with some sample data and profile it later, but I can confirm it really was an issue for our app, so I'd be cautious about pushing this to master.

Sega-Zero commented 7 years ago

Closing this since no sample was provided. Feel free to comment here with samples attached and I'll definitely fix it!