firebase / firebase-ios-sdk

Firebase SDK for Apple App Development
https://firebase.google.com
Apache License 2.0

Firestore.Decoder is slow compared to JSONDecoder #13849

Open collinhundley opened 2 weeks ago

collinhundley commented 2 weeks ago

Description

Decoding an array of QueryDocumentSnapshot using Firestore.Decoder is very slow compared to decoding a similar array using JSONDecoder:

```swift
// 50k documents
let documents = try await myCollection.getDocuments().documents
// Avg 9.6s decode
let items = documents.compactMap { try? $0.data(as: Item.self) }

// Same 50k items
let data = try JSONEncoder().encode(items)
// Avg 2.4s decode
let result = try JSONDecoder().decode([Item].self, from: data)
```

What's more, we don't gain any performance from parallelization. For example, if we split documents into 10 arrays of 5k documents each, and initialize one Firestore.Decoder for each group:

```swift
let items = await withTaskGroup(of: [Item].self) { group in
    // chunkedDocuments: [[QueryDocumentSnapshot]]
    for chunk in chunkedDocuments {
        group.addTask {
            let decoder = Firestore.Decoder()
            return chunk.compactMap { try? $0.data(as: Item.self, decoder: decoder) }
        }
    }
    var items = [Item]()
    for await result in group {
        items.append(contentsOf: result)
    }
    return items
}
```

...the total time is worse: about 10.3s.

So there are 2 issues:

1) Why is Firestore.Decoder 4x slower than JSONDecoder?
2) Does Firestore.Decoder utilize some shared internal state that prevents instances from running independently on different threads?

Reproducing the issue

No response

Firebase SDK Version

10.29

Xcode Version

16

Installation Method

Swift Package Manager

Firebase Product(s)

Firestore

Targeted Platforms

iOS, macOS

Relevant Log Output

No response

If using Swift Package Manager, the project's Package.resolved

No response

If using CocoaPods, the project's Podfile.lock

No response

google-oss-bot commented 2 weeks ago

I couldn't figure out how to label this issue, so I've labeled it for a human to triage. Hang tight.

mortenbekditlevsen commented 2 weeks ago

I don't work at Google, but I contributed some of this functionality, so I have some input. :-)

The encoder and decoder shipped with Firebase are based on the open source implementation that was part of Swift until it was recently replaced by a much more performant implementation.

The 'old' Swift implementation of JSONDecoder started out with Data, then used (an Objective-C implementation of) JSONSerialization to convert the Data into a structure built from Dictionaries and Arrays, representing the JSON document. After this, the structure was decoded into the actual Decodable types.

Since both Firestore and the Realtime Database have similar structural representations (nested dictionaries and arrays), the decoder in Firebase is basically a copy of what Swift used to do, except that it skips the initial conversion from Data to a structure performed by JSONSerialization.

So that's basically what it is. There are a few extra details - the structural representation in Firestore can contain types that don't exist in JSON, like the built-in Timestamp and GeoPoint - but all in all it's just a copy of what the Swift Foundation JSONDecoder used to do.
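To make the two phases concrete, here's a minimal sketch of that old pipeline (illustrative only - `Item` here is a stand-in for your own Decodable type):

```swift
import Foundation

struct Item: Decodable { let name: String }

let data = #"{"name": "example"}"#.data(using: .utf8)!

// Phase 1: JSONSerialization parses Data into loosely typed Foundation
// objects (NSDictionary/NSArray/NSNumber/NSString), surfaced as Any.
let tree = try JSONSerialization.jsonObject(with: data) as! [String: Any]

// Phase 2: the decoder walks that Any tree, dynamically casting every
// value to whatever the Decodable conformance asks for.
// Firestore.Decoder starts directly at phase 2, because a snapshot's
// data() already *is* such a [String: Any] tree:
//
//     let item = try Firestore.Decoder().decode(Item.self, from: tree)
```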

Since then, JSONSerialization has been rewritten - and JSONDecoder too. JSONSerialization now has an internal mode that, instead of converting into Dictionaries and Arrays, represents the parsed data as an enum where each case corresponds to a JSON value type.

This representation is then used in the re-implemented JSONDecoder and the stronger typing is probably part of why it's so much faster.
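Roughly, that internal representation looks something like this (an illustrative sketch - the real enum is internal to Foundation, not public API):

```swift
// One case per JSON value type; decoding becomes a switch over cases
// instead of a dynamic cast out of Any.
enum JSONValue {
    case null
    case bool(Bool)
    case number(String)   // raw digits, converted only when a concrete type is requested
    case string(String)
    case array([JSONValue])
    case object([String: JSONValue])
}
```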

So - that's the state of the current decoder - and perhaps partly an explanation about why the built-in JSONDecoder is faster.

Would it be possible to use the newer Foundation implementation in Firebase too? The enum representation of the JSON structure is entirely internal to Foundation, so it can't directly be used. But it could of course be cloned. The issue would then be to get Firestore to fetch snapshots in this optimized structure rather than as structures of Dictionaries and Arrays. I haven't looked at the Firestore code enough to be able to judge if that is possible. There is also the question of the 'native' Firestore types that are not JSON-compatible.

With regard to your question about parallelization, there's nothing in the actual decoder that does any form of synchronization. So I'm guessing that it's actually accessing the data on a Firestore document snapshot that involves some synchronization. One guess would be that it fetches data from a shared caching layer, and access to that could require synchronization?

You could try instrumenting your sample code above to see where most of the time is being spent in both the serial and parallel examples - perhaps there is some low-hanging fruit?
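For instance, wrapping the per-chunk decode in signposts makes the timing visible in Instruments' Points of Interest track (a sketch - the subsystem string is a placeholder, and `chunk`, `decoder`, and `Item` are from your snippets above):

```swift
import os.signpost

let poiLog = OSLog(subsystem: "com.example.app", category: .pointsOfInterest)

let spid = OSSignpostID(log: poiLog)
os_signpost(.begin, log: poiLog, name: "decode-chunk", signpostID: spid)
let decoded = chunk.compactMap { try? $0.data(as: Item.self, decoder: decoder) }
os_signpost(.end, log: poiLog, name: "decode-chunk", signpostID: spid)
```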

All of this is just my 2 cents. :-)

MarkDuckworth commented 2 weeks ago

@collinhundley, I'm marking this as a feature request for a more performant Firestore.Decoder. We will continue to listen for community feedback to help us prioritize the effort.

In the meantime, your approach of reducing the time to decode seems reasonable. Are you able to instrument your code as @mortenbekditlevsen suggested, or even add some simple logging to confirm that your tasks are indeed running in parallel? Are you running these tests on a simulator or on an actual device?

collinhundley commented 2 weeks ago

@mortenbekditlevsen thank you for the detailed response.

With regard to parallelization, I did some further testing, which leads me to believe that document access is not the bottleneck here. We can remove document access from the equation by converting all documents to dictionaries up front:

```swift
let dictStart = CACurrentMediaTime()
let dictionaries = documents.map { $0.data() }
print(CACurrentMediaTime() - dictStart) // 0.75s

let chunkedDictionaries = dictionaries.chunked(into: 5000) // [[[String : Any]]]

let decodeStart = CACurrentMediaTime()
let items = await withTaskGroup(of: [Item].self) { group in
    for chunk in chunkedDictionaries {
        group.addTask {
            let decoder = Firestore.Decoder()
            return chunk.compactMap { try? decoder.decode(Item.self, from: $0) }
        }
    }
    var items = [Item]()
    for await result in group {
        items.append(contentsOf: result)
    }
    return items
}
print(CACurrentMediaTime() - decodeStart) // 10.9s
```

So 94% of the time is spent on actual decoding.
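(For reference, `chunked(into:)` above isn't in the standard library - it's the usual small extension:)

```swift
extension Array {
    /// Splits the array into consecutive chunks of at most `size` elements.
    func chunked(into size: Int) -> [[Element]] {
        stride(from: 0, to: count, by: size).map {
            Array(self[$0 ..< Swift.min($0 + size, count)])
        }
    }
}
```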

Profiling confirms that 10 threads are fully utilized concurrently on my M1 Max. Here's a sample from one of these threads, expanded a few levels:

```
100.0%    0 s         closure #1 in closure #2 in DataLayer.getAllItems(for:source:)
100.0%    0 s          specialized Sequence.compactMap<A>(_:)
100.0%    0 s           specialized Sequence.compactMap<A>(_:)
100.0%    0 s            closure #1 in closure #1 in closure #2 in DataLayer.getAllItems(for:source:)
100.0%    1.00 ms         FIRFirestore.Decoder.decode<A>(_:from:)
99.9%     2.00 ms          FirebaseDataDecoder.decode<A>(_:from:)
99.9%     0 s               __JSONDecoder.unbox<A>(_:as:)
99.9%     0 s                __JSONDecoder.unbox_(_:as:)
99.4%     0 s                 protocol witness for Decodable.init(from:) in conformance Item
99.4%     1.00 ms              Item.__allocating_init(from:)
99.4%     2.00 ms               Item.init(from:)
96.3%     0 s                    protocol witness for KeyedDecodingContainerProtocol.decode<A>(_:forKey:) in conformance _JSONKeyedDecodingContainer<A>
2.8%      1.00 ms                protocol witness for KeyedDecodingContainerProtocol.decodeIfPresent(_:forKey:) in conformance _JSONKeyedDecodingContainer<A>
0.3%      0 s                    protocol witness for Decoder.container<A>(keyedBy:) in conformance __JSONDecoder
0.0%      0 s                    protocol witness for KeyedDecodingContainerProtocol.decodeIfPresent<A>(_:forKey:) in conformance _JSONKeyedDecodingContainer<A>
0.4%      32.00 ms            static FirestorePassthroughTypes.isPassthroughType<A>(_:)
0.1%      4.00 ms             dynamic_cast_existential_1_conditional
0.0%      0 s                 _JSONDecodingStorage.push(container:)
0.0%      2.00 ms           FirebaseDataDecoder.__allocating_init()
0.0%      1.00 ms           FIRFirestore.Encoder.keyEncodingStrategy.setter
```

I'm a little stumped - any other ideas where synchronization might be occurring?

CC @MarkDuckworth

MarkDuckworth commented 2 weeks ago

What do you see if you print elapsed time at the beginning of each task? Or maybe print a start and stop identifier for each task?

```swift
...
    for (i, chunk) in chunks.enumerated() {
        group.addTask {
            print("start decoding \(i)")
            print(CACurrentMediaTime() - decodeStart)

            let decoder = Firestore.Decoder()
            let result = chunk.compactMap { try? decoder.decode(Item.self, from: $0) }

            print("stop decoding \(i)")
            return result
        }
    }
```

collinhundley commented 2 weeks ago

Ok, another update:

By manually testing a variety of batch sizes, I'm able to squeeze out some extra performance by reducing the number of tasks to 4 (12,500 documents each). The decoding time drops to 5.2s - down from the original 9.6s on a single thread.

So I was wrong to say that the decoder can't be parallelized at all. However, there seems to be a lot of performance left on the table... I originally chose 10 tasks to match my 10 M1 Max cores, which I know is the number of tasks TaskGroup will run concurrently on my machine.

Are we maxing out some other resource then? I can't manually tweak the batch size for every CPU out there... it would be preferable to let TaskGroup take advantage of all available cores.
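The batch-size sweep itself is roughly the following (a sketch - it reuses `dictionaries`, `Item`, and `chunked(into:)` from the earlier snippets):

```swift
for taskCount in [2, 4, 6, 8, 10] {
    let chunkSize = (dictionaries.count + taskCount - 1) / taskCount
    let chunks = dictionaries.chunked(into: chunkSize)
    let start = CACurrentMediaTime()
    let items = await withTaskGroup(of: [Item].self) { group in
        for chunk in chunks {
            group.addTask {
                let decoder = Firestore.Decoder()
                return chunk.compactMap { try? decoder.decode(Item.self, from: $0) }
            }
        }
        var items = [Item]()
        for await result in group {
            items.append(contentsOf: result)
        }
        return items
    }
    print("\(taskCount) tasks: \(CACurrentMediaTime() - start)s for \(items.count) items")
}
```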

collinhundley commented 2 weeks ago

What do you see if you print elapsed time at the beginning of each task? Or maybe print a start and stop identifier for each task?

@MarkDuckworth I've done exactly this, and interestingly they all start and stop at nearly the same time:

```swift
for (i, chunk) in chunks.enumerated() {
    group.addTask {
        print("Start \(i): \(CACurrentMediaTime())")
        let decoder = Firestore.Decoder()
        let res = chunk.compactMap { try? decoder.decode(Item.self, from: $0) }
        print("Finish \(i): \(CACurrentMediaTime())")
        return res
    }
}
```

```
Start 0: 87516.98317362502
Start 1: 87516.98320329169
Start 2: 87516.98323158335
Start 3: 87516.98324508335
Start 4: 87516.98336229169
Start 5: 87516.98339979169
Start 6: 87516.98344200003
Start 7: 87516.98347045835
Start 8: 87516.98390000002
Start 9: 87516.98391875002
Finish 6: 87527.66197037502
Finish 0: 87527.68356191668
Finish 5: 87527.68613600002
Finish 2: 87527.68647504169
Finish 1: 87527.68823379169
Finish 7: 87527.69325829169
Finish 4: 87527.69522533336
Finish 3: 87527.69923245836
Finish 9: 87527.70061004169
Finish 8: 87527.70436450002
```

Wouldn't you expect the efficiency cores to take a bit longer than the rest?