CombineCommunity / CombineExt

CombineExt provides a collection of operators, publishers, and utilities for Combine that are not provided by Apple but are common in other reactive frameworks and standards.
https://combine.community
MIT License

combineLatest & zip: make scalable using divide-and-conquer. #130

Closed fischman-bcny closed 2 years ago

fischman-bcny commented 2 years ago

Previously combineLatest's implementation would trigger a stack overflow on 15000 publishers because the implementation combined its inputs linearly. This commit replaces that with repeated (hierarchical) 4-way merges, which lets the same scaled test scenario go to 1e7 publishers with no issues (other than taking 28s to run the test on my M1 Pro, hence the choice to commit the test with only 1e5 publishers). Test execution wall-time scales linearly with publisher count.

Re: "4-way merge":

All of the above applies remarkably similarly to zip, also changed in the same way in this PR.
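The hierarchical reduction at the heart of the change is language-agnostic; here is a minimal Python model of the repeated 4-way merge (a hypothetical sketch of the reduction shape, not the PR's actual Swift/Combine code):

```python
from functools import reduce

def reduce_4way(items, combine):
    """Collapse `items` by repeatedly combining runs of up to 4.

    Each pass shrinks the list by ~4x, so the nesting depth of the
    resulting combination is O(log4 n) rather than the O(n) depth a
    linear left-to-right fold would build.
    """
    assert items, "need at least one item"
    while len(items) > 1:
        items = [reduce(combine, items[i:i + 4])
                 for i in range(0, len(items), 4)]
    return items[0]

# 1e5 inputs combine without any deep nesting
print(reduce_4way(list(range(100_000)), lambda a, b: a + b))  # → 4999950000
```

In the real PR the `combine` step is a 4-ary `CombineLatest` (or `Zip`) publisher rather than an arithmetic fold, but the depth argument is the same.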

freak4pc commented 2 years ago

Thanks for the PR! What's the use case you have for trying to combineLatest 100k publishers? It seems interesting but might be heavily over-optimized, losing the current code clarity, so I want to make sure there's clear value to this.

Also, with the existing implementation, how many "long seconds" does it take?

fischman-bcny commented 2 years ago

Thanks for the PR!

Thanks for the quick look!

What's the use case you have for trying to combineLatest 100k publishers?

1e5 is overkill for my use-case, but the test in question is super-light on what each publisher does. Context on this PR is that I had attempted to replace uses of a home-grown (much longer but incomplete) version of combineLatest with CombineExt's version in a large (but closed-source, sorry) project and got push-back during review that CombineExt's version had known scaling limits, and a worry that "real" (much heavier-weight than Just(2) :) ) publishers in the project in question might trigger this limit with "only" hundreds of publishers. This PR is an attempt to address that push-back :)

BTW this exchange also prompted me to binary-search a tighter bound; turns out the current impl crashes at 14474 publishers. Updated PR description to reflect this.

It seems interesting but might be heavily over-optimized, losing the current code clarity, so I want to make sure there's clear value to this.

ISTM that any API that accepts a Collection and turns it into unoptimiz{ed,able} tail-called recursion is fraught, because an array with 10k elements isn't something that screams "huge! consider the implications!" to me. (unlike, say, an array with 1e9 elements). Turning your question around: would it be better to keep the current implementation of combineLatest and add a comment saying "consider the scaling implications on collections of more than N publishers" for some explicit value of N?
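To make the hazard concrete, here is a tiny Python stand-in (hypothetical, not CombineExt code) showing how a linearly nested combination blows the stack long before the input looks "huge":

```python
import sys

def nested_linear(values):
    # One stack frame per element, mirroring a linearly nested
    # combineLatest(a, combineLatest(b, combineLatest(c, ...))) chain.
    if len(values) == 1:
        return values[0]
    return nested_linear(values[:-1]) + values[-1]

print(nested_linear(list(range(500))))   # fine: 124750
try:
    nested_linear(list(range(100_000)))  # depth ~1e5 >> default limit
except RecursionError:
    print("RecursionError; recursion limit =", sys.getrecursionlimit())
```

The exact crash point differs by runtime (Python's default limit is ~1000 frames; Swift's is bounded by thread stack size, hence the ~14.5k figure above), but the failure mode is the same: depth proportional to input size.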

Also, with the existing implementation, how many "long seconds" does it take?

Thanks for asking and making me look! It took something like 20-40s to go from starting the test to showing the crash in Xcode, but because the test would pause at the crash (in lldb) I didn't have an exact duration. Attempting to find that duration by running the test outside lldb revealed that it's actually super-fast, too, and that it's lldb's generation of a backtrace and Xcode displaying it that was taking all those "long seconds". Elided that claim from the PR's description.

freak4pc commented 2 years ago

Thanks for the detailed response.

Here I'm feeling a bit indecisive, so I want to share a bit of my thinking process :)

I'll start from the end here:

because an array with 10k elements isn't something that screams "huge! consider the implications!" to me.

I'm having a hard time with this sentence. 10k elements in an array, sure. But 10k Publishers, where each publisher can be some work, and each change to any of the 10,000 elements would emit a recombination of all 10,000 elements — that isn't really a common use case. To be honest, I've been working with Reactive Tech for ... 8 years, and I've never had to CombineLatest 10,000 (or even 1000) different Observables/Publishers.

The code you wrote is more efficient, for sure, but it loses a lot of readability. This isn't about being "pretty"; it's more about "can a different contributor come in easily and feel comfortable working on this", and I'm worried the answer is probably "no", which is a relatively big sacrifice.

So my current thinking is that I'm not entirely sure this change is worth the cost. I'd love to get @jasdev's opinion because from one side he understands that balance IMO, and from the other end he hit these performance limits himself, too.

I will say the use case with zip is actually much more obvious - zipping 100,000 requests is extreme, but I can think of some use cases, and once they're done, it's done. combineLatest is basically never-ending with that amount of publishers, so I'm very curious about the use case here.

Anyways, I'm leaving this open, and I hope @jasdev will be able to chime in. I'll try to ping him directly if not.

Again, thank you so much for the optimization, it's a really interesting discussion which is very hard to find a good balance on 🙏

fischman-bcny commented 2 years ago

Thanks for the thoughts, @freak4pc. Funny you mention zip b/c I just got done applying the same treatment to it (I had forgotten @jasdev's blog post implicated it as well until I cited it to you above :) ). Added to this PR. Looking forward to hearing Jasdev's thoughts.

jasdev commented 2 years ago

Sorry for the delayed reply here and love all the discussion!

My lean is to land this change. With concurrency being added directly into Swift, the writing is on the wall that Combine won’t be receiving updates anymore. Which, in my opinion, means it probably isn’t worth it to try and write custom variadic zip and combineLatest publishers à la the precedent in RxSwift, OpenCombine, et al.

So, if writing a custom publisher — instead of the existing composed approach — is too much effort for a library (Combine) that likely won’t be updated and the existing approach has perf. bottlenecks, then @fischman-bcny’s PR seems like a nice middle ground to raise the perf. ceiling.

…it’s more about “can a different contributor come in easily and feel comfortable working on this,” and I’m worried it’s probably a “No,” which is a relatively big sacrifice.

After this change, the only update I can realistically see for these operators is moving them over to a custom publisher, which, for the above reason, I don't think is worth the effort, unless someone wanted to work on it as a learning exercise. 😄

tl;dr: I’m pro-landing this change. ✅

fischman-bcny commented 2 years ago

@freak4pc back to you for an executive decision given @jasdev's input above, I think.

freak4pc commented 2 years ago

@fischman-bcny I'll go with the democratic decision and will accept the solution. Thanks for the patience! We have a company trip tomorrow so I won't be available until Thursday but I'll review when I'm back.

codecov[bot] commented 2 years ago

Codecov Report

Merging #130 (6ef96d8) into main (38a4d4c) will increase coverage by 0.05%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #130      +/-   ##
==========================================
+ Coverage   95.50%   95.56%   +0.05%     
==========================================
  Files          68       68              
  Lines        3781     3829      +48     
==========================================
+ Hits         3611     3659      +48     
  Misses        170      170              
Impacted Files                               Coverage Δ
Sources/Operators/CombineLatestMany.swift    100.00% <100.00%> (ø)
Sources/Operators/ZipMany.swift              100.00% <100.00%> (ø)
Tests/CombineLatestManyTests.swift            97.65% <100.00%> (+0.24%) ↑
Tests/ZipManyTests.swift                      97.20% <100.00%> (+0.18%) ↑


fischman-bcny commented 2 years ago

@freak4pc hope you had a good trip :) Ping on this PR whenever you get a minute.

freak4pc commented 2 years ago

LGTM! I've made a few minor clean-ups to make it more readable. I'll let CI run and merge it.

fischman-bcny commented 2 years ago

@freak4pc I think you need to approve again for CI to run or merge be unblocked.

igor11191708 commented 2 years ago

Hi, I've also been experimenting with this topic; here's a slightly different take: https://github.com/The-Igor/d3-network-service/blob/main/Sources/d3-network-service/extension/combine/Zipper.swift @fischman-bcny it would be interesting to see what the test shows in terms of performance!

fischman-bcny commented 2 years ago

@The-Igor our implementations appear to do the same thing except that mine uses a while loop where yours uses recursion to repeatedly process quads of inputs.
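For illustration, the two shapes being compared can be sketched as hypothetical Python stand-ins (both real implementations are Swift; these only model the control flow):

```python
def quads_while(items, combine4):
    # While-loop flavor (this PR): one pass per level, constant stack depth.
    while len(items) > 1:
        items = [combine4(items[i:i + 4]) for i in range(0, len(items), 4)]
    return items[0]

def quads_recursive(items, combine4):
    # Recursive flavor (the linked Zipper): one call per level, so stack
    # depth is only O(log4 n) -- still safe even at very large n.
    if len(items) == 1:
        return items[0]
    return quads_recursive(
        [combine4(items[i:i + 4]) for i in range(0, len(items), 4)], combine4)

xs = list(range(100_000))
assert quads_while(xs, sum) == quads_recursive(xs, sum) == 4999950000
```

Because both process quads level by level, neither risks the linear-depth overflow; any performance gap comes from constant factors, as the benchmark numbers below suggest.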

igor11191708 commented 2 years ago

I'm not trying to compete, I'm trying to cooperate 😊 It would be really interesting to see whether there's a difference in performance; you're in possession of the benchmark test. Sometimes slightly fewer instructions can lead to significant changes in performance. As I understand it, the whole point is to improve the current solution as much as possible.

freak4pc commented 2 years ago

Not sure there was any sense of competition - if you have an alternative, feel free to measure it, benchmark it, and share it.

igor11191708 commented 2 years ago

Sorry if I did something wrong

fischman-bcny commented 2 years ago

@The-Igor No offense taken.

Inlining your impl into the test file, bumping numPublishers to 5e6+1, and cloning testZipAtScale with a 2 suffix shows your impl is just a smidge slower in Debug:

 Test Case '-[CombineExtTests.ZipManyTests testZipAtScale]' passed (17.278 seconds).
 Test Case '-[CombineExtTests.ZipManyTests testZipAtScale2]' passed (17.682 seconds).

but nontrivially faster in Release:

Test Case '-[CombineExtTests.ZipManyTests testZipAtScale]' passed (16.262 seconds).
Test Case '-[CombineExtTests.ZipManyTests testZipAtScale2]' passed (13.185 seconds).

(note that if you want to run the tests in Release you'll have to comment out ReplaySubjectTests.swift's contents)

My guess is that this is due to your code using arrays for the quads (and Array.count being faster than the piecemeal guards I have for nils in my tuple-based quads). I'll let @freak4pc & @jasdev speak to whether the increased performance is compelling for the CombineExt project. (I'm just a visitor here, motivated to make these operators not fall over in the presence of large inputs, and less motivated by performance optimizations for these scenarios.)

Happy to do more work in this area if it's compelling for the project. Will proceed with this PR as-is to unblock downstream work (outside this project).

fischman-bcny commented 2 years ago

@freak4pc merge plz? (I don't seem to have permission to)