Thanks for the feedback Gabe, I'll try the new version of stackstac and your suggested configuration at some point in the near future.
By the way, all the benchmarking tooling is documented here:
https://odc-stac.readthedocs.io/en/latest/benchmarking.html
And the stackstac-specific parts are here:
If you see any issues with the way the benchmarking code interfaces with stackstac, let me know; PRs are welcome too.
@gjoseph92 I have now tried the latest code with the `(-1, 1, 2048, 2048)` chunking regime. While there is some improvement in completion time, it depends on image size/read scale.

Run time improves from 165.9 s to 124.5 s, or in throughput terms from 18.1 to 24.1 Mpix/sec. This is about 25% faster than the version I tested a few months back.
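For reference, a minimal sketch of the stack configuration this run corresponds to (not the actual benchmark harness; `items` is a stand-in for the STAC items of the s2-ms-mosaic scenario, obtained however you like, e.g. via pystac-client):

```python
import stackstac

# `items` = list of STAC items for the 2020-06-06 Sentinel-2 scenario (placeholder).
stack = stackstac.stack(
    items,
    epsg=32735,                      # output CRS reported by the benchmark below
    resolution=10,                   # native 10 m resolution
    chunksize=(-1, 1, 2048, 2048),   # (time, band, y, x): fuse all items per spatial tile
)
```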
============================================================
Will write results to: s2-ms-mosaic_2020-06-06--P1D_20220217T014246.297250.pkl
method : stackstac
Scenario : s2-ms-mosaic_2020-06-06--P1D
T.slice : 2020-06-06
Data : 1.3.90978.10980.uint16, 5.58 GiB
Chunks : 1.1.2048.2048 (T.B.Y.X)
GEO : epsg:32735
| 10, 0, 499980|
| 0,-10, 9200020|
Cluster : 1 workers, 4 threads, 32.00 GiB
------------------------------------------------------------
T.Elapsed : 127.712 seconds
T.Submit : 0.424 seconds
Throughput : 23.465 Mpx/second (overall)
| 5.866 Mpx/second (per thread)
------------------------------------------------------------
T.Elapsed : 126.062 seconds
T.Submit : 0.132 seconds
Throughput : 23.772 Mpx/second (overall)
| 5.943 Mpx/second (per thread)
------------------------------------------------------------
T.Elapsed : 124.488 seconds
T.Submit : 0.243 seconds
Throughput : 24.073 Mpx/second (overall)
| 6.018 Mpx/second (per thread)
============================================================
Submit time is much faster, down from about 2 seconds to roughly 0.13 seconds, which is consistent with the smaller Dask graph this approach should generate. There are probably memory-use improvements as well, but my tooling doesn't track that yet.
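One rough way to see where the submit-time win comes from (not part of the benchmark tooling, just a sketch using the `stack` from above) is to count the tasks in the Dask graph before computing:

```python
# Dask-backed xarray objects expose the dask collection protocol; the number of keys
# in the graph is roughly what the scheduler has to serialize at submit time.
n_tasks = len(stack.__dask_graph__())
print(f"{n_tasks} tasks in the graph")
```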
So in this case (reading the 1/8 overview with a 2048 chunk size) run time went down from 29.3 s to 27.5 s. That's about 6% better, so I guess at the lower read scale data-loading costs dominate over mosaic building and the improvement is less pronounced.
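For context, the "1/8 overview" case corresponds to requesting 80 m output resolution (native is 10 m), which lets GDAL satisfy the reads from the 1/8 overview level. A sketch, reusing the hypothetical `items` from above:

```python
low_res = stackstac.stack(
    items,
    epsg=32735,
    resolution=80,                   # 8x coarser than the native 10 m -> 1/8 overview reads
    chunksize=(-1, 1, 2048, 2048),
)
```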
============================================================
Will write results to: s2-ms-mosaic_2020-06-06--P1D_20220217T013436.959713.pkl
method : stackstac
Scenario : s2-ms-mosaic_2020-06-06--P1D
T.slice : 2020-06-06
Data : 1.3.11373.1374.uint16, 89.42 MiB
Chunks : 1.1.2048.1374 (T.B.Y.X)
GEO : epsg:32735
| 80, 0, 499920|
| 0,-80, 9200080|
Cluster : 1 workers, 4 threads, 32.00 GiB
------------------------------------------------------------
T.Elapsed : 29.068 seconds
T.Submit : 0.200 seconds
Throughput : 1.613 Mpx/second (overall)
| 0.403 Mpx/second (per thread)
------------------------------------------------------------
T.Elapsed : 31.943 seconds
T.Submit : 0.013 seconds
Throughput : 1.468 Mpx/second (overall)
| 0.367 Mpx/second (per thread)
------------------------------------------------------------
T.Elapsed : 27.461 seconds
T.Submit : 0.011 seconds
Throughput : 1.707 Mpx/second (overall)
| 0.427 Mpx/second (per thread)
============================================================
👋 Hey, just happened to stumble across this. I know you're not done with the final report yet, but just wanted to say a huge thank-you for doing this work—this is really valuable to have!
I'm not at all surprised to see that stackstac's performance was much worse for the "wide" case. That's come up before, and I'm realizing it's just a fundamental tradeoff that comes from representing the entire stack as a dense array. You get a lot of expressiveness from doing this immediately, and having the grouping/compositing/mosaicing steps happen in xarray syntax, but the downside is what you've shown here.
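For example, a minimal sketch of what compositing "in xarray syntax" looks like, assuming `stack` is the lazy (time, band, y, x) DataArray returned by `stackstac.stack`:

```python
import stackstac

median_composite = stack.median(dim="time")    # plain xarray reduction over time
mosaic = stackstac.mosaic(stack, dim="time")   # first-valid-pixel mosaic along time
```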
With odc-stac, you're pushing the compositing operation into the data-loading phase, which allows you to be much smarter (as you discuss). There's talk about doing this in stackstac too: https://github.com/gjoseph92/stackstac/issues/66. I'd be curious, though, what the results look like using the latest additions to stackstac (on main, but not yet released):
Particularly, passing `chunksize=(-1, 1, N, N)` to `stackstac.stack` will mean that all the items within one spatial tile are loaded at once, instead of each item/spatial-tile combo being tracked as its own chunk, which is what leads to the current explosion of graph size you mentioned. Combined with slightly better mosaic logic, I'm hoping this will improve performance a bit. I'd still expect odc-stac to be faster for the wide case, even with this, but I'm hoping stackstac will be less bad.
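To make the graph-size point concrete, a back-of-the-envelope count under assumed numbers (the item count is hypothetical; the grid size is taken from the full-resolution benchmark above):

```python
import math

n_items, tile = 90, 2048                 # hypothetical scene count; 2048-pixel tiles
ny, nx = 90978, 10980                    # output grid from the benchmark report
tiles = math.ceil(ny / tile) * math.ceil(nx / tile)   # 270 spatial tiles

per_item = n_items * tiles               # one chunk per item/spatial-tile combo (as described above)
fused = tiles                            # items fused along time: one chunk per spatial tile
print(per_item, fused)                   # 24300 vs 270 read chunks
```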