Thanks for the feedback Gabe, I'll try the new version of stackstac and your suggested configuration at some point in the near future.
By the way, all the benchmarking tooling is documented here:
https://odc-stac.readthedocs.io/en/latest/benchmarking.html
And the stackstac-specific parts are here:
If you see any issues with the way the benchmarking code interfaces with stackstac, let me know; PRs are welcome too.
@gjoseph92 I have now tried the latest code with the `(-1, 1, 2048, 2048)` chunking regime. While there is some improvement in completion time, it depends on image size/read scale.

Run time improves from 165.9 s to 124.5 s, or in throughput terms from 18.1 to 24.1 Mpix/sec. This is about 25% faster than the version I tested a few months back.
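For reference, a minimal sketch of the stack configuration this run corresponds to (not the actual benchmark harness; `items` is a stand-in for the STAC items of the s2-ms-mosaic scenario, obtained however you like, e.g. via pystac-client):

```python
import stackstac

# `items` = list of STAC items for the 2020-06-06 Sentinel-2 scenario (placeholder).
stack = stackstac.stack(
    items,
    epsg=32735,                      # output CRS reported by the benchmark below
    resolution=10,                   # native 10 m resolution
    chunksize=(-1, 1, 2048, 2048),   # (time, band, y, x): fuse all items per spatial tile
)
```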
============================================================
Will write results to: s2-ms-mosaic_2020-06-06--P1D_20220217T014246.297250.pkl
method : stackstac
Scenario : s2-ms-mosaic_2020-06-06--P1D
T.slice : 2020-06-06
Data : 1.3.90978.10980.uint16, 5.58 GiB
Chunks : 1.1.2048.2048 (T.B.Y.X)
GEO : epsg:32735
| 10, 0, 499980|
| 0,-10, 9200020|
Cluster : 1 workers, 4 threads, 32.00 GiB
------------------------------------------------------------
T.Elapsed : 127.712 seconds
T.Submit : 0.424 seconds
Throughput : 23.465 Mpx/second (overall)
| 5.866 Mpx/second (per thread)
------------------------------------------------------------
T.Elapsed : 126.062 seconds
T.Submit : 0.132 seconds
Throughput : 23.772 Mpx/second (overall)
| 5.943 Mpx/second (per thread)
------------------------------------------------------------
T.Elapsed : 124.488 seconds
T.Submit : 0.243 seconds
Throughput : 24.073 Mpx/second (overall)
| 6.018 Mpx/second (per thread)
============================================================
Submit time is much faster, down from about 2 seconds to roughly 0.13 seconds, which is consistent with the smaller Dask graph this approach should generate. There are probably memory-use improvements as well, but my tooling doesn't track that yet.
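One rough way to see where the submit-time win comes from (not part of the benchmark tooling, just a sketch using the `stack` from above) is to count the tasks in the Dask graph before computing:

```python
# Dask-backed xarray objects expose the dask collection protocol; the number of keys
# in the graph is roughly what the scheduler has to serialize at submit time.
n_tasks = len(stack.__dask_graph__())
print(f"{n_tasks} tasks in the graph")
```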
So in this case (reading the 1/8 overview with a 2048 chunk size) run time went down from 29.3 s to 27.5 s. That's about 6% better, so I guess at the lower read scale data-loading costs dominate over mosaic building and the improvement is less pronounced.
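For context, the "1/8 overview" case corresponds to requesting 80 m output resolution (native is 10 m), which lets GDAL satisfy the reads from the 1/8 overview level. A sketch, reusing the hypothetical `items` from above:

```python
low_res = stackstac.stack(
    items,
    epsg=32735,
    resolution=80,                   # 8x coarser than the native 10 m -> 1/8 overview reads
    chunksize=(-1, 1, 2048, 2048),
)
```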
============================================================
Will write results to: s2-ms-mosaic_2020-06-06--P1D_20220217T013436.959713.pkl
method : stackstac
Scenario : s2-ms-mosaic_2020-06-06--P1D
T.slice : 2020-06-06
Data : 1.3.11373.1374.uint16, 89.42 MiB
Chunks : 1.1.2048.1374 (T.B.Y.X)
GEO : epsg:32735
| 80, 0, 499920|
| 0,-80, 9200080|
Cluster : 1 workers, 4 threads, 32.00 GiB
------------------------------------------------------------
T.Elapsed : 29.068 seconds
T.Submit : 0.200 seconds
Throughput : 1.613 Mpx/second (overall)
| 0.403 Mpx/second (per thread)
------------------------------------------------------------
T.Elapsed : 31.943 seconds
T.Submit : 0.013 seconds
Throughput : 1.468 Mpx/second (overall)
| 0.367 Mpx/second (per thread)
------------------------------------------------------------
T.Elapsed : 27.461 seconds
T.Submit : 0.011 seconds
Throughput : 1.707 Mpx/second (overall)
| 0.427 Mpx/second (per thread)
============================================================
👋 Hey, just happened to stumble across this. I know you're not done with the final report yet, but just wanted to say a huge thank-you for doing this work—this is really valuable to have!
I'm not at all surprised to see that stackstac's performance was much worse for the "wide" case. That's come up before, and I'm realizing it's just a fundamental tradeoff that comes from representing the entire stack as a dense array. You get a lot of expressiveness from doing this immediately, and having the grouping/compositing/mosaicing steps happen in xarray syntax, but the downside is what you've shown here.
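For example, a minimal sketch of what compositing "in xarray syntax" looks like, assuming `stack` is the lazy (time, band, y, x) DataArray returned by `stackstac.stack`:

```python
import stackstac

median_composite = stack.median(dim="time")    # plain xarray reduction over time
mosaic = stackstac.mosaic(stack, dim="time")   # first-valid-pixel mosaic along time
```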
With odc-stac, you're pushing the compositing operation into the data-loading phase, which allows you to be much smarter (as you discuss). There's talk about doing this in stackstac too: https://github.com/gjoseph92/stackstac/issues/66. I'd be curious, though, what the results look like using the latest additions to stackstac (on main, but not yet released):
Particularly, passing `chunksize=(-1, 1, N, N)` to `stackstac.stack` will mean that all the items within one spatial tile are loaded at once, instead of each item/spatial-tile combo being tracked as its own chunk, which is what leads to the current explosion of graph size you mentioned. Combined with slightly better mosaic logic, I'm hoping this will improve performance a bit. I'd still expect odc-stac to be faster for the wide case, even with this, but I'm hoping stackstac will be less bad.
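To make the graph-size point concrete, a back-of-the-envelope count under assumed numbers (the item count is hypothetical; the grid size is taken from the full-resolution benchmark above):

```python
import math

n_items, tile = 90, 2048                 # hypothetical scene count; 2048-pixel tiles
ny, nx = 90978, 10980                    # output grid from the benchmark report
tiles = math.ceil(ny / tile) * math.ceil(nx / tile)   # 270 spatial tiles

per_item = n_items * tiles               # one chunk per item/spatial-tile combo (as described above)
fused = tiles                            # items fused along time: one chunk per spatial tile
print(per_item, fused)                   # 24300 vs 270 read chunks
```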