fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License

`mzz.translate()` much slower in 0.0.6 #145

Closed lsterzinger closed 2 years ago

lsterzinger commented 2 years ago

I'm helping a colleague use kerchunk to access GOES-16 satellite data on AWS. At first I was running Kerchunk 0.0.5 and combining 288 references took a few seconds, nothing special.

We updated to 0.0.6 because there was an issue with some of the coordinates, and I figured we'd benefit from the MultiZarrToZarr update. Now I've let it run for over 10-15 minutes and it never finishes. I'm not quite sure what's going on, or whether I'm missing some important optimization somewhere.

Here is the notebook: https://nbviewer.org/gist/lsterzinger/37c035542f7fac8f7e02a7c736877b60

martindurant commented 2 years ago

It's totally possible that the new MZZ is much slower than the previous one; I didn't make a great effort to optimise for performance. However, it's also quite likely that there's something else going on: for example, maybe many <500-byte chunks are being inlined, and fetching them is currently a serial process, so many such chunks really add up.

The thing to do would be to try with, say, 3 inputs and then 10, and see the time it takes, and also of course to profile (snakeviz, and lprun on MultiZarrToZarr.second_pass) so we know where time is being spent.
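
For reference, here is a minimal profiling sketch along those lines (a sketch only: it assumes an already-constructed MultiZarrToZarr instance named mzz, and that translate() exercises second_pass internally):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
mzz.translate()          # the combine step we want to profile
profiler.disable()

# Dump for snakeviz ("snakeviz mzz_profile.prof") and also print a summary,
# restricted to cat calls inside fsspec's reference.py.
profiler.dump_stats("mzz_profile.prof")
pstats.Stats(profiler).sort_stats("cumulative").print_stats("reference.py.*cat")

# In IPython, line-by-line timing of the hot method:
#   %load_ext line_profiler
#   %lprun -f MultiZarrToZarr.second_pass mzz.translate()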

lsterzinger commented 2 years ago

3 files: ~30 seconds
10 files: ~1 min 30 seconds

So this appears to scale linearly.

Running a profile on mzz.second_pass() (second_pass.zip) shows that nearly all of the time is spent in ~:0(<method 'acquire' of '_thread.lock' objects>), which doesn't really tell us anything... I've run into this before and I don't really know how to investigate further.

martindurant commented 2 years ago

That looks very much like it's fetching data, which would be for inlining. Please see how different it is if you set inline_threshold=0. I can fix the big latency, though. It would be interesting to know how many times cat is being called here.

martindurant commented 2 years ago

Oh, I see: 2071 calls, at about 43ms each - sounds like pure latency, so we can easily do 100 of these calls simultaneously.

lsterzinger commented 2 years ago

So this is actually with inline_threshold=0... reference.py:246(cat) is being called 2071 times.

martindurant commented 2 years ago

MultiZarrToZarr.second_pass only calls cat here, which only happens for a file size smaller than the threshold. Are you sure you are passing it correctly?

In any case, the following is, I think, enough to get one order of magnitude of concurrency:

--- a/kerchunk/combine.py
+++ b/kerchunk/combine.py
@@ -274,6 +274,7 @@ class MultiZarrToZarr:
         no_deps = None

         for i, fs in enumerate(self.fss):
+            to_download = {}
             m = fs.get_mapper("")
             z = zarr.open(m)

@@ -367,9 +368,13 @@ class MultiZarrToZarr:
                     key = key.rstrip(".")

                     if fs.info(fn)["size"] < self.inline:
-                        self.out[key] = fs.cat(fn)
+                        to_download[key] = fn
                     else:
                         self.out[key] = fs.references[fn]
+            if to_download:
+                bits = fs.cat(list(to_download.values()))
+                for key, fn in to_download.items():
+                    self.out[key] = bits[fn]
         self.done.add(3)

Even better would be to group these reads by "remote fs" (the .fs attribute of each of the mzz's fss).
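
A rough sketch of that grouping idea (not the actual kerchunk code): it assumes a hypothetical pending list of (reference_fs, output_key, reference_key) triples gathered while walking the inputs, that each small chunk's reference is a full [url, offset, size] entry, and that the remote filesystems support fsspec's cat_ranges:

from collections import defaultdict

def fetch_small_chunks(pending):
    # Group the small chunks by the underlying remote filesystem
    # (the .fs attribute of each reference filesystem).
    grouped = defaultdict(list)
    for ref_fs, out_key, ref_key in pending:
        url, offset, size = ref_fs.references[ref_key]
        grouped[ref_fs.fs].append((out_key, url, offset, offset + size))

    out = {}
    for remote_fs, items in grouped.items():
        keys, urls, starts, ends = zip(*items)
        # One batched fetch per remote filesystem; for async filesystems
        # like s3fs, the underlying requests run concurrently.
        blobs = remote_fs.cat_ranges(list(urls), list(starts), list(ends))
        for key, blob in zip(keys, blobs):
            out[key] = blob
    return out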

martindurant commented 2 years ago

I can confirm that using

mzz = MultiZarrToZarr(
    metadata_list,
    remote_protocol='s3',
    remote_options = {'anon' : True},
    concat_dims='t',
    inline_threshold=0
)

reduces processing time to a trivial 90ms for 10 inputs.

lsterzinger commented 2 years ago

Yep, that was my mistake... the initial individual reference list was generated with inline_threshold=0, but I neglected to set the same in MultiZarrToZarr.

Is the 90ms result from the release version or the changes from https://github.com/fsspec/kerchunk/issues/145#issuecomment-1095074674?

martindurant commented 2 years ago

Correct - there were no cat fetches at all, so that code is not touched. It should still be done, though, since inlining is a good thing for write-once/read-many.

lsterzinger commented 2 years ago

Is the value of inline_threshold the number of bytes? Do you have a value you usually use that balances processing time with read time?

martindurant commented 2 years ago

Yes, bytes, and... it depends! The main bottleneck is usually the size of the output JSON, so for the end user it's a balance of dataset open time versus access time.
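
To illustrate the trade-off, a hypothetical excerpt of a combined reference set (made-up keys, paths, and byte ranges; the base64 form is how small binary chunks are typically embedded):

refs = {
    # Above inline_threshold: kept as [url, offset, size]; the JSON stays
    # small, but every read of this chunk costs one ranged GET against S3.
    "Rad/0.0.0": ["s3://noaa-goes16/ABI-L2-example/file.nc", 31424, 48213],
    # Below inline_threshold: the bytes are embedded directly in the JSON
    # (base64-encoded when not valid text); no request at read time,
    # but the output JSON grows with every inlined chunk.
    "t/0": "base64:AAAAAACo10E=",
}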

martindurant commented 2 years ago

So, in short, I don't think there's a regression here, but there is more we can do to speed up inlining when it is desired. We can start with my diff, but we should be able to do better.

lsterzinger commented 2 years ago

Awesome, thanks! I'll try things out with your diff sometime this week

lsterzinger commented 2 years ago

Finally got around to testing the changes you made to mzz @martindurant. Running this code from the notebook linked in the original issue:

%%time
mzz = MultiZarrToZarr(
    metadata_list,
    remote_protocol='s3',
    remote_options = {'anon' : True},
    concat_dims='t',
    inline_threshold=300
)

with open('combined.json', 'w') as outf:
    outf.write(ujson.dumps(mzz.translate()))

Prior to the changes in https://github.com/fsspec/kerchunk/issues/145#issuecomment-1095074674:

CPU times: user 2min 3s, sys: 11.5 s, total: 2min 15s
Wall time: 33min 51s

Including your changes:

CPU times: user 59.9 s, sys: 4.41 s, total: 1min 4s
Wall time: 4min 19s

So the time savings are significant!

edit: should mention that this was combining 288 references

martindurant commented 2 years ago

Is this now comparable to before?

I have yet to implement a decent version of the concurrent fetching I made a POC of...

lsterzinger commented 2 years ago

I don't have any data on what this would have looked like before, but my recollection is that it never took longer than 1 minute in any use case I had, so this still feels slower.

martindurant commented 2 years ago

OK, well maybe that's OK, since we are doing more work than before, checking the coords of every input rather than guessing them (you could provide them explicitly instead, if you have prior information...). So I will close this now.