Closed: bweasels closed this issue 5 months ago
There are an awful lot of GET calls in there that, for a whole-array-write operation, are totally unnecessary. Even if the array already exists, a single read of all the metadata pieces should suffice. I wonder whether we can provide a better directory-listing caching experience around this. There is, separately, talk of using a transactional in-memory cache specifically for zarr metadata files (uploaded when finished), which would help a lot too. It is already possible to provide separate metadata and data storage backends in zarr.
I mention this because, while I don't know what the specific problem is, I can only assume that the total number of requests/coroutines is implicated, via something deep within asyncio.
One lever you could pull on is the fsspec config setting conf["nofiles_gather_batch_size"] (the default is given by fsspec.asyn._NOFILES_DEFAULT_BATCH_SIZE = 1280); try setting it to a smaller value.
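Something like this, before kicking off the uploads (128 is only an example value, not a recommendation):

import fsspec

# fsspec consults this value when it gathers batches of no-data coroutines,
# so set it before starting the uploads; smaller means fewer concurrent requests.
fsspec.config.conf["nofiles_gather_batch_size"] = 128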
If there are really requests being made with zero data, we should be able to find out where that's happening and go from there. Perhaps there is a race condition where all the data of a chunk is sent successfully, but the sending function subsequently errors. This would be in gcsfs.core.simple_upload.
"every batch of 10 chunks" - is this the number of zarr chunks in a dask partition, or where else does this number come from?
Thanks for the fast reply! This number is the number of images you want to hold in memory before writing them to the bucket, so it's user-defined, really.
WRT the large number of GET calls for every batch upload: while trying to debug the race condition, I tried reconnecting to the zarr store on the bucket every time I wrote a set of 10 images, to see if the error was related to the connection going stale (idk, I'm a scientist, not a networking guy). Removing that re-connection call removes the stack of GET calls.
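For anyone following along, a minimal sketch of that change (hypothetical bucket path, not the exact code from our pipeline): connect once and reuse the same store, rather than re-opening it per batch.

import gcsfs

# Connect once and keep reusing the same filesystem/store; re-opening the
# zarr store on every batch re-lists the metadata and triggers the extra GETs.
fs = gcsfs.GCSFileSystem()
store = fs.get_mapper("my-bucket/experiment.zarr")  # hypothetical path

# ... write each batch of 10 images through `store` without recreating it ...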
Thanks for the pointer on gcsfs.core.simple_upload; I'll see if I can explore it to do some debugging for my weird case. If it'll help, I'll try to make a minimal reproducible example this weekend.
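In case it helps anyone else digging into this, turning on gcsfs's debug logging (plain standard-library logging, nothing specific to this issue) should show which call precedes the hang:

import logging

# Emit gcsfs's internal debug messages (request details, retries, etc.) to the
# console, which makes it easier to see the last call before things stall.
logging.basicConfig(level=logging.INFO)
logging.getLogger("gcsfs").setLevel(logging.DEBUG)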
I was able to manually trace it back to _request on line 412 in gcsfs.core. It seems like the

async with self.session.request( ... ) as r:

command on line 416 may be where it fails prior to going into the race condition. The data object going into that command prior to failure is not empty (<gcsfs.core.UnclosableBytesIO object at 0x0000021B6ACB7BF0> with a non-zero size from getvalue), so I'm guessing it's something in self.session.request? That said, it seems like self.session.request comes from another package (aiohttp?), so I ran out of steam and stopped pursuing it. Given that this is the first you're seeing of this, it may be specific to my situation, but maybe this thread can help if someone else hits the same problem. I chatted with the lab and we'll pursue a different, slower uploading scheme to get around this. Thanks again for your help!
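(For reference, the session there does appear to be an aiohttp ClientSession, so the call being traced is roughly the standard aiohttp request idiom. A standalone sketch under that assumption, not gcsfs's actual code:)

import asyncio
import aiohttp

async def main():
    async with aiohttp.ClientSession() as session:
        # gcsfs awaits session.request(...) with the upload body passed as `data`;
        # any exception raised while sending surfaces from this context manager.
        async with session.request("GET", "https://example.org") as r:
            body = await r.read()
            print(r.status, len(body))

asyncio.run(main())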
I hope you are right, but it's good to have this information here for others anyway.
Hi all - feel free to let me know if you think this may actually be a zarr issue, but I'm having a very frustrating bug while uploading zarr files to a Google Cloud bucket with gcsfs.
Context

We have many terabytes of microscopy imaging data (on the order of 500 fov x 14 cycles) on a storage server that we need to upload to a Google Cloud bucket for analysis with our VM. We have lab members who need to upload data from their own experiments, and our workstation is generally reserved for data acquisition. This unfortunately means that lab members would ideally run the upload from their own machines, where the experiments are too large to load entirely into memory or to save to disk prior to upload. As a result, I do the following:
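Roughly: open a zarr array in the bucket through gcsfs, hold 10 images in memory, write them as one slab, and repeat. A hypothetical sketch of that pattern (made-up bucket path, sizes, and image loader, not my exact code):

import gcsfs
import numpy as np
import zarr

n_images, height, width = 500 * 14, 2048, 2048  # made-up sizes

def load_image(i):
    # stand-in for reading one field of view off the acquisition server
    return np.zeros((height, width), dtype="uint16")

# Connect to the bucket once and create (or open) the target array.
fs = gcsfs.GCSFileSystem()
store = fs.get_mapper("my-bucket/experiment.zarr")  # made-up path
z = zarr.open(
    store,
    mode="a",
    shape=(n_images, height, width),
    chunks=(1, height, width),
    dtype="uint16",
)

# Stream the data up ten images at a time so only ten are ever held in memory.
for start in range(0, n_images, 10):
    batch = np.stack([load_image(i) for i in range(start, min(start + 10, n_images))])
    z[start : start + len(batch)] = batch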
Problem

After anywhere from 2 to 1000 successful chunk writes to the server, the Jupyter notebook becomes unresponsive, uses a ton of CPU, and can only be recovered by restarting the kernel. Occasionally it will spit out the following error message (repeated thousands of times), and other times it will fail silently.
Troubleshooting

I've done the following:
- Used try except to retry the connection on failure (the try except doesn't catch the callback error)

On file open:
For every batch of 10 chunks written it repeats the following:
And just prior to failure it writes out this (it seems to be the same as all the other successful chunk writes):
For reference, the <bucket name> and <filename> are the actual bucket & filename in the error message. Please let me know if you need any more information, feel that this is not a gcsfs-related issue, or see anything glaringly wrong in how I'm handling the upload. Thanks in advance!