HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

Uploading a 30 GB file takes too long #329

Closed assaron closed 3 months ago

assaron commented 3 months ago

Hi, I'm not sure whether this issue should be filed here or for https://github.com/HDFGroup/h5pyd but here it is.

I'm running a POSIX-backed HSDS server from the official Docker image with the default configuration, on the same machine as the client (i.e. the HS endpoint is http://localhost:5101).

I'm trying to upload a 34 GB file (https://s3.dev.maayanlab.cloud/archs4/files/mouse_gene_v2.3.h5), and it takes almost two days. The last time I tried, it stopped after 39 hours with an error:

ERROR 2024-03-25 12:15:40,527 ERROR : failed to copy dataset data (slice(160292, 163935, 1),): HTTPConnectionPool(host='localhost', port=5101): Max retries exceeded with url: /datasets/d-36970251-8054a79d-887c-ed0381-9a7ce9/value?select=%5B160292%3A163935%5D&domain=%2Fcounts%2Farchs4%2Fmouse_gene_v2.3.h5 (Caused by ResponseError('too many 503 error responses'))
urllib3.exceptions.ResponseError: too many 503 error responses

Is there a problem with the server set up? Or with the file itself? Previously I had success uploading a similar file without error, but it also took two days or so.

assaron commented 3 months ago

Actually, the 503 errors are probably caused by our IT stress-testing exposed services. But the problem of the long uploads remains.

I wonder, could the chunk size be the problem? Here's the h5dump output for the largest dataset in the file:

$ h5dump -H -p -d '/data/expression' mouse_gene_v2.3.h5
HDF5 "mouse_gene_v2.3.h5" {
DATASET "/data/expression" {
   DATATYPE  H5T_STD_U32LE
   DATASPACE  SIMPLE { ( 53511, 932405 ) / ( 53511, H5S_UNLIMITED ) }
   STORAGE_LAYOUT {
      CHUNKED ( 2000, 1 )
      SIZE 31440200353 (6.348:1 COMPRESSION)
   }
   FILTERS {
      COMPRESSION DEFLATE { LEVEL 4 }
   }
   FILLVALUE {
      FILL_TIME H5D_FILL_TIME_ALLOC
      VALUE  H5D_FILL_VALUE_DEFAULT
   }
   ALLOCATION_TIME {
      H5D_ALLOC_TIME_INCR
   }
}
}
jreadey commented 3 months ago

Yes, the chunk size could be problematic... hsload will iterate over each chunk in the source dataset, and there are more than 24 million chunks in this case.
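
(For a rough count from the h5dump output above: a (2000, 1) chunk over a (53511, 932405) dataspace gives ceil(53511/2000) x 932405 = 27 x 932405, about 25.2 million chunks.)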

If you can do an aws s3 cp and pre-stage the file on a local POSIX disk, that will speed things up greatly.

Another option would be to use the --nodata option to create the scaffold HSDS domain, and then write a custom Python script to copy in the data (see the sketch below). If you can set up an n-way partitioning of the data to be loaded, you can run n scripts in parallel. Since the latency is in fetching the data from S3, you should be able to use a fairly large value of n without overloading HSDS. You can use docker stats to judge how busy the HSDS containers are; if the CPU % is over 90 for long stretches, run HSDS with more containers or use a smaller value of n.
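
A minimal sketch of what one such partition script might look like (the file name, block sizes, and CLI here are illustrative assumptions, not tested code; it presumes the domain was created with --nodata so the datasets already exist with the right shapes and types):

# load_part.py -- hypothetical worker script; each worker gets a disjoint
# half-open row range [row_start, row_stop) of the source dataset
import sys

import h5py    # reads the local source file
import h5pyd   # writes to the HSDS domain (same API as h5py)

row_start, row_stop = int(sys.argv[1]), int(sys.argv[2])

with h5py.File("mouse_gene_v2.3.h5", "r") as src, \
     h5pyd.File("/counts/archs4/mouse_gene_v2.3.h5", "a",
                endpoint="http://localhost:5101") as dst:
    sdset = src["/data/expression"]
    ddset = dst["/data/expression"]
    ncols = sdset.shape[1]

    # copy in 2D blocks: ROW_STEP matches the source chunk height so reads
    # stay chunk-aligned on the HDF5 side; COL_STEP keeps each write around
    # 80 MB of uncompressed u32 data per request
    ROW_STEP, COL_STEP = 2000, 10000
    for i in range(row_start, row_stop, ROW_STEP):
        i1 = min(i + ROW_STEP, row_stop)
        for j in range(0, ncols, COL_STEP):
            j1 = min(j + COL_STEP, ncols)
            ddset[i:i1, j:j1] = sdset[i:i1, j:j1]

Running n copies with disjoint row ranges (e.g. python load_part.py 0 6689, python load_part.py 6689 13378, ...) gives the n-way parallelism; docker stats will show whether n is too aggressive.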

Let us know if either of these approaches helps!

assaron commented 3 months ago

@jreadey What do you mean by pre-stage here? I've downloaded the file locally and run hsload ./mouse_gene_v2.3.h5 /counts/archs4/mouse_gene_v2.3.h5

I'm currently trying to change the chunking with h5repack, but apparently it will also take a while: only 1/10th of the file has been processed in ~1 hour...
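
For reference, an h5repack invocation along these lines rechunks and changes the compression in one pass (the 2000x1000 target chunk shape and GZIP level are illustrative, not the exact settings used):

$ h5repack -l /data/expression:CHUNK=2000x1000 \
           -f /data/expression:GZIP=1 \
           mouse_gene_v2.3.h5 mouse_gene_v2.3.repacked.h5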

jreadey commented 3 months ago

@assaron - Yes, by pre-stage I meant download the file locally. You can use hsload with an s3 source file, but that would be even slower in your case.

h5repack is a reasonable idea, but as you note it takes some time to run as well.

How do you feel about the partitioning idea?

assaron commented 3 months ago

@jreadey Yeah, I have a few files like this, so I can parallelize across them. I can parallelize the repacking as well.

It still feels a bit weird that the repack speed is about 100 MB per minute and is CPU-bound. Apparently compression plays a role: when I set GZIP=1 in the repack filter (instead of the level 4 that was there), I get a twofold improvement in speed (from 60 MB per minute to 120 MB per minute). Removing compression altogether makes it even faster, but the file size increases dramatically.

On the other hand, maybe it's relatively reasonable. Repack needs to decompress and recompress the data, and the uncompressed size is several times larger (6:1 in the example above). So the speed is actually 500-600 MB per minute of uncompressed data, which is slower than just gzip -1 on that data, but only by a factor of a few, not an order of magnitude.

@jreadey, thanks for your help. I'm closing the issue, as it's not really HSDS server related. But I wonder whether repack-style rechunking filters could be added to h5pyd. I imagine that changing the layout on the fly would reduce the number of HSDS API calls in this situation. Also, when I do a repack first, it has to compress the data again, whereas for uploading to the HSDS server the data could be sent uncompressed (if the network is fast enough), which would also save some time.

jreadey commented 3 months ago

Ok, thanks. FYI - with hsload the data flow will be: client uncompresses -> binary transfer of uncompressed data -> HSDS compresses. Potentially it would be faster to just send the compressed data to HSDS, but hsload is also taking the opportunity to re-chunk to a larger chunk size... by default HSDS will scale up chunks to hit 2-8 MB per chunk. Feel free to reopen if you have more questions. Also, you may find posting to the HDF Forum useful.
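
(To make the scaling concrete: a (2000, 1) chunk of H5T_STD_U32LE is 2000 x 4 bytes = 8 KB, so reaching the 2-8 MB target takes roughly 256-1024 times more elements per chunk, e.g. expanding the layout to something like (2000, 500) for ~4 MB per chunk; the exact shape HSDS picks is an assumption here.)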

assaron commented 3 months ago

Oh, so the HSDS server also changes the chunk size. In that case it would make even more sense to combine multiple chunks per request in hsload to decrease the number of requests (actually, in the beginning I assumed it already did this).

Also, I realized that hsload can add compression for uncompressed chunks with the -z option, so I can repack to an uncompressed file first and then add compression with hsload. Not sure if it increases the speed overall though: repacking to an uncompressed file still runs at 800 MB per minute (of uncompressed data, so it's still around 100-200 MB per minute of compressed data).
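
(Assuming -z takes a gzip level as its argument, that second step would look something like the following, with the uncompressed filename being illustrative:

$ hsload -z 4 ./mouse_gene_v2.3.uncompressed.h5 /counts/archs4/mouse_gene_v2.3.h5
)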

I'll play a bit more and probably will create a new issue for h5pyd then.