HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

High memory usage spikes by the service node #141

Closed TManhente closed 2 years ago

TManhente commented 2 years ago

When handling a /datasets/{:dsId}/value request, the service node shows memory allocation spikes many times larger than the size of the requested data. This may cause the service node to exceed its memory limit and have its process terminated.

I ran into this issue while trying to set up an HSDS instance to serve files with datasets of about 4.5 GiB. As the users access HSDS through libraries like xarray and h5netcdf, which may send requests for the full dataset value, we experienced 500 and 503 errors due to the service node process being terminated. I had to configure the service node container with 24 GiB of memory so that it could handle a single request for the 4.5 GiB dataset without being terminated.

I've created a dummy example (hsds-memory-usage-peaks.zip) to reproduce this error and recorded the memory usage of the service node process (highlighted in yellow at the terminal's top split showing the htop output) while sending a request:

https://user-images.githubusercontent.com/4401054/166116657-662034bd-000d-4ba3-b8d3-899cb383a46d.mp4

Time | SN Memory
-----|-----------
  0s |   924 MiB
  1s | 5,114 MiB
 13s | 9,302 MiB
 14s |  17.3 GiB  ← Peak: ~3x the dataset size
 16s |  13.2 GiB
 17s |   921 MiB

I've also noticed that the peak memory goes down slowly as the client downloads the response body, but stays high until the download finishes. This behaviour can be seen by throttling the client download speed:

https://user-images.githubusercontent.com/4401054/166117067-0b305dd9-2fc3-4a6f-92a8-ce5fd87221cb.mp4

Time | SN Memory
-----|-----------
  0s |   921 MiB
  0s | 5,111 MiB
 11s | 9,302 MiB
 12s |  13.2 GiB
 14s |  17.3 GiB  ← Peak: ~3x the dataset size
 15s |  13.2 GiB
 22s |  11.1 GiB
 32s |  10.1 GiB
 38s |   9.6 GiB
 40s | 9,562 MiB
 43s |   921 MiB
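For reference, the throttled read can be reproduced from Python with something like the sketch below (the endpoint, dataset id, and usr01 credentials are the ones from the attached dummy example; it assumes the requests package is installed):

import time
import requests

url = "http://localhost:5101/datasets/d-2bf27178-d97f130f-925f-4b3ea4-b0bce0/value"
params = {"domain": "/home/example.h5"}
headers = {"Accept": "application/octet-stream"}

with requests.get(url, params=params, headers=headers,
                  auth=("usr01", "usr01pwd"), stream=True) as rsp:
    rsp.raise_for_status()
    total = 0
    for chunk in rsp.iter_content(chunk_size=1024 * 1024):  # read 1 MiB at a time
        total += len(chunk)
        time.sleep(0.01)  # artificially slow the client so the server holds the response longer
print(f"downloaded {total} bytes")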

I've run these tests on CentOS 7 machines (both physical and in WSL), using both the v0.6.3 and v0.7beta13 HSDS Docker images published on Docker Hub.

Is it possible that the HSDS service node is creating unneeded copies of the data, or keeping references to data that is no longer needed in a wider scope than necessary, delaying its disposal?

jreadey commented 2 years ago

Hey,

Thanks for the interesting issue report!

In your testing, did you change the max_request_size (default 100m) in HSDS? If I run your test with the default, I get a 413 error:

 Maximum request body size 4394000000 exceeded, actual body size 104857600 

The idea of max_request_size is to keep the container from running out of memory on these types of large requests. The h5pyd package knows to paginate large requests when it sees this error, but curl or REST VOL requests will fail on the 413 response.

So I guess you are looking to use a client other than Python or you wouldn't have run into this... You can do the same kind of pagination logic in other languages, but I realize it's a bit tedious.
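For illustration, the same kind of pagination sketched in plain Python against the REST API (this is not h5pyd's actual logic; the dataset id, domain, and credentials are taken from the dummy example in this thread, and the page size assumes the default 100m limit):

import numpy as np
import requests

endpoint = "http://localhost:5101"
dset_id = "d-2bf27178-d97f130f-925f-4b3ea4-b0bce0"   # from the dummy example
domain = "/home/example.h5"
shape = (1300, 1300, 1300)
dtype = np.dtype("<u2")

# Pick a slab of rows along the first dimension that keeps each request
# under the server's max_request_size (100 MiB by default).
max_request_size = 100 * 1024 * 1024
bytes_per_row = int(np.prod(shape[1:])) * dtype.itemsize
rows_per_page = max(1, max_request_size // bytes_per_row)

arr = np.empty(shape, dtype=dtype)
for start in range(0, shape[0], rows_per_page):
    stop = min(start + rows_per_page, shape[0])
    select = f"[{start}:{stop}:1,0:{shape[1]}:1,0:{shape[2]}:1]"
    rsp = requests.get(
        f"{endpoint}/datasets/{dset_id}/value",
        params={"domain": domain, "select": select},
        headers={"Accept": "application/octet-stream"},
        auth=("usr01", "usr01pwd"),
    )
    rsp.raise_for_status()
    arr[start:stop] = np.frombuffer(rsp.content, dtype=dtype).reshape(stop - start, *shape[1:])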

Looking for the cause of the 3x memory allocation, here's rough pseudo-code of how the GET /datasets/{id}/value request works (assuming the max_request_size value has been increased):

 def GET_Value(request):
     arr = await getSelectionData()
     def getSelectionData():
          arr = await doReadSelection()
          def doReadSelection():
              arr = np.zeros()  # allocates 4GB of data
              def chunkCrawler(arr):
                  def ChunkCrawler.read_chunk_hyperslab(chunkid):
                      # this runs once per chunkid, 16 * node_count max tasks at any one time
                      chunk_slice = await http_get(dn_req)   # allocates one chunk's worth of memory
                      arr[x1:x2, y1:y2, z1:z2] = chunk_slice  # chunk_slice should be free to be gc'd now
     # back at GET_Value
     data = arrayToBytes(arr)  # just ndarray.tobytes() for non-vlen data
     await resp.write(data)
     return

So it would seem to need 2x the array size in memory for the request. The tobytes() function unfortunately copies the data. There used to be a get_buffer() method in numpy, but it got deprecated. I could even see 3x if the chunk_slice is not freed immediately, but that is not my understanding of how GC works in Python.
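To make the 2x accounting concrete, here's a small stand-alone sketch (not HSDS code; it assumes psutil is installed and needs roughly 9 GB of RAM to run as written) that mimics the allocate-then-serialize pattern above and watches the process RSS:

import numpy as np
import psutil

proc = psutil.Process()

def rss_gib():
    return proc.memory_info().rss / 1024**3

print(f"start:          {rss_gib():.1f} GiB")
arr = np.zeros((1300, 1300, 1300), dtype="<u2")  # ~4.1 GiB, like the np.zeros() in doReadSelection()
arr[:] = 1                                       # touch the pages so they are actually resident
print(f"after np.zeros: {rss_gib():.1f} GiB")
data = arr.tobytes()                             # full copy, like arrayToBytes() -> ~2x peak
print(f"after tobytes:  {rss_gib():.1f} GiB")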

Anyway, I'm currently working on a change that will stream requests. So rather than waiting for the entire response to be assembled on the server and then writing it out, the data will be sent continuously (like pagination with h5pyd, but on the server side). That should reduce the memory spikes, and we should also be able to do away with the max_request_size limit altogether. Another advantage is that the client won't need to wait as long for the first bytes to come back from the server and could potentially start doing useful work while the rest of the response streams in.
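As a rough illustration of the idea (this is not the actual HSDS change; a minimal aiohttp sketch in which selection_pages() and read_page() are hypothetical helpers that split the selection into pages and fetch one page's worth of chunks as bytes):

from aiohttp import web

async def get_value_streamed(request):
    resp = web.StreamResponse()
    resp.content_type = "application/octet-stream"
    await resp.prepare(request)

    # Write the selection one page at a time instead of assembling the
    # whole multi-GB array in memory before responding.
    for page in selection_pages(request):    # hypothetical: split the selection into pages
        data = await read_page(page)         # hypothetical: read the chunks covering one page
        await resp.write(data)

    await resp.write_eof()
    return resp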

I'll update this issue when I have something to try out on a branch. It would be great if you could give it a try and report back!

TManhente commented 2 years ago

Hi, @jreadey, thank you for your response.

In your testing, did you change the max_request_size (default 100m) in HSDS?

Sorry, I've included the config.yml I used in the attached example (hsds-memory-usage-peaks.zip), but I forgot to mention this in the issue description. I've used mostly the default values, except for these two:

The h5pyd package knows to paginate large requests when it sees this error

I didn't know that. The client we're using is xarray with the h5netcdf engine, as I mentioned. I don't know their code in detail, but I've looked at it briefly and it seems that they call h5pyd to do the actual data access (i.e. send the requests to the server).

I've tried using h5pyd directly (without xarray and h5netcdf) with the server max_request_size set to the original 100m, and it failed nonetheless:

>>> import sys
>>> print(sys.version)
3.6.8 (default, Nov 16 2020, 16:55:22)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
>>> import h5pyd
>>> print(h5pyd.__version__)
0.10.1
>>> f = h5pyd.File('http://localhost:5101/home/example.h5', username='usr01', password='usr01pwd')
>>> ds = f['ds01']
>>> print(ds)
<HDF5 dataset "ds01": shape (1300, 1300, 1300), type "<u2">
>>> value = ds[:,:,:]
Traceback (most recent call last):
  File "/home/tmanhente/.local/lib/python3.6/site-packages/h5pyd/_hl/dataset.py", line 1132, in __getitem__
    rsp = self.GET(req, params=params, format="binary")
  File "/home/tmanhente/.local/lib/python3.6/site-packages/h5pyd/_hl/base.py", line 963, in GET
    raise IOError(rsp.status_code, rsp.reason)
OSError: [Errno 413] Request Entity Too Large

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "trace-exec.py", line 7, in <module>
    exec(l)
  File "<string>", line 1, in <module>
  File "/home/tmanhente/.local/lib/python3.6/site-packages/h5pyd/_hl/dataset.py", line 1141, in __getitem__
    raise IOError("Error retrieving data: {}".format(ioe.errno))
OSError: Error retrieving data: 413

I've retried it with HTTPConnection.debuglevel raised to see the requests sent by h5pyd (full log). It seems that h5pyd tried requesting the following data ranges:

  1. [0:1300:1,0:1300:1,0:1300:1] → 413 Request Entity Too Large (~4.1 GiB)
  2. [0: 656:1,0:1300:1,0:1300:1] → 413 Request Entity Too Large (~2.1 GiB)
  3. [0: 328:1,0:1300:1,0:1300:1] → 413 Request Entity Too Large (~1.1 GiB)
  4. [0: 164:1,0:1300:1,0:1300:1] → 413 Request Entity Too Large (~529 MiB)
  5. [0:  82:1,0:1300:1,0:1300:1] → 413 Request Entity Too Large (~264 MiB)
and then it gave up trying after the 5th attempt.

It looks like h5pyd has a heuristic that retries the request with the selection range halved, hoping that it fits, rather than parsing the 413 response body to get the size limit reported by the server and using it to set up a suitable selection range:

$ curl --user usr01:usr01pwd --header "Accept: application/octet-stream" "http://localhost:5101/datasets/d-2bf27178-d97f130f-925f-4b3ea4-b0bce0/value?select=%5B0%3A82%3A1%2C0%3A1300%3A1%2C0%3A1300%3A1%5D&domain=/home/example.h5"
Maximum request body size 277160000 exceeded, actual body size 104857600
                                                               ~~~~~~~~~

(it seems that the numbers are misplaced in the response body)
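For what it's worth, a client could parse the limit out of that body and use it to size the next request; a rough sketch (assuming the current wording of the message and, given the apparent number swap, simply taking the smaller of the two values as the limit):

import re

def parse_413_limit(body):
    """Extract the server's request size limit (in bytes) from a 413 response body."""
    # Expected wording: "Maximum request body size <n> exceeded, actual body size <m>"
    m = re.search(r"Maximum request body size (\d+) exceeded, actual body size (\d+)", body)
    if m is None:
        return None
    # The two numbers appear swapped in the current responses, so take the
    # smaller of the two as the effective limit.
    return min(int(m.group(1)), int(m.group(2)))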

The tobytes() function unfortunately copies the data.

Would it be possible to replace the tobytes() call with a zero-copy view of the array's buffer to avoid the data copy? Or would that not fit the interface expected by resp.write()?
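Something along these lines might work for fixed-size types (just a sketch under the assumption that resp.write() accepts any bytes-like object; not tested against the HSDS code):

import numpy as np

# Small array for illustration; the dataset in this issue is (1300, 1300, 1300).
arr = np.arange(2 * 3 * 4, dtype="<u2").reshape(2, 3, 4)

copied = arr.tobytes()            # makes a full copy of the data
view = memoryview(arr).cast("B")  # zero-copy byte view (array must be C-contiguous)

assert bytes(view) == copied
# await resp.write(view)  # avoids the copy only if write() accepts a memoryview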

Anyway, I'm currently working on a change that will stream requests. So rather than waiting for the entire response to be assembled on the server and then writing it to the response, the data will be sent continuously

Great news! That would surely be very helpful!

If there's anything else you would like me to try, please let me know.

jreadey commented 2 years ago

Actually, a mix of using h5pyd pagination and a (modest) increase in the max_request_size seems to work... If I set the max_request_size to 300m, then your program is able to read the entire 4GB dataset.

The reason is that the h5pyd pagination works by reducing the selection range in the first dimension at a chunk boundary. In this case, with a dataset shape of (1300, 1300, 1300) and a chunk shape of (82, 82, 163), the smallest page is (82, 1300, 1300), which works out to about 264 MB. I was able to set the max_request_size to 300m, leave the container memory size as it was, and everything worked.
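The arithmetic, for reference (2 bytes per element for the <u2 dataset):

import numpy as np

page_shape = (82, 1300, 1300)        # one chunk-aligned slab along the first dimension
itemsize = np.dtype("<u2").itemsize  # 2 bytes
page_bytes = int(np.prod(page_shape)) * itemsize
print(page_bytes, round(page_bytes / 1024**2))  # 277160000 bytes ≈ 264 MiB, under the 300m limit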

Give this a try and let me know if it works for you. Apologies if this seems a bit hacky, but we won't need to deal with this once the streaming logic is in.

BTW, did you run hsconfigure to save your endpoint, username, and password to the .hscfg file?
Unless the endpoint, username, and password are supplied in the h5pyd.File arguments, h5pyd will just pull them from the .hscfg. The file open would then look like: f = h5pyd.File('home/example.h5'). You get the same result either way, it's just less verbose using the .hscfg style.

jreadey commented 2 years ago

Update - the streaming support is checked into master now. This has the effect that a request of any size can be sent to or received from the service, as long as it is binary and uses a fixed-size type (support for variable-length types may be added later, but I don't think it makes sense to send GBs of JSON back and forth). Memory usage on the server should stay bounded regardless of how large the request is.

There's a test case for the feature at: https://github.com/HDFGroup/hsds/tree/master/tests/perf/stream.
It initializes a block of test data, writes it to a dataset, reads it back, and compares the sent and received data, e.g.:

$ python stream_test.py
testStream2D /home/john/perf2/stream.h5
dataset shape: [12000, 2200]
got dset_id: d-3424aca7-bd556ffc-2662-75c052-d3ecc1
initializing test data (211200000 bytes, 201.42 MiB) 
writing...
elapsed: 4.43 s, 45.48 mb/s
reading...
elapsed: 3.96 s, 50.86 mb/s
comparing sent vs. received
passed!

Let me know if you run into any problems with this.
The HSDS docker image is on DockerHub: hdfgroup/hsds:36a7c61

TManhente commented 2 years ago

Hi, @jreadey, sorry for the delay in replying to you.

Actually a mix of using h5pyd pagination and a (modest) increase in the max_request_size seems to work... If I set the max_request_size to 300m, then your program is able read the entire 4GB dataset. [...] Give this a try and let me know if it works for you.

I've tried setting the max_request_size to 300m and it worked: h5pyd managed to download the 4GB dataset by sending a series of smaller (~264MB) requests to the server. The service node uses up to 1.9 GB of memory for each of these requests, which is still a lot of memory, but within a reasonable limit.

BTW, did you run hsconfigure to save your endpoint, username, and password to the .hscfg file?

Yes, I do have an .hscfg file, but it points to the real server with my real credentials. Since I set up the dummy example attached to this issue to isolate the configuration and test scenario that reproduces the memory spikes, I found it easier to simply put the dummy server address and its dummy user credentials directly in the h5pyd.File() call rather than swapping my .hscfg file back and forth between the real and the dummy servers.

Update - the streaming support is checked into master now.

Nice! I've tried the streaming support by changing the HSDS Docker image version I was using from v0.7beta13 to 36a7c61, and it worked for the 4GB data: h5pyd was able to get the entire dataset with only a single request. The service node used at most 1.2 GB of memory. I didn't notice any high memory usage spikes.

jreadey commented 2 years ago

Ok, great - glad to hear everything worked!

If you want to further reduce the memory footprint, you can make the max_request_size config even smaller. In streaming mode, the server processes data in max_request_size pages. Making it too small, though, may reduce performance or result in 413 errors.

jreadey commented 2 years ago

Will close this issue now.