HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

not enough room in chunk cache - return 503 #332

Closed - assaron closed this issue 2 months ago

assaron commented 3 months ago

Hi, I'm continuing to try to upload the large file from https://github.com/HDFGroup/hsds/issues/329 and the upload still fails because of too many 503 errors.

Currently I'm in a situation where the upload fails very quickly after starting, I guess because the server got into some bad state. Here is the relevant portion of the hsload log:

Mar 26 18:53:41 got link: submission_date
Mar 26 18:53:41 got link: taxid_ch1
Mar 26 18:53:41 got link: title
Mar 26 18:53:41 got link: type
Mar 26 18:53:41 write_dataset src: /data/expression to tgt: /data/expression, shape: (53511, 932405), type: uint32
Mar 26 18:53:41 iterating over chunks for /data/expression
Mar 26 18:53:41 urllib3.exceptions.ResponseError: too many 500 error responses
Mar 26 18:53:41 
Mar 26 18:53:41 The above exception was the direct cause of the following exception:
Mar 26 18:53:41 
Mar 26 18:53:41 Traceback (most recent call last):
Mar 26 18:53:41   File "/home/alserg/.local/pipx/venvs/h5pyd/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
Mar 26 18:53:41     resp = conn.urlopen(
Mar 26 18:53:41   File "/home/alserg/.local/pipx/venvs/h5pyd/lib/python3.10/site-packages/urllib3/connectionpool.py", line 948, in urlopen
Mar 26 18:53:41     return self.urlopen(
Mar 26 18:53:41   File "/home/alserg/.local/pipx/venvs/h5pyd/lib/python3.10/site-packages/urllib3/connectionpool.py", line 948, in urlopen
Mar 26 18:53:41     return self.urlopen(
Mar 26 18:53:41   File "/home/alserg/.local/pipx/venvs/h5pyd/lib/python3.10/site-packages/urllib3/connectionpool.py", line 948, in urlopen
Mar 26 18:53:41     return self.urlopen(
Mar 26 18:53:41   File "/home/alserg/.local/pipx/venvs/h5pyd/lib/python3.10/site-packages/urllib3/connectionpool.py", line 938, in urlopen
Mar 26 18:53:41     retries = retries.increment(method, url, response=response, _pool=self)
Mar 26 18:53:41   File "/home/alserg/.local/pipx/venvs/h5pyd/lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment
Mar 26 18:53:41     raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
Mar 26 18:53:41 urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=5101): Max retries exceeded with url: /datasets/d-f3bde52c-2807ffd5-b695-f720e8-aba384/value?select=%5B0%3A53511%2C0%3A10%5D&domain=%2Ftest%2Fmouse.h5 (Caused by ResponseError('too many 500 error responses'))

It looks like it fails when trying to submit even the first chunk.

When I check the datanode container, I see the following errors in the log:

INFO> fileClient.put_object(db/f3bde52c-2807ffd5/.info.json bucket=public) start=1711479169.1833 finish=1711479169.1840 elapsed=0.0006 bytes=7284
INFO> _getFileStats(/data/public/db/f3bde52c-2807ffd5/.info.json) returning: {'ETag': '70e175daa4812ef6e282e1bfe9cdea73', 'Size': 7284, 'LastModified': 1711479169.1796105}                                         
INFO> bucketScan - sleep: 1                                                                               
REQ> PUT: /chunks/c-f3bde52c-2807ffd5-b695-f720e8-aba384_0_0 [172.20.0.4:6101]
WARN> PUT_Chunk c-f3bde52c-2807ffd5-b695-f720e8-aba384_0_0 - not enough room in chunk cache - return 503                                                                                                            
INFO> bucketGC - sleep: 1            
INFO> bucketScan - sleep: 1              
INFO> bucketGC - sleep: 1                                                                                 
INFO> bucketScan - sleep: 1                                                                               
INFO> bucketGC - sleep: 1                                                                                 
INFO> bucketScan - sleep: 1                                                                                                                                                                                         
INFO> bucketGC - sleep: 1                                                                                 
INFO> bucketScan - sleep: 1                                                                                                                                                                                         
INFO> bucketGC - sleep: 1                                                                                 
INFO> bucketScan - sleep: 1                                                                                                                                                                                         
INFO> bucketGC - sleep: 1
INFO> bucketScan - sleep: 1
INFO> bucketGC - sleep: 1
INFO> bucketScan - sleep: 1
REQ> PUT: /chunks/c-f3bde52c-2807ffd5-b695-f720e8-aba384_0_0 [172.20.0.4:6101]
WARN> PUT_Chunk c-f3bde52c-2807ffd5-b695-f720e8-aba384_0_0 - not enough room in chunk cache - return 503 
REQ> GET: /datasets/d-f3bde52c-2807ffd5-b695-f720e8-aba384 [172.20.0.4:6101]
INFO> get_metadata_obj: d-f3bde52c-2807ffd5-b695-f720e8-aba384 bucket: public
 RSP> <200> (OK): /datasets/d-f3bde52c-2807ffd5-b695-f720e8-aba384
REQ> PUT: /chunks/c-f3bde52c-2807ffd5-b695-f720e8-aba384_0_0 [172.20.0.4:6101]
WARN> PUT_Chunk c-f3bde52c-2807ffd5-b695-f720e8-aba384_0_0 - not enough room in chunk cache - return 503 
REQ> PUT: /chunks/c-f3bde52c-2807ffd5-b695-f720e8-aba384_0_0 [172.20.0.4:6101]
WARN> PUT_Chunk c-f3bde52c-2807ffd5-b695-f720e8-aba384_0_0 - not enough room in chunk cache - return 503 
REQ> PUT: /chunks/c-f3bde52c-2807ffd5-b695-f720e8-aba384_0_0 [172.20.0.4:6101]
WARN> PUT_Chunk c-f3bde52c-2807ffd5-b695-f720e8-aba384_0_0 - not enough room in chunk cache - return 503 

What's going on here? Is this expected behavior? It doesn't seem intuitive to me that a client request should fail when some internal cache gets saturated. It doesn't look like any flushing of the cache is initiated either.

assaron commented 3 months ago

And the strange thing is that this error happens early, even if I stop the server with stopall.sh and start it again with runall.sh. Could it be a manifestation of some other problem, like running out of system-wide file descriptors or something like that?

jreadey commented 3 months ago

The 503 is generated when the chunk cache is full of dirty chunks and there's no room for the incoming change. Normally this should just be a temporary state while the dirty chunks get flushed to storage. In the DN logs, do you see anything suspicious in the "s3sync" log messages? This could be an issue with your disk being out of space for example.
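
To illustrate, here's a rough sketch of the kind of admission check that produces this 503 (illustrative only, not the actual HSDS code): if accepting the incoming chunk would push the set of dirty, not-yet-flushed chunks past the cache limit, the write is rejected and the client is expected to retry once the sync task has flushed some chunks to storage.

# Hypothetical sketch of the "not enough room in chunk cache" check;
# names and structure are illustrative, not HSDS internals.
def put_chunk(dirty_cache, chunk_id, chunk_bytes, cache_limit):
    used = sum(len(data) for data in dirty_cache.values())
    if used + len(chunk_bytes) > cache_limit:
        return 503  # not enough room - caller should retry after a flush
    dirty_cache[chunk_id] = chunk_bytes  # stays dirty until the sync task writes it out
    return 201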

assaron commented 3 months ago

Hm, actually I had changed s3_sync_interval to some large number, as I assumed it only had something to do with AWS S3, which I don't use, and apparently that was incorrect. That explains why the errors started happening so early. I changed it because these s3_sync messages were spamming the logs (from my perspective) when I was originally trying to figure out what the problem was.

So am I right that this error can happen if hsload uploads the data faster than it can be written to disk by the server?

I've changed the interval back to 1 second and increased chunk cache size to 256M, let's see if I'll be able to upload the file now.
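
For reference, the relevant overrides would look something like this in admin/config/overrides.yml (the chunk cache key name below is my assumption - check config.yml for the exact name):

s3_sync_interval: 1              # seconds between flushes of dirty chunks to storage (applies to Posix too, despite the name)
chunk_mem_cache_size: 268435456  # assumed key for the 256M chunk cache setting - verify against config.yml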

jreadey commented 3 months ago

Sorry about the confusion with the config names -- it's a bit of a historical artifact: HSDS was first developed with only AWS S3 for storage. Later, support was added for Azure blobs and Posix. By that point, it was too late to go back and rename everything.

Yes, you'll get the error if data is coming in faster than it can be written (once the memory buffer is used up). A 503 should not be viewed as an error per se - it's just the server letting you know that it can't keep up with the request traffic at the moment.

In h5pyd, there's built-in backoff logic to resend requests when a 503 is received.
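
Conceptually, the client-side handling looks something like this (a minimal sketch, not the actual h5pyd code):

import random
import time

# Minimal sketch of 503 handling with exponential backoff;
# the real retry logic lives in h5pyd/requests/urllib3.
def send_with_backoff(send_request, max_retries=10, base_delay=0.25):
    for attempt in range(max_retries):
        status = send_request()
        if status != 503:
            return status
        # roughly double the wait each time, with a little jitter
        time.sleep(base_delay * (2 ** attempt) * random.uniform(0.8, 1.2))
    raise RuntimeError("too many 503 responses")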

assaron commented 3 months ago

Yeah, I totally understand the historic reasons :) Maybe the corresponding comment (https://github.com/HDFGroup/hsds/blob/master/admin/config/config.yml#L36) in the config file could be updated to expand on that a bit.

Meanwhile I still have the problem with too many 503s, but now for different reasons. Here is my setup and what I get:

1) I run the HSDS Posix server from Docker with one data node. I have only one node to simplify debugging, but apparently the problem doesn't depend on that, as I first had it with 4 nodes as well.

2) I'm trying to upload the big file (I think I had this error once with a smaller 2.4 GB file too, but I'm not sure). At some point (20:46:42) the client fails because of too many 503s.

3) When I look at SN node logs around that time, I see the following:

2024-03-26T20:46:12.074469675Z INFO> healthCheck - node_state: READY
2024-03-26T20:46:12.074482426Z INFO> register: http://head:5100/register
2024-03-26T20:46:12.074485613Z INFO> register req: http://head:5100/register body: {'id': 'sn-0b035', 'port': 5101, 'node_type': 'sn'}
2024-03-26T20:46:12.074488444Z INFO> http_post('http://head:5100/register', 3 bytes)
2024-03-26T20:46:12.076179602Z INFO> http_post status: 200
2024-03-26T20:46:12.076282679Z INFO> register response: {'cluster_state': 'WAITING', 'sn_urls': ['http://172.25.0.3:5101'], 'dn_urls': [], 'sn_ids': ['sn-0b035'], 'dn_ids': [], 'req_ip': '172.25.0.3'}
2024-03-26T20:46:12.076295539Z INFO> update_dn_info - dn_nodes: {'dn-5be79'} are no longer active
2024-03-26T20:46:12.076312153Z INFO> scaling - number of dn nodes has changed from 1 to 0
2024-03-26T20:46:12.076322315Z INFO> setting node_state from READY to WAITING
2024-03-26T20:46:22.078196319Z INFO> healthCheck - node_state: WAITING
2024-03-26T20:46:22.078207013Z INFO> register: http://head:5100/register
2024-03-26T20:46:22.078209902Z INFO> register req: http://head:5100/register body: {'id': 'sn-0b035', 'port': 5101, 'node_type': 'sn'}
2024-03-26T20:46:22.078212774Z INFO> http_post('http://head:5100/register', 3 bytes)
2024-03-26T20:46:22.079730958Z INFO> http_post status: 200
2024-03-26T20:46:22.079785193Z INFO> register response: {'cluster_state': 'WAITING', 'sn_urls': ['http://172.25.0.3:5101'], 'dn_urls': [], 'sn_ids': ['sn-0b035'], 'dn_ids': [], 'req_ip': '172.25.0.3'}
2024-03-26T20:46:23.481801262Z ERROR> Unexpected exception <class 'asyncio.exceptions.TimeoutError'> for write_chunk_hyperslab(c-5743d89e-3dd64172-4be5-fb67d4-b2c297_22):
2024-03-26T20:46:23.481824820Z WARN> ChunkCrawler.doWork - retry: 1, sleeping for 0.26
2024-03-26T20:46:23.740548618Z INFO> write_chunk_hyperslab, chunk_id:c-5743d89e-3dd64172-4be5-fb67d4-b2c297_22, slices:(slice(659383, 663026, 1),), bucket: public
2024-03-26T20:46:23.740561739Z WARN> Service not ready
2024-03-26T20:46:23.740834504Z WARN> HTTPServiceUnavailable for write_chunk_hyperslab(c-5743d89e-3dd64172-4be5-fb67d4-b2c297_22): Service Unavailable
2024-03-26T20:46:23.740845789Z WARN> ChunkCrawler.doWork - retry: 2, sleeping for 0.45
2024-03-26T20:46:24.190498874Z INFO> write_chunk_hyperslab, chunk_id:c-5743d89e-3dd64172-4be5-fb67d4-b2c297_22, slices:(slice(659383, 663026, 1),), bucket: public
2024-03-26T20:46:24.190540235Z WARN> Service not ready
2024-03-26T20:46:24.190811659Z WARN> HTTPServiceUnavailable for write_chunk_hyperslab(c-5743d89e-3dd64172-4be5-fb67d4-b2c297_22): Service Unavailable
2024-03-26T20:46:24.190815994Z WARN> ChunkCrawler.doWork - retry: 3, sleeping for 0.84
2024-03-26T20:46:25.026741833Z INFO> write_chunk_hyperslab, chunk_id:c-5743d89e-3dd64172-4be5-fb67d4-b2c297_22, slices:(slice(659383, 663026, 1),), bucket: public
2024-03-26T20:46:25.026769802Z WARN> Service not ready
2024-03-26T20:46:25.027061408Z WARN> HTTPServiceUnavailable for write_chunk_hyperslab(c-5743d89e-3dd64172-4be5-fb67d4-b2c297_22): Service Unavailable
2024-03-26T20:46:25.027067480Z WARN> ChunkCrawler.doWork - retry: 4, sleeping for 1.69
2024-03-26T20:46:26.717738706Z INFO> write_chunk_hyperslab, chunk_id:c-5743d89e-3dd64172-4be5-fb67d4-b2c297_22, slices:(slice(659383, 663026, 1),), bucket: public
2024-03-26T20:46:26.717760562Z WARN> Service not ready
2024-03-26T20:46:26.718044763Z WARN> HTTPServiceUnavailable for write_chunk_hyperslab(c-5743d89e-3dd64172-4be5-fb67d4-b2c297_22): Service Unavailable
2024-03-26T20:46:26.718058338Z WARN> ChunkCrawler.doWork - retry: 5, sleeping for 3.24
...
2024-03-26T20:46:36.414733766Z INFO> write_chunk_hyperslab, chunk_id:c-5743d89e-3dd64172-4be5-fb67d4-b2c297_22, slices:(slice(659383, 663026, 1),), bucket: public
2024-03-26T20:46:36.414744826Z WARN> Service not ready
2024-03-26T20:46:36.414786584Z WARN> HTTPServiceUnavailable for write_chunk_hyperslab(c-5743d89e-3dd64172-4be5-fb67d4-b2c297_22): Service Unavailable
2024-03-26T20:46:36.414792770Z ERROR> ChunkCrawler action: write_chunk_hyperslab failed after: 7 retries
2024-03-26T20:46:36.414795810Z INFO> ChunkCrawler - worker status for chunk c-5743d89e-3dd64172-4be5-fb67d4-b2c297_22: 503
2024-03-26T20:46:36.414871840Z INFO> ChunkCrawler - join complete - count: 1
2024-03-26T20:46:36.414875693Z INFO> returning chunk_status: 503 for chunk: c-5743d89e-3dd64172-4be5-fb67d4-b2c297_22
2024-03-26T20:46:36.414879323Z WARN> crawler failed for page: 1 with status: 503
2024-03-26T20:46:36.414883135Z INFO> doWriteSelection raising HTTPInternalServerError for status:  503
2024-03-26T20:46:36.416784043Z REQ> PUT: /datasets/d-5743d89e-3dd64172-4be5-fb67d4-b2c297/value [/test/mouse.h5]
2024-03-26T20:46:36.416792613Z WARN> returning 503 - node_state: WAITING
2024-03-26T20:46:38.420338860Z REQ> PUT: /datasets/d-5743d89e-3dd64172-4be5-fb67d4-b2c297/value [/test/mouse.h5]
2024-03-26T20:46:38.420354866Z WARN> returning 503 - node_state: WAITING
...
2024-03-26T20:48:12.159589164Z INFO> healthCheck - node_state: WAITING
2024-03-26T20:48:12.159601004Z INFO> register: http://head:5100/register
2024-03-26T20:48:12.159604422Z INFO> register req: http://head:5100/register body: {'id': 'sn-0b035', 'port': 5101, 'node_type': 'sn'}
2024-03-26T20:48:12.159607437Z INFO> http_post('http://head:5100/register', 3 bytes)
2024-03-26T20:48:12.161416055Z INFO> http_post status: 200
2024-03-26T20:48:12.161484946Z INFO> register response: {'cluster_state': 'WAITING', 'sn_urls': ['http://172.25.0.3:5101'], 'dn_urls': [], 'sn_ids': ['sn-0b035'], 'dn_ids': [], 'req_ip': '172.25.0.3'}
2024-03-26T20:48:22.172166029Z INFO> healthCheck - node_state: WAITING
2024-03-26T20:48:22.172201294Z INFO> register: http://head:5100/register
2024-03-26T20:48:22.172204679Z INFO> register req: http://head:5100/register body: {'id': 'sn-0b035', 'port': 5101, 'node_type': 'sn'}
2024-03-26T20:48:22.172207904Z INFO> http_post('http://head:5100/register', 3 bytes)
2024-03-26T20:48:22.173938215Z INFO> http_post status: 200
2024-03-26T20:48:22.174012492Z INFO> register response: {'cluster_state': 'READY', 'sn_urls': ['http://172.25.0.3:5101'], 'dn_urls': ['http://172.25.0.4:6101'], 'sn_ids': ['sn-0b035'], 'dn_ids': ['dn-5be79'], 'req_ip': '172.25.0.3'}
2024-03-26T20:48:22.174057259Z INFO> update_dn_info - dn_nodes: {'dn-5be79'} are now active
2024-03-26T20:48:22.174061732Z INFO> scaling - number of dn nodes has changed from 0 to 1
2024-03-26T20:48:22.174064576Z INFO> setting node_state from WAITING to READY
2024-03-26T20:48:32.185151482Z INFO> healthCheck - node_state: READY

So the data node went down from 20:46:12 to 20:48:22.

4) When I look at data node logs I see the following:

Mar 26 20:45:52 REQ> PUT: /chunks/c-5743d89e-3dd64172-4be5-fb67d4-b2c297_22 [172.25.0.4:6101]
Mar 26 20:45:52 INFO> get_metadata_obj: d-5743d89e-3dd64172-4be5-fb67d4-b2c297 bucket: public
Mar 26 20:45:52 INFO> getStorBytes(public/db/5743d89e-3dd64172/d/4be5-fb67d4-b2c297/22, offset=0, length: 0)
Mar 26 20:45:52 INFO> get_object - filepath: /data/public/db/5743d89e-3dd64172/d/4be5-fb67d4-b2c297/22
Mar 26 20:45:52 WARN> fileClient: db/5743d89e-3dd64172/d/4be5-fb67d4-b2c297/22 not found
Mar 26 20:45:52  RSP> <200> (OK): /chunks/c-5743d89e-3dd64172-4be5-fb67d4-b2c297_22
Mar 26 20:45:52 INFO> root_scan id g-5743d89e-3dd64172-dfcb-5016b5-5ec9fa: bucket: public timestamp: 1711485946.1338894
Mar 26 20:45:52 INFO> add g-5743d89e-3dd64172-dfcb-5016b5-5ec9fa to scan list
Mar 26 20:45:52 INFO> bucketScan for: g-5743d89e-3dd64172-dfcb-5016b5-5ec9fa bucket: public
Mar 26 20:45:52 INFO> scanRoot for rootid: g-5743d89e-3dd64172-dfcb-5016b5-5ec9fa bucket: public
Mar 26 20:45:52 INFO> getStorKeys('db/5743d89e-3dd64172/','','', include_stats=True
Mar 26 20:45:52 INFO> list_keys('db/5743d89e-3dd64172/','','' include_stats=True, bucket=public, callback set
Mar 26 20:45:52 INFO> bucketGC - sleep: 1
Mar 26 20:45:53 INFO> _getFileStats(/data/public/db/5743d89e-3dd64172/.group.json) returning: {'ETag': 'b3a282e7e6008631fe974ab9b02cf5ff', 'Size': 424, 'LastModified': 1711483164.497876}
Mar 26 20:45:53 INFO> _getFileStats(/data/public/db/5743d89e-3dd64172/g/fff4-cacb14-cafc9c/.group.json) returning: {'ETag': '52e320831afda230deb0cf49f894e5f4', 'Size': 3780, 'LastModified': 1711483164.6498787}
Mar 26 20:45:53 INFO> _getFileStats(/data/public/db/5743d89e-3dd64172/g/e4c4-50fbc7-f56280/.group.json) returning: {'ETag': 'a36257fa84b48ecd9b0ec7063bbdaea5', 'Size': 552, 'LastModified': 1711483164.497876}
...
< ~93,000 lines with _getFileStats are omitted here >
...
 Mar 26 20:48:19 INFO> _getFileStats(/data/public/db/5743d89e-3dd64172/d/01ca-e95534-ee3dc8/8) returning: {'ETag': '97b0d19a5be5c987829e533961b62636', 'Size': 5252, 'LastModified': 1711485876.3238943}
Mar 26 20:48:19 INFO> _getFileStats(/data/public/db/5743d89e-3dd64172/d/01ca-e95534-ee3dc8/9) returning: {'ETag': '12a731b8a59078230a72dad20c6dcf5c', 'Size': 7804, 'LastModified': 1711485876.9439056}
Mar 26 20:48:19 INFO> _getFileStats(/data/public/db/5743d89e-3dd64172/d/3942-89943b-05b3c4/.dataset.json) returning: {'ETag': 'e3ff1e1215a2a99d525ccd8d5d7feff6', 'Size': 601, 'LastModified': 1711483163.4018543}
Mar 26 20:48:21 INFO> listKeys done, got 93520 keys
Mar 26 20:48:21 INFO> getStorKeys done for prefix: db/5743d89e-3dd64172/
Mar 26 20:48:21 INFO> scanRoot - got 93520 keys for rootid: g-5743d89e-3dd64172-dfcb-5016b5-5ec9fa
Mar 26 20:48:21 INFO> got dataset: d-5743d89e-3dd64172-4024-fa1e1c-4304cd: {'lastModified': 1711483163.297852, 'num_chunks': 0, 'allocated_bytes': 0, 'logical_bytes': 0, 'linked_bytes': 0, 'num_linked_chunks': 0}
Mar 26 20:48:21 INFO> getStorJSONObj(public)/db/5743d89e-3dd64172/d/4024-fa1e1c-4304cd/.dataset.json
Mar 26 20:48:21 INFO> get_object - filepath: /data/public/db/5743d89e-3dd64172/d/4024-fa1e1c-4304cd/.dataset.json
Mar 26 20:48:21 INFO> s3sync nothing to update
Mar 26 20:48:21 INFO> s3syncCheck no objects to write, sleeping for 1.00
Mar 26 20:48:21 INFO> bucketGC - sleep: 1
Mar 26 20:48:21 INFO> healthCheck - node_state: READY
Mar 26 20:48:21 INFO> register: http://head:5100/register
Mar 26 20:48:21 INFO> register req: http://head:5100/register body: {'id': 'dn-5be79', 'port': 6101, 'node_type': 'dn'}
Mar 26 20:48:21 INFO> http_post('http://head:5100/register', 3 bytes)

So this list_keys procedure brought the node down for 2.5 minutes, which hsload wasn't able to deal with.

Is that expected behavior, or did I misconfigure something? Shouldn't multiple data nodes resolve this problem? I believe I still had it with 4 nodes, but maybe that was a different problem.

I'm now trying to run hsload with retries=20. If I understand correctly that the delay between retries grows exponentially, that should work around this issue.

assaron commented 3 months ago

No luck, but now it's a slightly different problem.

I now use the following command: hsload --verbose --retries 20 mouse_gene_v2.3.h5 /test/mouse_2.3.h5. At some point the data node goes down as before, but now I get a different error from hsload:

Mar 27 00:02:59 writing dataset data for slice: (slice(604738, 608381, 1),)
Mar 27 00:02:59 writing dataset data for slice: (slice(608381, 612024, 1),)
Mar 27 00:03:43 ERROR 2024-03-27 00:03:43,386 ERROR : failed to copy dataset data (slice(637525, 641168, 1),): unsupported operand type(s) for -=: 'str' and 'int'
Mar 27 00:03:43 writing dataset data for slice: (slice(612024, 615667, 1),)
Mar 27 00:03:43 writing dataset data for slice: (slice(615667, 619310, 1),)
Mar 27 00:03:43 writing dataset data for slice: (slice(619310, 622953, 1),)
Mar 27 00:03:43 writing dataset data for slice: (slice(622953, 626596, 1),)
Mar 27 00:03:43 writing dataset data for slice: (slice(626596, 630239, 1),)
Mar 27 00:03:43 writing dataset data for slice: (slice(630239, 633882, 1),)
Mar 27 00:03:43 writing dataset data for slice: (slice(633882, 637525, 1),)
Mar 27 00:03:43 Traceback (most recent call last):
Mar 27 00:03:43   File "/home/alserg/.local/bin/hsload", line 8, in <module>
Mar 27 00:03:43     sys.exit(main())
Mar 27 00:03:43   File "/home/alserg/.local/pipx/venvs/h5pyd/lib/python3.10/site-packages/h5pyd/_apps/hsload.py", line 290, in main
Mar 27 00:03:43     load_file(fin, fout, **kwargs)
Mar 27 00:03:43   File "/home/alserg/.local/pipx/venvs/h5pyd/lib/python3.10/site-packages/h5pyd/_apps/utillib.py", line 1632, in load_file
Mar 27 00:03:43     fin.visititems(object_copy_helper)
Mar 27 00:03:43   File "/home/alserg/.local/pipx/venvs/h5pyd/lib/python3.10/site-packages/h5py/_hl/group.py", line 668, in visititems
Mar 27 00:03:43     return h5o.visit(self.id, proxy)
Mar 27 00:03:43   File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
Mar 27 00:03:43   File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
Mar 27 00:03:43   File "h5py/h5o.pyx", line 354, in h5py.h5o.visit
Mar 27 00:03:43   File "h5py/h5o.pyx", line 301, in h5py.h5o.cb_obj_simple
Mar 27 00:03:43   File "/home/alserg/.local/pipx/venvs/h5pyd/lib/python3.10/site-packages/h5py/_hl/group.py", line 667, in proxy
Mar 27 00:03:43     return func(name, self[name])
Mar 27 00:03:43   File "/home/alserg/.local/pipx/venvs/h5pyd/lib/python3.10/site-packages/h5pyd/_apps/utillib.py", line 1604, in object_copy_helper
Mar 27 00:03:43     write_dataset(obj, tgt, ctx)
Mar 27 00:03:43   File "/home/alserg/.local/pipx/venvs/h5pyd/lib/python3.10/site-packages/h5pyd/_apps/utillib.py", line 1373, in write_dataset
Mar 27 00:03:43     tgt[tgt_s] = arr
Mar 27 00:03:43   File "/home/alserg/.local/pipx/venvs/h5pyd/lib/python3.10/site-packages/h5pyd/_hl/dataset.py", line 1584, in __setitem__
Mar 27 00:03:43     self.PUT(req, body=body, format=format, params=params)
Mar 27 00:03:43   File "/home/alserg/.local/pipx/venvs/h5pyd/lib/python3.10/site-packages/h5pyd/_hl/base.py", line 1007, in PUT
Mar 27 00:03:43     rsp = self._id._http_conn.PUT(req, body=body, params=params, format=format)
Mar 27 00:03:43   File "/home/alserg/.local/pipx/venvs/h5pyd/lib/python3.10/site-packages/h5pyd/_hl/httpconn.py", line 587, in PUT
Mar 27 00:03:43     rsp = s.put(
Mar 27 00:03:43   File "/home/alserg/.local/pipx/venvs/h5pyd/lib/python3.10/site-packages/requests/sessions.py", line 649, in put
Mar 27 00:03:43     return self.request("PUT", url, data=data, **kwargs)
Mar 27 00:03:43   File "/home/alserg/.local/pipx/venvs/h5pyd/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
Mar 27 00:03:43     resp = self.send(prep, **send_kwargs)
Mar 27 00:03:43   File "/home/alserg/.local/pipx/venvs/h5pyd/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
Mar 27 00:03:43     r = adapter.send(request, **kwargs)
Mar 27 00:03:43   File "/home/alserg/.local/pipx/venvs/h5pyd/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
Mar 27 00:03:43     resp = conn.urlopen(
Mar 27 00:03:43   File "/home/alserg/.local/pipx/venvs/h5pyd/lib/python3.10/site-packages/urllib3/connectionpool.py", line 938, in urlopen
Mar 27 00:03:43     retries = retries.increment(method, url, response=response, _pool=self)
Mar 27 00:03:43   File "/home/alserg/.local/pipx/venvs/h5pyd/lib/python3.10/site-packages/urllib3/util/retry.py", line 449, in increment
Mar 27 00:03:43     total -= 1
Mar 27 00:03:43 TypeError: unsupported operand type(s) for -=: 'str' and 'int'

I guess the retries number is not converted to an int somewhere?
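
A minimal illustration of the failure mode (assuming urllib3's Retry, which the traceback points at; not the h5pyd code itself):

from urllib3.util.retry import Retry

# Retry stores whatever it is given as `total` and only fails later,
# when increment() runs `total -= 1` against the string.
bad = Retry(total="20")        # accepted at construction time
good = Retry(total=int("20"))  # converting to int avoids the TypeError in increment()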

assaron commented 3 months ago

After fixing the conversion of retries to int, I was finally able to upload the file. I didn't open a PR as I'm not sure it's the appropriate way to fix this, but here is my patch.

diff --git a/h5pyd/_apps/hsload.py b/h5pyd/_apps/hsload.py
index c0dae7c..8fb72ee 100755
--- a/h5pyd/_apps/hsload.py
+++ b/h5pyd/_apps/hsload.py
@@ -233,7 +233,7 @@ def main():
                     "endpoint": cfg["hs_endpoint"],
                     "bucket": cfg["hs_bucket"],
                     "mode": mode,
-                    "retries": cfg["retries"],
+                    "retries": int(cfg["retries"]),
                 }

                 fout = h5pyd.File(tgt, **kwargs)

Still, I don't really like the history of the server status. Here is an excerpt from the logs (I grepped for "update_dn_info" in the SN node logs):

2024-03-27T00:47:28.768426991Z INFO> update_dn_info - dn_nodes: {'dn-b52e9', 'dn-71c0c'} are now active
2024-03-27T00:47:38.781317856Z INFO> update_dn_info - dn_nodes: {'dn-623fa', 'dn-7ac2c'} are now active
2024-03-27T01:25:19.700447416Z INFO> update_dn_info - dn_nodes: {'dn-7ac2c'} are no longer active
2024-03-27T01:26:09.718891894Z INFO> update_dn_info - dn_nodes: {'dn-71c0c'} are no longer active
2024-03-27T01:26:39.728491309Z INFO> update_dn_info - dn_nodes: {'dn-b52e9'} are no longer active
2024-03-27T01:26:39.728496136Z INFO> update_dn_info - dn_nodes: {'dn-7ac2c'} are now active      
2024-03-27T01:27:19.896704338Z INFO> update_dn_info - dn_nodes: {'dn-7ac2c'} are no longer active
2024-03-27T01:27:29.899958012Z INFO> update_dn_info - dn_nodes: {'dn-71c0c'} are now active                                                                                                                         
2024-03-27T01:28:09.916915532Z INFO> update_dn_info - dn_nodes: {'dn-b52e9'} are now active      
2024-03-27T01:28:19.920248543Z INFO> update_dn_info - dn_nodes: {'dn-71c0c'} are no longer active
2024-03-27T01:28:39.928155902Z INFO> update_dn_info - dn_nodes: {'dn-b52e9'} are no longer active
2024-03-27T01:28:49.932225709Z INFO> update_dn_info - dn_nodes: {'dn-7ac2c'} are now active
2024-03-27T01:29:40.417611488Z INFO> update_dn_info - dn_nodes: {'dn-71c0c'} are now active
2024-03-27T01:29:50.424240733Z INFO> update_dn_info - dn_nodes: {'dn-7ac2c'} are no longer active
2024-03-27T01:30:00.428970708Z INFO> update_dn_info - dn_nodes: {'dn-71c0c'} are no longer active
2024-03-27T01:30:00.428973640Z INFO> update_dn_info - dn_nodes: {'dn-b52e9'} are now active
2024-03-27T01:30:30.437302472Z INFO> update_dn_info - dn_nodes: {'dn-b52e9'} are no longer active
2024-03-27T01:31:10.663728217Z INFO> update_dn_info - dn_nodes: {'dn-7ac2c'} are now active
2024-03-27T01:31:20.668844741Z INFO> update_dn_info - dn_nodes: {'dn-71c0c'} are now active
2024-03-27T01:31:30.672212219Z INFO> update_dn_info - dn_nodes: {'dn-7ac2c'} are no longer active
2024-03-27T01:31:40.687695722Z INFO> update_dn_info - dn_nodes: {'dn-623fa'} are no longer active
2024-03-27T01:32:00.695297468Z INFO> update_dn_info - dn_nodes: {'dn-b52e9'} are now active
2024-03-27T01:32:10.704267839Z INFO> update_dn_info - dn_nodes: {'dn-71c0c'} are no longer active
2024-03-27T01:32:30.714140721Z INFO> update_dn_info - dn_nodes: {'dn-b52e9'} are no longer active
< NB: there are no more active DNs here>
2024-03-27T01:33:00.725386883Z INFO> update_dn_info - dn_nodes: {'dn-623fa', 'dn-7ac2c'} are now active
2024-03-27T01:33:40.806102841Z INFO> update_dn_info - dn_nodes: {'dn-b52e9', 'dn-71c0c'} are now active
2024-03-27T01:33:50.809363222Z INFO> update_dn_info - dn_nodes: {'dn-7ac2c'} are no longer active
2024-03-27T01:34:51.790716082Z INFO> update_dn_info - dn_nodes: {'dn-71c0c'} are no longer active
2024-03-27T01:35:01.793916853Z INFO> update_dn_info - dn_nodes: {'dn-7ac2c'} are now active
2024-03-27T01:36:12.373607282Z INFO> update_dn_info - dn_nodes: {'dn-7ac2c'} are no longer active
2024-03-27T01:36:12.373611742Z INFO> update_dn_info - dn_nodes: {'dn-71c0c'} are now active
2024-03-27T01:36:32.379659345Z INFO> update_dn_info - dn_nodes: {'dn-71c0c'} are no longer active
2024-03-27T01:37:22.397715504Z INFO> update_dn_info - dn_nodes: {'dn-b52e9'} are no longer active
2024-03-27T01:37:42.492858761Z INFO> update_dn_info - dn_nodes: {'dn-71c0c'} are now active
2024-03-27T01:38:02.645412336Z INFO> update_dn_info - dn_nodes: {'dn-71c0c'} are no longer active
2024-03-27T01:38:12.649502025Z INFO> update_dn_info - dn_nodes: {'dn-623fa'} are no longer active

Basically, after some time, data nodes constantly go down because of this list_keys operation (I'm not sure what triggers it or what its purpose is). So even if I have multiple nodes, at some point they all end up being down at the same time. I would be thankful for some guidance on how to configure the server better, if that's possible.

jreadey commented 3 months ago

So if you do a docker ps, do you see different creation times for the DN containers? In general they are fairly robust - I have containers run without a restart for weeks at a time under load. When there is a crash, it's usually due to an out-of-memory exception. This can happen, for example, when the amount of memory used for in-flight requests + chunk cache + metadata cache exceeds the memory limit in the Docker config. So if you bump the chunk cache size, it's important to set the DN_RAM environment variable to a large enough value. DN_RAM gets used by the docker-compose yaml.

Anyway, I'll try loading the file myself and see if I hit similar issues.

assaron commented 3 months ago

@jreadey No, the DN containers aren't actually failing and continue to work. However, while this list_keys operation is happening they don't process other requests, so after about 20 seconds the SN marks them as inactive. After list_keys finishes (which takes a couple of minutes) they report back to the SN and it marks them as active again.

The problem seems to stem from the listKeys operation, which needs to walk over 93K keys and takes a significant amount of time. This number of keys is very similar to the number of chunks reported by hsstat. I guess this problem can happen whenever the dataset/HDF5 file being uploaded results in many chunks.

jreadey commented 3 months ago

OK, I think I understand what's going on now. The SN container(s) poll the DNs periodically to make sure they are still alive. If a DN gets bogged down in a request that takes a long time to process (without any await), the SN will assume the DN is down and stop sending it requests until the health check comes back. The DNs use asyncio, so if there's, say, a bit of a wait to read or write data, that won't interfere with the SN health checks.

With S3 storage this all goes as planned since all the S3 calls are asynchronous. For Posix storage we use aiofiles, which has async calls for reading and writing, but not for listing files (see https://github.com/Tinche/aiofiles/issues/160). So with a large number of files to list, list_dir takes too long and the health check is missed.
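
(For what it's worth, one generic way around a blocking directory listing under asyncio is to push it onto a worker thread so the event loop keeps running - a hedged sketch, not what HSDS actually does:)

import asyncio
import os

# Offload the blocking listing to a thread so other tasks
# (like the health check) can keep running; illustrative only.
async def list_dir_async(path):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, os.listdir, path)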

Ironically, list_dir is not critical for hsload. It gets called by scanRoot, a background task whose purpose is to tabulate the total storage used by a domain (which you'll see if you run hsstat).

I have a patch which should fix this. I'll give it a try and report back.

assaron commented 3 months ago

Great! Thanks for looking into this!

jreadey commented 3 months ago

My load failed about a third of the way through with a retries-exceeded message. Checking the logs, there were 4 "no longer active" lines, which at least seems like an improvement.

I'll try another run, update the default retries, and increase the time before a node is considered inactive (the head_sleep_time and node_sleep_time configs).

jreadey commented 3 months ago

Second attempt worked - no errors with hsload --retries 10. On the HSDS side, I used the Listkeys branch (now merged to master) and the following overrides (in hsds/admin/config/overrides.yml):

head_sleep_time: 60 # max sleep time between health checks for head node
node_sleep_time: 60 # max sleep time between health checks for SN/DN nodes

These change the time between health checks from 10 seconds to 60.

@assaron - let me know if you are able to do a hsload with this setup.

assaron commented 2 months ago

@jreadey thanks for the updates! I've checked the new version and the number of inactive nodes definitely decreased, even with the default server settings. It still failed on my first attempt with the default retries number, but the second one with --retries 10 worked fine.

Though I don't understand why the DN nodes still go down at all. I'd have expected your patch that adds breaks to ListKeys to fix that completely.

jreadey commented 2 months ago

I'm not sure why we are still seeing an occasional inactive node. I suspect there is still somewhere in the code that doesn't let other tasks run for a significant time. I'll keep an eye out for any pieces of code that seem suspicious. Since it doesn't show up with the S3 driver, it must have something to do with the Posix implementation.

assaron commented 2 months ago

@jreadey these downtimes still seem to be associated with the list_keys operation.

Do I get this right that the DN nodes are supposed to do a healthCheck every 10 seconds, telling the head node that everything is OK, and otherwise the head node can declare the DN node no longer active? Because I see this in the DN log (I'm only showing the healthCheck and listKeys lines):

2024-04-01T21:19:46.892908905Z INFO> healthCheck - node_state: READY
2024-04-01T21:19:56.895711023Z INFO> healthCheck - node_state: READY
2024-04-01T21:20:06.899823404Z INFO> healthCheck - node_state: READY
2024-04-01T21:20:15.247067053Z INFO> list_keys('db/12fff16b-c5b44bf9/','','' include_stats=True, bucket=public, callback set
2024-04-01T21:21:54.743319810Z INFO> listKeys done, got 93520 keys  
2024-04-01T21:21:54.747772575Z INFO> healthCheck - node_state: READY
2024-04-01T21:22:04.758073075Z INFO> healthCheck - node_state: READY
2024-04-01T21:22:14.761884799Z INFO> healthCheck - node_state: READY
2024-04-01T21:22:14.842243885Z INFO> list_keys('db/12fff16b-c5b44bf9/','','' include_stats=True, bucket=public, callback set
2024-04-01T21:23:58.673291706Z INFO> listKeys done, got 93537 keys  
2024-04-01T21:23:58.773766736Z INFO> healthCheck - node_state: READY
2024-04-01T21:24:08.968729444Z INFO> healthCheck - node_state: READY
2024-04-01T21:24:18.973231242Z INFO> healthCheck - node_state: READY

So the healthCheck is not happening for the duration of the listKeys operation.

Here is the part of the SN node log with node status updates for the same period:

2024-04-01T21:20:27.072151585Z INFO> update_dn_info - dn_nodes: {'dn-5fbd7'} are no longer active
2024-04-01T21:21:58.111102272Z INFO> update_dn_info - dn_nodes: {'dn-5fbd7'} are now active
2024-04-01T21:22:28.122799490Z INFO> update_dn_info - dn_nodes: {'dn-8daf7'} are no longer active
2024-04-01T21:22:38.126361635Z INFO> update_dn_info - dn_nodes: {'dn-5fbd7'} are no longer active

I don't know how to check it properly, but I assume the DN node whose log I showed above is this dn-5fbd7.

This looks suspicious to me. I even double-checked that I'm running the Docker image I built myself from the repo, and that seemed fine.

jreadey commented 2 months ago

The way asyncio works in Python, no other task can run until the current task does an "await". The DN containers send a periodic health check to let the head container know that they haven't crashed or gone away. From your log timings, it seems like list_keys was running for over a minute before the healthCheck task could run. This caused the head node to think something had gone wrong and mark the container as inactive.
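
A tiny self-contained example of that scheduling behavior (nothing HSDS-specific): the heartbeat task below only gets to run when the busy task yields; remove the await asyncio.sleep(0) and the heartbeat is starved until the loop finishes, which is what happens to the DN health check during list_keys.

import asyncio

# Demo of asyncio's cooperative scheduling: without an await inside the
# busy loop, the heartbeat coroutine cannot run until the loop is done.
async def heartbeat():
    for _ in range(5):
        print("healthCheck tick")
        await asyncio.sleep(0.5)

async def busy_work():
    total = 0
    for i in range(5_000_000):
        total += i * i              # stand-in for CPU-bound per-key work
        if i % 500_000 == 0:
            await asyncio.sleep(0)  # yield so the heartbeat can run
    return total

async def main():
    await asyncio.gather(heartbeat(), busy_work())

asyncio.run(main())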

My update put in some awaits which I thought would solve the issue. I'll take another look. Until then you can use the head_sleep_time/node_sleep_time configs to allow more time between health check runs. Judging from the logs, 120 is probably a better setting than 60.

assaron commented 2 months ago

Just to note, I used the default config and didn't change the sleep time from 10 to 60 seconds as you suggested. Actually, when I think about it, that seems like a way to just hide the problem: the node won't be pronounced inactive, but it still won't respond to any requests. Or am I wrong here?

Anyway, I tried to look a bit closer at how this sleep works and added the following debug output:

diff --git a/hsds/util/fileClient.py b/hsds/util/fileClient.py
index 6f70c03..360e731 100644
--- a/hsds/util/fileClient.py
+++ b/hsds/util/fileClient.py
@@ -354,7 +354,9 @@ class FileClient:
                     if limit and len(files) >= limit:
                         break
                     if len(files) % 1000 == 0:
+                        log.info(f"Sleeping to make room for other tasks, len(files) = {len(files)}")
                         await asyncio.sleep(0)
+                        log.info(f"Woke up after sleep, len(files) = {len(files)}")
                 break  # don't recurse into subdirs

             else:
@@ -368,7 +370,9 @@ class FileClient:
                     if limit and len(files) >= limit:
                         break
                     if len(files) % 1000 == 0:
+                        log.info(f"Sleeping to make room for other tasks, len(files) = {len(files)}")
                         await asyncio.sleep(0)
+                        log.info(f"Woke up after sleep, len(files) = {len(files)}")

         # use a dictionary to hold return values if stats are needed
         key_names = {} if include_stats else []

Unexpectedly to me, here are the DN logs that I got:

2024-04-02T21:50:59.160583053Z INFO> healthCheck - node_state: READY
2024-04-02T21:51:07.013573783Z INFO> root_scan id g-dc3a2ac5-e77e9e7e-54b2-a24d6f-f616f6: bucket: public timestamp: 1712094649.9877658
2024-04-02T21:51:07.013619339Z INFO> add g-dc3a2ac5-e77e9e7e-54b2-a24d6f-f616f6 to scan list
2024-04-02T21:51:07.013622770Z INFO> bucketScan for: g-dc3a2ac5-e77e9e7e-54b2-a24d6f-f616f6 bucket: public
2024-04-02T21:51:07.013625826Z INFO> scanRoot for rootid: g-dc3a2ac5-e77e9e7e-54b2-a24d6f-f616f6 bucket: public
2024-04-02T21:51:07.013628549Z INFO> getStorKeys('db/dc3a2ac5-e77e9e7e/','','', include_stats=True
2024-04-02T21:51:07.013631109Z INFO> list_keys('db/dc3a2ac5-e77e9e7e/','','' include_stats=True, bucket=public, callback set
2024-04-02T21:51:07.052441461Z INFO> Sleeping to make room for other tasks, len(files) = 1000
2024-04-02T21:51:07.052477017Z INFO> bucketGC - sleep: 10
2024-04-02T21:51:07.052480662Z INFO> Woke up after sleep, len(files) = 1000
2024-04-02T21:51:07.053200613Z INFO> Sleeping to make room for other tasks, len(files) = 2000
2024-04-02T21:51:07.053255211Z INFO> Woke up after sleep, len(files) = 2000
2024-04-02T21:51:07.054526602Z INFO> Sleeping to make room for other tasks, len(files) = 3000
2024-04-02T21:51:07.054540778Z INFO> Woke up after sleep, len(files) = 3000
...
2024-04-02T21:51:07.117914336Z INFO> Sleeping to make room for other tasks, len(files) = 39000
2024-04-02T21:51:07.117923630Z INFO> Woke up after sleep, len(files) = 39000
2024-04-02T21:51:07.119697133Z INFO> Sleeping to make room for other tasks, len(files) = 40000
2024-04-02T21:51:07.119706384Z INFO> Woke up after sleep, len(files) = 40000
2024-04-02T21:51:07.122643356Z INFO> _getFileStats(/data/public/db/dc3a2ac5-e77e9e7e/.group.json) returning: {'ETag': 'd7ac024253a164a810aeddae3f35748a', 'Size': 424, 'LastModified': 1712093656.8340442}
2024-04-02T21:51:07.123250149Z INFO> _getFileStats(/data/public/db/dc3a2ac5-e77e9e7e/g/5ce0-878d2f-cd7156/.group.json) returning: {'ETag': 'e2a5d8600fbfa1535a5c4443f4deebac', 'Size': 789, 'LastModified': 1712093656.9020455}
2024-04-02T21:51:07.123716373Z INFO> _getFileStats(/data/public/db/dc3a2ac5-e77e9e7e/g/8b75-fdd920-e72fff/.group.json) returning: {'ETag': 'b0c74b097a19d2c3ce4065a315082f93', 'Size': 315, 'LastModified': 1712093656.8980453}
...
2024-04-02T21:51:43.547917647Z INFO> _getFileStats(/data/public/db/dc3a2ac5-e77e9e7e/d/b667-402278-d0c31e/.dataset.json) returning: {'ETag': '5395070236c955fa4b92ae1a533fac5d', 'Size': 606, 'LastModified': 171209
3656.5580394}
2024-04-02T21:51:43.548071056Z INFO> _getFileStats(/data/public/db/dc3a2ac5-e77e9e7e/d/4255-2b1ce7-86436c/.dataset.json) returning: {'ETag': '2327d31616d0d06bed4cb5f3b73fc151', 'Size': 601, 'LastModified': 171209
3656.3460357}
2024-04-02T21:51:43.548403789Z INFO> _getFileStats(/data/public/db/dc3a2ac5-e77e9e7e/d/165f-7c2031-fc128f/.dataset.json) returning: {'ETag': '0df793bea6f577d6ba56c758c1b3d35a', 'Size': 606, 'LastModified': 171209
3655.9380286}
2024-04-02T21:51:44.385422497Z INFO> listKeys done, got 40018 keys
2024-04-02T21:51:44.385908904Z INFO> getStorKeys done for prefix: db/dc3a2ac5-e77e9e7e/
2024-04-02T21:51:44.385928357Z INFO> scanRoot - got 40018 keys for rootid: g-dc3a2ac5-e77e9e7e-54b2-a24d6f-f616f6
2024-04-02T21:51:44.385933526Z INFO> got dataset: d-dc3a2ac5-e77e9e7e-e101-bff066-28ed58: {'lastModified': 1712093655.8820276, 'num_chunks': 0, 'allocated_bytes': 0, 'logical_bytes': 0, 'linked_bytes': 0, 'num_linked_chunks': 0}
2024-04-02T21:51:44.385980976Z INFO> getStorJSONObj(public)/db/dc3a2ac5-e77e9e7e/d/e101-bff066-28ed58/.dataset.json
2024-04-02T21:51:44.386015939Z INFO> get_object - filepath: /data/public/db/dc3a2ac5-e77e9e7e/d/e101-bff066-28ed58/.dataset.json
2024-04-02T21:51:44.387100131Z INFO> healthCheck - node_state: READY

All of these sleeps happened before the file stats calls were executed, instead of in between them! I don't think that's the expected behavior...

assaron commented 2 months ago

I guess these sleeps should actually be done in the later loop. Let me try to do that.
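
Conceptually the change is just to yield inside the loop that does the expensive per-key work (the _getFileStats calls) rather than only in the loop that builds the file list - something along these lines (illustrative, not the exact fileClient.py code):

import asyncio

# Illustrative only: yield periodically in the loop doing the expensive
# per-key work, so tasks like healthCheck get a chance to run.
async def gather_stats(filenames, get_file_stats):
    key_names = {}
    for count, name in enumerate(filenames):
        key_names[name] = get_file_stats(name)  # blocking per-key stat call
        if count % 1000 == 0:
            await asyncio.sleep(0)              # let other tasks run
    return key_names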

assaron commented 2 months ago

@jreadey indeed, the sleep in the later for loop fixed the problem completely. There is no downtime at all, and hsload finished on the file with the default 3 retries!

I've made a PR with the patch: https://github.com/HDFGroup/hsds/pull/337

jreadey commented 2 months ago

Thanks for the PR. Change is merged into master.