HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

HSDS Posix on openshift, missing bucket? #75

Closed: itsMeBrice closed this issue 3 years ago

itsMeBrice commented 3 years ago

I'm trying to deploy an HSDS server onto an OpenShift instance. The HSDS server should store its data POSIX-style on a persistent volume claim (PVC). When I try to list the domains, run hsinfo, or create an HSDS file, I get responses from the server that seem to point to a missing bucket:

Request GET Domains
Response 404 Not found
Server Response

REQ> GET: /domains [#URL#]
DEBUG> num tasks: 15 active tasks: 7
DEBUG> no Authorization in header
INFO> get_domains for: / verbose: False
DEBUG> get_domains - no limit
INFO> get_domains - prefix: / bucket: hsds
DEBUG> get_domains - listing S3 keys for
DEBUG> _getStorageClient getting FileClient
INFO> getStorKeys('','/','', include_stats=False
INFO> list_keys('','/','', include_stats=False, bucket=hsds
DEBUG> fileClient listKeys for directory: /data/hsds
WARN> listkeys - /data/hsds not found

Command hsinfo
Response:

server name: Highly Scalable Data Service (HSDS)
server state: READY
endpoint: #URL#
username: #USER#
password: #PW#
Error: [Errno 404] Not Found

Server Response:

DEBUG> info request
INFO RSP> <200> (OK): /info
REQ> GET: /about [#URL#]
DEBUG> num tasks: 9 active tasks: 6
DEBUG> validateUserPassword username: #USER#
DEBUG> looking up username: #USER#
DEBUG> user password validated
INFO RSP> <200> (OK): /about
REQ> GET: / [hsds/home]
DEBUG> num tasks: 9 active tasks: 6
DEBUG> validateUserPassword username: #USER#
DEBUG> looking up username: #USER#
DEBUG> user password validated
DEBUG> GET_Domain domain: hsds/home bucket: hsds
INFO> got domain: hsds/home
INFO> getDomainJson(hsds/home, reload=True)
DEBUG> LRU ChunkCache node hsds/home removed from ChunkCache
DEBUG> ID hsds/home resolved to data node 0, out of 1 data partitions.
DEBUG> got dn_url: http://#IP#:6101 for obj_id: hsds/home
DEBUG> sending dn req: http://#IP#:6101/domains params: {'domain': 'hsds/home'}
INFO> http_get('http://#IP#:6101/domains' )
INFO> http_get status: 200
DEBUG> setitem, key: hsds/home
DEBUG> LRU ChunkCache adding 1024 to cache, mem_size is now: 1024
DEBUG> LRU ChunkCache added new node: hsds/home [1024 bytes]
DEBUG> got domain_json: {'owner': '#USER#', 'acls': {'#USER#': {'create': True, 'read': True, 'update': True, 'delete': True, 'readACL': True, 'updateACL': True}, 'default': {'create': False, 'read': True, 'update': False, 'delete': False, 'readACL': False, 'updateACL': False}}, 'created': 1605698579.8284595, 'lastModified': 1605698579.8284595}
INFO> aclCheck: read for user: #USER#
DEBUG> href parent domain: hsds/
INFO RSP> <200> (OK): /
REQ> GET: /domains [hsds/home/]
DEBUG> num tasks: 9 active tasks: 6
DEBUG> validateUserPassword username: #USER#
DEBUG> looking up username: #USER#
DEBUG> user password validated
INFO> get_domains for: /home/ verbose: True
DEBUG> get_domains - using Limit: 1000
INFO> get_domains - prefix: /home/ bucket: hsds
DEBUG> get_domains - listing S3 keys for home/
DEBUG> _getStorageClient getting FileClient
INFO> getStorKeys('home/','/','', include_stats=False
INFO> list_keys('home/','/','', include_stats=False, bucket=hsds
DEBUG> fileClient listKeys for directory: /data/hsds/home/
WARN> listkeys - /data/hsds/home/ not found

Command hstouch -u #USER# -p #PW# -u #USER# /home/#USER#/test.h5
Server Response:

REQ> GET: / [hsds/home/#USER#]
DEBUG> num tasks: 12 active tasks: 6
DEBUG> validateUserPassword username: #USER#
DEBUG> looking up username: #USER#
DEBUG> user password validated
DEBUG> GET_Domain domain: hsds/home/#USER# bucket: hsds
INFO> got domain: hsds/home/#USER#
INFO> getDomainJson(hsds/home/#USER#, reload=True)
DEBUG> LRU ChunkCache node hsds/home/#USER# removed from ChunkCache
DEBUG> ID hsds/home/#USER# resolved to data node 0, out of 1 data partitions.
DEBUG> got dn_url: http://#IP#:6101 for obj_id: hsds/home/#USER#
DEBUG> sending dn req: http://#IP#:6101/domains params: {'domain': 'hsds/home/#USER#'}
INFO> http_get('http://#IP#:6101/domains' )
INFO> http_get status: 200
DEBUG> setitem, key: hsds/home/#USER#
DEBUG> LRU ChunkCache adding 1024 to cache, mem_size is now: 2048
DEBUG> LRU ChunkCache added new node: hsds/home/#USER# [1024 bytes]
DEBUG> got domain_json: {'owner': '#USER#', 'acls': {'#USER#': {'create': True, 'read': True, 'update': True, 'delete': True, 'readACL': True, 'updateACL': True}, 'default': {'create': False, 'read': True, 'update': False, 'delete': False, 'readACL': False, 'updateACL': False}}, 'created': 1605702901.1704721, 'lastModified': 1605702901.1704721}
INFO> aclCheck: read for user: #USER#
DEBUG> href parent domain: hsds/home
INFO RSP> <200> (OK): /
REQ> GET: / [hsds/home/#USER#/test.h5]
DEBUG> num tasks: 12 active tasks: 6
DEBUG> validateUserPassword username: #USER#
DEBUG> looking up username: #USER#
DEBUG> user password validated
DEBUG> GET_Domain domain: hsds/home/#USER#/test.h5 bucket: hsds
INFO> got domain: hsds/home/#USER#/test.h5
INFO> getDomainJson(hsds/home/#USER#/test.h5, reload=True)
DEBUG> ID hsds/home/#USER#/test.h5 resolved to data node 0, out of 1 data partitions.
DEBUG> got dn_url: http://#IP#:6101 for obj_id: hsds/home/#USER#/test.h5
DEBUG> sending dn req: http://#IP#:6101/domains params: {'domain': 'hsds/home/#USER#/test.h5'}
INFO> http_get('http://#IP#:6101/domains' )
INFO> http_get status: 500
WARN> request to http://#IP#:6101/domains failed with code: 500
ERROR> Error for http_get_json(http://#IP#:6101/domains): 500
REQ> GET: / [hsds/home/#USER#/test.h5]
DEBUG> num tasks: 12 active tasks: 6
DEBUG> validateUserPassword username: #USER#
DEBUG> looking up username: #USER#
DEBUG> user password validated
DEBUG> GET_Domain domain: hsds/home/#USER#/test.h5 bucket: hsds
INFO> got domain: hsds/home/#USER#/test.h5
INFO> getDomainJson(hsds/home/#USER#/test.h5, reload=True)
DEBUG> ID hsds/home/#USER#/test.h5 resolved to data node 0, out of 1 data partitions.
DEBUG> got dn_url: http://#IP#:6101 for obj_id: hsds/home/#USER#/test.h5
DEBUG> sending dn req: http://#IP#:6101/domains params: {'domain': 'hsds/home/#USER#/test.h5'}
INFO> http_get('http://#IP#:6101/domains' )
INFO> http_get status: 500
WARN> request to http://#IP#:6101/domains failed with code: 500
ERROR> Error for http_get_json(http://#IP#:6101/domains): 500
REQ> GET: / [hsds/home/#USER#/test.h5]
DEBUG> num tasks: 12 active tasks: 6
DEBUG> validateUserPassword username: #USER#
DEBUG> looking up username: #USER#
DEBUG> user password validated
DEBUG> GET_Domain domain: hsds/home/#USER#/test.h5 bucket: hsds
INFO> got domain: hsds/home/#USER#/test.h5
INFO> getDomainJson(hsds/home/#USER#/test.h5, reload=True)
DEBUG> ID hsds/home/#USER#/test.h5 resolved to data node 0, out of 1 data partitions.
DEBUG> got dn_url: http://#IP#:6101 for obj_id: hsds/home/#USER#/test.h5
DEBUG> sending dn req: http://#IP#:6101/domains params: {'domain': 'hsds/home/#USER#/test.h5'}
INFO> http_get('http://#IP#:6101/domains' )
INFO> http_get status: 500
WARN> request to http://#IP#:6101/domains failed with code: 500
ERROR> Error for http_get_json(http://#IP#:6101/domains): 500
REQ> GET: / [hsds/home/#USER#/test.h5]
DEBUG> num tasks: 12 active tasks: 6
DEBUG> validateUserPassword username: #USER#
DEBUG> looking up username: #USER#
DEBUG> user password validated
DEBUG> GET_Domain domain: hsds/home/#USER#/test.h5 bucket: hsds
INFO> got domain: hsds/home/#USER#/test.h5
INFO> getDomainJson(hsds/home/#USER#/test.h5, reload=True)
DEBUG> ID hsds/home/#USER#/test.h5 resolved to data node 0, out of 1 data partitions.
DEBUG> got dn_url: http://#IP#:6101 for obj_id: hsds/home/#USER#/test.h5
DEBUG> sending dn req: http://#IP#:6101/domains params: {'domain': 'hsds/home/#USER#/test.h5'}
INFO> http_get('http://#IP#:6101/domains' )
INFO> http_get status: 500
WARN> request to http://#IP#:6101/domains failed with code: 500
ERROR> Error for http_get_json(http://#IP#:6101/domains): 500

I tried the fix described in https://github.com/HDFGroup/hsds/issues/13, but it fails with the response below. I suspect this is because the script is meant for S3 storage. Command: python create_toplevel_domain_json.py --user=#USER# --domain=/home
Server Response:

got environment override for config-dir: ../#USER#/config/
checking config path: ../#USER#/config/config.yml
_load_cfg with '../#USER#/config/config.yml'
got env value override for hsds_endpoint
got env value override for root_dir
got env value override for bucket_name
got env value override for log_level
domain: /home
domain: hsds/home
s3_key: home/.domain.json
DEBUG> _getStorageClient getting FileClient
DEBUG> isStorObj hsds/home/.domain.json
INFO> is_key - filepath: /data/hsds/home/.domain.json
DEBUG> isStorObj home/.domain.json returning False
INFO> writing domain
DEBUG> _getStorageClient getting FileClient
INFO> putS3JSONObj(hsds/home/.domain.json)
WARN> fileClient.put_object - bucket at path: /data/hsds not found
Traceback (most recent call last):
File "create_toplevel_domain_json.py", line 181, in
main()
File "create_toplevel_domain_json.py", line 170, in main
loop.run_until_complete(createDomains(app, usernames, default_perm, domain_name=domain))
File "/usr/local/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "create_toplevel_domain_json.py", line 87, in createDomains
await createDomain(app, domain, domain_json)
File "create_toplevel_domain_json.py", line 104, in createDomain
await putStorJSONObj(app, s3_key, domain_json)
File "/usr/local/lib/python3.8/site-packages/hsds/util/storUtil.py", line 307, in putStorJSONObj
rsp = await client.put_object(key, data, bucket=bucket)
File "/usr/local/lib/python3.8/site-packages/hsds/util/fileClient.py", line 154, in put_object
raise HTTPNotFound()
aiohttp.web_exceptions.HTTPNotFound: Not Found
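
The warning above ("bucket at path: /data/hsds not found") suggests that with the POSIX fileClient the bucket is simply a directory named after the configured bucket_name under ROOT_DIR, and that /data/hsds does not yet exist on the mounted volume. A minimal sketch, assuming the ROOT_DIR and bucket_name values shown in the logs and run inside a pod that has the PVC mounted, to create and verify that directory:

import os

# The env var names and defaults below are assumptions based on the config
# overrides shown in the log output (ROOT_DIR=/data, bucket_name=hsds).
root_dir = os.environ.get("ROOT_DIR", "/data")
bucket = os.environ.get("BUCKET_NAME", "hsds")
bucket_path = os.path.join(root_dir, bucket)

# With the POSIX fileClient the "bucket" is just a directory under ROOT_DIR,
# so it has to exist before HSDS (or this script) can write any keys into it.
os.makedirs(bucket_path, exist_ok=True)
print(bucket_path, "exists:", os.path.isdir(bucket_path))

Once that directory exists, re-running create_toplevel_domain_json.py should be able to write home/.domain.json under it.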

jreadey commented 3 years ago

If you exec into the DN container, do you see the path specified by the ROOT_DIR environment variable?

My first thought, though, is that PVCs would not be suitable for HSDS. Is it possible for multiple pods to have read/write access to a PVC in OpenShift?

Have you investigated the use of Object Storage with OpenShift?

itsMeBrice commented 3 years ago

ROOT_DIR is exactly as configured and points to the mounted persistent volume claim. PVCs in OpenShift have different access modes: Single User (RWO), Shared Access (RWX), and Read Only (ROX). I have configured mine as RWX (ReadWriteMany), which should allow multiple pods to read and write the storage simultaneously. As far as I can tell, object storage in OpenShift mostly works as a layer underneath PVCs for organizing the storage, but I am looking into it a bit more.

jreadey commented 3 years ago

@itsMeBrice - were you able to get this working?

itsMeBrice commented 3 years ago

I was trying to get it to run with the storage on a MinIO instance that is also running on the OpenShift cluster. This appears to work fine for small datasets, but as soon as I save a couple of million datapoints and try to read them back, I run into problems. The last portion of a long dataset is usually readable without any issue, but if I try to read even a single datapoint nearer to the front via h5pyd (I can't say exactly where the boundary between readable and unreadable data lies), I get the following error:

---------------------------------------------------------------------------
MaxRetryError                             Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    438             if not chunked:
--> 439                 resp = conn.urlopen(
    440                     method=request.method,

/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    816             log.debug("Retry: %s", url)
--> 817             return self.urlopen(
    818                 method,

/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    816             log.debug("Retry: %s", url)
--> 817             return self.urlopen(
    818                 method,

/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    816             log.debug("Retry: %s", url)
--> 817             return self.urlopen(
    818                 method,

/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    806             try:
--> 807                 retries = retries.increment(method, url, response=response, _pool=self)
    808             except MaxRetryError:

/opt/conda/lib/python3.8/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    445         if new_retry.is_exhausted():
--> 446             raise MaxRetryError(_pool, url, error or ResponseError(cause))
    447 

MaxRetryError: HTTPConnectionPool(host='vorn-hsds-ccom-hsds-sandbox.appuiodcs1app.ch', port=80): Max retries exceeded with url: /datasets/d-2946a6a4-88833ed7-bdbb-48d0d4-fc0787/value?nonstrict=1&select=%5B4000000%3A8063232%3A1%2C0%3A4%3A1%5D&domain=%2Fhome%2Ftest (Caused by ResponseError('too many 500 error responses'))

During handling of the above exception, another exception occurred:

RetryError                                Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/h5pyd-0.8.0-py3.8.egg/h5pyd/_hl/dataset.py in __getitem__(self, args)
    852                     try:
--> 853                         rsp = self.GET(req, params=params, format="binary")
    854                     except IOError as ioe:

/opt/conda/lib/python3.8/site-packages/h5pyd-0.8.0-py3.8.egg/h5pyd/_hl/base.py in GET(self, req, params, use_cache, format)
    888 
--> 889         rsp = self.id._http_conn.GET(req, params=params, headers=headers, format=format, use_cache=use_cache)
    890         if rsp.status_code != 200:

/opt/conda/lib/python3.8/site-packages/h5pyd-0.8.0-py3.8.egg/h5pyd/_hl/httpconn.py in GET(self, req, format, params, headers, use_cache)
    282             s = self.session
--> 283             rsp = s.get(self._endpoint + req, params=params, headers=headers, auth=auth, verify=self.verifyCert())
    284             self.log.info("status: {}".format(rsp.status_code))

/opt/conda/lib/python3.8/site-packages/requests/sessions.py in get(self, url, **kwargs)
    542         kwargs.setdefault('allow_redirects', True)
--> 543         return self.request('GET', url, **kwargs)
    544 

/opt/conda/lib/python3.8/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    529         send_kwargs.update(settings)
--> 530         resp = self.send(prep, **send_kwargs)
    531 

/opt/conda/lib/python3.8/site-packages/requests/sessions.py in send(self, request, **kwargs)
    642         # Send the request
--> 643         r = adapter.send(request, **kwargs)
    644 

/opt/conda/lib/python3.8/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    506             if isinstance(e.reason, ResponseError):
--> 507                 raise RetryError(e, request=request)
    508 

RetryError: HTTPConnectionPool(host='vorn-hsds-ccom-hsds-sandbox.appuiodcs1app.ch', port=80): Max retries exceeded with url: /datasets/d-2946a6a4-88833ed7-bdbb-48d0d4-fc0787/value?nonstrict=1&select=%5B4000000%3A8063232%3A1%2C0%3A4%3A1%5D&domain=%2Fhome%2Ftest (Caused by ResponseError('too many 500 error responses'))

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-11-3c8e790ddd34> in <module>
----> 5     dataArr = np.array(dataSet[4000000:,:])

/opt/conda/lib/python3.8/site-packages/h5pyd-0.8.0-py3.8.egg/h5pyd/_hl/dataset.py in __getitem__(self, args)
    860                             break
    861                         else:
--> 862                             raise IOError("Error retrieving data: {}".format(ioe.errno))
    863                     if type(rsp) is bytes:
    864                         # got binary response

OSError: Error retrieving data: None

If I try the same dataset on an HSDS instance running natively on POSIX storage, all the data is available.
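
A sketch of how one might narrow down where reads start failing by requesting the data in smaller slices (the dataset name "data" and the step size are placeholders; the domain /home/test is taken from the traceback above):

import h5pyd

# Endpoint and credentials come from ~/.hscfg; the dataset name is a placeholder.
f = h5pyd.File("/home/test", "r")
dset = f["data"]

step = 100_000  # small enough that each request stays well below any proxy limits
for start in range(0, dset.shape[0], step):
    try:
        _ = dset[start:start + step, :]
    except IOError as e:
        print("read failed at rows {}:{}: {}".format(start, start + step, e))
        break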

P.S. Do you have any advice for adding Keycloak? See https://github.com/HDFGroup/hsds/issues/74

jreadey commented 3 years ago

Take a look at the HSDS server logs - there might be some clues there.

I'll do some testing with MinIO and see if I run into any problems. MinIO supports the AWS S3 API, but there can be small details that trip things up.

An advantage of using MinIO is that you'll have replication of all the data (I think each object gets stored on three different disks by default). POSIX storage will likely be faster, but if a PV crashes, you'll lose data. MinIO should also scale better for really large installations (say, more than 50 HSDS nodes).

jreadey commented 3 years ago

I worked with @itsMeBrice offline to set up HSDS with MinIO, and it looks like it's working now. FYI for any other MinIO users: if you are running MinIO behind an NGINX proxy, you'll want to be sure that NGINX isn't blocking larger requests. By default the limit is just 1 MB, so a large hyperslab selection could fail.
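
The NGINX directive involved here is client_max_body_size, which defaults to 1m. A sketch of what the relevant proxy block might look like (the MinIO address and the exact limits are placeholders, not values from this deployment):

# Placeholder reverse-proxy settings for MinIO behind NGINX; adjust to your deployment.
location / {
    proxy_pass           http://minio:9000;   # hypothetical MinIO service address
    client_max_body_size 512m;                # default is 1m; larger S3 PUT bodies (e.g. HSDS chunk objects) would be rejected with 413
    proxy_read_timeout   300s;                # allow long-running requests to complete
}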