If you exec into the DN container, do you see the path specified by the ROOT_DIR environment variable?
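Something like this should show it (the pod name and label are placeholders for whatever your deployment uses, and /data stands in for your configured ROOT_DIR):

oc get pods -l app=hsds-dn                # find a data-node pod (assumed label)
oc exec hsds-dn-0 -- printenv ROOT_DIR    # should print the configured path
oc exec hsds-dn-0 -- ls -ld /data         # confirm the PVC is actually mounted there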
My first thought, though, is that PVCs would not be suitable for HSDS. Is it possible for multiple pods to have read/write access to a PVC in OpenShift?
Have you investigated the use of Object Storage with OpenShift?
The ROOT_DIR is exactly as configured and points to the mounted Persistent Volume Claim. PVCs in OpenShift have different access modes: "Single User (RWO)", "Shared Access (RWX)", and "Read Only (ROX)". I have configured it as RWX (ReadWriteMany), which should be the mode that allows different pods to access and write to the storage simultaneously. As far as I can see, Object Storage in OpenShift mostly works as a layer underneath PVCs for organizing the storage. I am looking into it a bit more, though.
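One way to confirm the mode and binding on the claim itself (the PVC name hsds-data is a placeholder):

oc get pvc hsds-data -o jsonpath='{.spec.accessModes}'   # expect ["ReadWriteMany"]
oc get pvc hsds-data -o jsonpath='{.status.phase}'       # expect Bound

Note that RWX also has to be supported by the underlying storage class, not just requested on the claim.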
@itsMeBrice - were you able to get this working?
I was trying to get it to run with the storage on a Minio instance that is also running on the OpenShift cluster. This appears to work fine for small datasets, but as soon as I save a couple million datapoints and try to read them back, I run into problems. The last portion of a long dataset is usually readable without any problem, but if I try to read via h5pyd even a single datapoint nearer to the front (I can't say exactly where the border is between readable and unreadable data), I get the following error:
---------------------------------------------------------------------------
MaxRetryError Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
438 if not chunked:
--> 439 resp = conn.urlopen(
440 method=request.method,
/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
816 log.debug("Retry: %s", url)
--> 817 return self.urlopen(
818 method,
/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
816 log.debug("Retry: %s", url)
--> 817 return self.urlopen(
818 method,
/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
816 log.debug("Retry: %s", url)
--> 817 return self.urlopen(
818 method,
/opt/conda/lib/python3.8/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
806 try:
--> 807 retries = retries.increment(method, url, response=response, _pool=self)
808 except MaxRetryError:
/opt/conda/lib/python3.8/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
445 if new_retry.is_exhausted():
--> 446 raise MaxRetryError(_pool, url, error or ResponseError(cause))
447
MaxRetryError: HTTPConnectionPool(host='vorn-hsds-ccom-hsds-sandbox.appuiodcs1app.ch', port=80): Max retries exceeded with url: /datasets/d-2946a6a4-88833ed7-bdbb-48d0d4-fc0787/value?nonstrict=1&select=%5B4000000%3A8063232%3A1%2C0%3A4%3A1%5D&domain=%2Fhome%2Ftest (Caused by ResponseError('too many 500 error responses'))
During handling of the above exception, another exception occurred:
RetryError Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/h5pyd-0.8.0-py3.8.egg/h5pyd/_hl/dataset.py in __getitem__(self, args)
852 try:
--> 853 rsp = self.GET(req, params=params, format="binary")
854 except IOError as ioe:
/opt/conda/lib/python3.8/site-packages/h5pyd-0.8.0-py3.8.egg/h5pyd/_hl/base.py in GET(self, req, params, use_cache, format)
888
--> 889 rsp = self.id._http_conn.GET(req, params=params, headers=headers, format=format, use_cache=use_cache)
890 if rsp.status_code != 200:
/opt/conda/lib/python3.8/site-packages/h5pyd-0.8.0-py3.8.egg/h5pyd/_hl/httpconn.py in GET(self, req, format, params, headers, use_cache)
282 s = self.session
--> 283 rsp = s.get(self._endpoint + req, params=params, headers=headers, auth=auth, verify=self.verifyCert())
284 self.log.info("status: {}".format(rsp.status_code))
/opt/conda/lib/python3.8/site-packages/requests/sessions.py in get(self, url, **kwargs)
542 kwargs.setdefault('allow_redirects', True)
--> 543 return self.request('GET', url, **kwargs)
544
/opt/conda/lib/python3.8/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
529 send_kwargs.update(settings)
--> 530 resp = self.send(prep, **send_kwargs)
531
/opt/conda/lib/python3.8/site-packages/requests/sessions.py in send(self, request, **kwargs)
642 # Send the request
--> 643 r = adapter.send(request, **kwargs)
644
/opt/conda/lib/python3.8/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
506 if isinstance(e.reason, ResponseError):
--> 507 raise RetryError(e, request=request)
508
RetryError: HTTPConnectionPool(host='vorn-hsds-ccom-hsds-sandbox.appuiodcs1app.ch', port=80): Max retries exceeded with url: /datasets/d-2946a6a4-88833ed7-bdbb-48d0d4-fc0787/value?nonstrict=1&select=%5B4000000%3A8063232%3A1%2C0%3A4%3A1%5D&domain=%2Fhome%2Ftest (Caused by ResponseError('too many 500 error responses'))
During handling of the above exception, another exception occurred:
OSError Traceback (most recent call last)
<ipython-input-11-3c8e790ddd34> in <module>
----> 5 dataArr = np.array(dataSet[4000000:,:])
/opt/conda/lib/python3.8/site-packages/h5pyd-0.8.0-py3.8.egg/h5pyd/_hl/dataset.py in __getitem__(self, args)
860 break
861 else:
--> 862 raise IOError("Error retrieving data: {}".format(ioe.errno))
863 if type(rsp) is bytes:
864 # got binary response
OSError: Error retrieving data: None
If I try the same dataset on an HSDS instance running natively on POSIX storage, all the data is available.
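For what it's worth, the failing read can be replayed directly against the REST API (endpoint, dataset id, and domain below are taken from the traceback above; the credentials are placeholders), and shrinking the select range makes it easier to narrow down where readable data ends:

curl -u '#USER#:#PW#' -G 'http://vorn-hsds-ccom-hsds-sandbox.appuiodcs1app.ch/datasets/d-2946a6a4-88833ed7-bdbb-48d0d4-fc0787/value' \
  --data-urlencode 'select=[4000000:4000010:1,0:4:1]' \
  --data-urlencode 'domain=/home/test'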
P.S. Do you have any advice for adding Keycloak? See issue -> https://github.com/HDFGroup/hsds/issues/74
Take a look at the HSDS server logs - there might be some clues there.
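For example, something along these lines (the container names sn and dn are an assumption based on the default deployment; adjust to your setup):

oc logs deployment/hsds -c sn --tail=200 | grep -i error   # service node
oc logs deployment/hsds -c dn --tail=200 | grep -i error   # data node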
I'll do some testing with Minio and see if I run into any problems. Minio supports the AWS S3 API, but there can be small details that trip things up.
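A quick sanity check that tends to surface those details is talking to the Minio endpoint with the stock AWS CLI (endpoint URL, bucket name, and credentials below are placeholders):

export AWS_ACCESS_KEY_ID='<access-key>'
export AWS_SECRET_ACCESS_KEY='<secret-key>'
aws --endpoint-url http://minio:9000 s3 ls s3://hsds-bucket/ --recursive | head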
An advantage of using Minio is that you'll have replication of all the data (I think each object gets stored on three different disks by default). POSIX storage will likely be faster, but if a PV crashes, you'll lose data. Also, Minio should scale better for really large installations (say >50 HSDS nodes).
Worked with @itsMeBrice offline to set up HSDS with Minio and it looks like it's working now. FYI for any other Minio users: if you are using Minio with an NGINX proxy, you'll want to be sure that NGINX isn't blocking larger requests. By default the limit is just 1 MB, so a large hyperslab selection could fail.
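A minimal sketch of the relevant setting (1 MB is nginx's documented default for client_max_body_size; the value and reload step below are illustrative):

# in the server or location block of nginx.conf that proxies Minio:
#     client_max_body_size 0;    # 0 disables the limit, or set e.g. 100m
nginx -s reload    # apply the change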
I'm trying to deploy an HSDS server onto an OpenShift instance. The HSDS server should store its data in a POSIX way on a persistent volume claim. When I try to get the domains, run hsinfo, or try to create an HSDS file, I get responses from the server that seem to point to a missing bucket:
Request: GET Domains
Response: 404 Not found
Server Response:
Command: hsinfo
Response:
Server Response:
Command: hstouch -u #USER# -p #PW# -u #USER# /home/#USER#/test.h5
Server Response:
I tried the fix described in https://github.com/HDFGroup/hsds/issues/13, but it fails with the following response. I do think this may be because the script is meant for S3 storage, though. Command:
python create_toplevel_domain_json.py --user=#USER# --domain=/home
Server Response:
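In case it helps anyone else on POSIX storage: the rough equivalent of creating the bucket seems to be creating a directory named after the configured bucket underneath ROOT_DIR (a sketch; /data and hsds-bucket are placeholders for your ROOT_DIR and bucket_name settings):

mkdir -p /data/hsds-bucket
chmod 777 /data/hsds-bucket    # or tighter permissions matching the container's UID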