HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

Error regarding FileClient #107

Closed bilalshaikh42 closed 2 years ago

bilalshaikh42 commented 2 years ago

Hello,

We are receiving 500 errors for every path that we query, except for /info. Our backend is an S3 bucket, which seems to be working fine. Here is the error message:


ERROR> FileClient init: root_dir config not set
ERROR> FileClient init: root_dir config not set
ERROR> FileClient init: root_dir config not set
ERROR> FileClient init: root_dir config not set
ERROR> FileClient init: root_dir config not set
ERROR> FileClient init: root_dir config not set

We are running tag v0.7beta8 on K8s, though we had the issue with v0.7.0beta7 as well. No changes were made to the config, though the cluster was redeployed.

Any pointers on how to debug?

jreadey commented 2 years ago

Looks like HSDS is trying to use posix storage rather than S3. Are you setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables? See the AWS deployment yaml: https://github.com/HDFGroup/hsds/blob/master/admin/kubernetes/k8s_deployment_aws.yml
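For reference, in a typical Kubernetes deployment those variables are injected into the sn/dn containers from a secret along these lines (a sketch only; the secret key names are assumptions, so check the linked yaml for the exact spelling):

env:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: aws-auth-keys        # secret name as mentioned in the next comment
        key: aws_access_key_id     # key name is an assumption
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: aws-auth-keys
        key: aws_secret_access_key # key name is an assumption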

jreadey commented 2 years ago

Note that the AWS keys are set using the aws-auth-keys secret. Do you have these set? $ kubectl describe secrets aws-auth-keys.
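If the secret does not exist, it could be created from a manifest roughly like this (hypothetical sketch; the data key names are assumptions and must match whatever the deployment yaml references):

apiVersion: v1
kind: Secret
metadata:
  name: aws-auth-keys
type: Opaque
stringData:
  aws_access_key_id: <your access key>       # assumed key name
  aws_secret_access_key: <your secret key>   # assumed key name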

Not related to the error you are seeing, but note that in v0.7beta8 the HEAD_PORT env variable needs to be set to null (as in the above yaml). This is used to toggle between deployments where the pods work together (HEAD_PORT is null) and where they run independently (HEAD_PORT points to a head container on the pod).
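In the container spec that looks roughly like the following (sketch; an empty value plays the role of null here):

env:
  - name: HEAD_PORT
    value: ""   # empty/null: pods coordinate with each other, no head container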

bilalshaikh42 commented 2 years ago

Hello, sorry for the delay. We have the aws-auth-keys secret set. This has not changed recently. Here is the config we have currently.

HSDS configuration
allow_noauth: true # enable unauthenticated requests
auth_expiration: -1 # set an expiration for credential caching
default_public: false # new domains are publicly readable by default
aws_access_key_id: null # Replace with access key for account or use aws_iam_role
aws_secret_access_key: null # Replace with secret key for account
aws_iam_role: null # For EC2 using IAM roles
aws_region: null
hsds_endpoint: https://data.biosimulations.dev # used for hateoas links in response
aws_s3_gateway: http://s3low.scality.uchc.edu # use endpoint for the region HSDS is running in, e.g. 'https://s3.amazonaws.com' for us-east-1
aws_dynamodb_gateway: null # use for dynamodb endpoint, e.g. 'https://dynamodb.us-east-1.amazonaws.com',
aws_lambda_gateway: null #  use lambda endpoint for region HSDS is running in.  See: https://docs.aws.amazon.com/general/latest/gr/lambda-service.html
aws_dynamodb_users_table: null # set to table name if lambda is used to store usernames and passwords
aws_lambda_chunkread_function: null # name of aws lambda function for chunk reading
aws_lambda_threshold: 4 # number of chunks per node per request to reach before using lambda
aws_lambda_max_invoke: 1000 # max number of lambda functions to invoke simultaneously
azure_connection_string: null # use for connecting to Azure blob storage
azure_resource_id: null # resource id for use with Azure Active Directory
azure_storage_account: null # storage account to use on Azure
azure_resource_group: null # Azure resource group the container (BUCKET_NAME) belongs to
root_dir: / # base directory to use for Posix storage
password_salt: null # salt value if dynamically generated passwords are used
bucket_name: biosimdev # set to use a default bucket, otherwise bucket param is needed for all requests
head_port: 5100 # port to use for head node
head_ram: 512m # memory for head container
dn_port: 6101 # Start dn ports at 6101
dn_ram: 3g # memory for DN container (per container)
sn_port: 5101 # Start sn ports at 5101
sn_ram: 1g # memory for SN container
rangeget_port: 6900 # singleton proxy at port 6900
rangeget_ram: 1g # memory for RANGEGET container
target_sn_count: 0 # desired number of SN containers
target_dn_count: 0 # desired number of DN containers
log_level: WARNING # log level.  One of ERROR, WARNING, INFO, DEBUG
log_prefix: null
max_tcp_connections: 100 # max number of inflight tcp connections
head_sleep_time: 10 # max sleep time between health checks for head node
node_sleep_time: 10 # max sleep time between health checks for SN/DN nodes
async_sleep_time: 10 # max sleep time between async task runs
s3_sync_interval: 1 # time to wait to write object data to S3 (in sec)
s3_sync_task_timeout: 10 # time to cancel write task if no response
max_pending_write_requests: 20 # maximum number of inflight write requests
flush_sleep_interval: 1 # time to wait between checking on dirty objects
max_chunks_per_request: 2000 # maximum number of chunks to be serviced by one request
min_chunk_size: 1m # 1 MB
max_chunk_size: 4m # 4 MB
max_request_size: 100m # 100 MB - should be no smaller than client_max_body_size in nginx tmpl
max_chunks_per_folder: 0 # max number of chunks per s3 folder. 0 for unlimited
max_task_count: 200 # maximum number of concurrent tasks before server will return 503 error
aio_max_pool_connections: 64 # number of connections to keep in connection pool for aiobotocore requests
metadata_mem_cache_size: 128m # 128 MB - metadata cache size per DN node
metadata_mem_cache_expire: 3600 # expire cache items after one hour
chunk_mem_cache_size: 128m # 128 MB - chunk cache size per DN node
chunk_mem_cache_expire: 3600 # expire cache items after one hour
data_cache_size: 128m # cache for rangegets
data_cache_expire_time: 3600 # expire cache items after one hour
data_cache_page_size: 4m # page size for range get cache, set to zero to disable proxy
data_cache_max_concurrent_read: 32 # maximum number of inflight storage read requests
timeout: 40 # http timeout (in sec)
password_file: "/config/passwd.txt" # filepath to a text file of username/passwords. set to '' for no-auth access
groups_file: /config/groups.txt # filepath to text file defining user groups
server_name: BioSimulations Data Service # this gets returned in the about request
greeting: Thank you for using BioSimulations!
about: The BioSimulations Data Service uses HSDS to allow for efficient and performant operations on simulation data.
top_level_domains: [] # list of possible top-level domains, example: ["/home", "/shared"], if empty all top-level folders in default bucket will be returned
cors_domain: "*" # domains allowed for CORS
admin_user: admin # user with admin privileges
admin_group: null # enable admin privileges for any user in this group
openid_provider: azure # OpenID authentication provider
openid_url: null # OpenID connect endpoint if provider is not azure or google
openid_audience: null # OpenID audience. This is synonymous with azure_resource_id for azure
openid_claims: unique_name,appid,roles # Comma separated list of claims to resolve to usernames.
chaos_die: 0 # if > 0, have nodes randomly die after n seconds (for testing)
standalone_app: false # True when run as a single application
blosc_nthreads: 2 # number of threads to use for blosc compression.  Set to 0 to have blosc auto-determine thread count
http_compression: false # Use HTTP compression
k8s_app_label: hsds # The app label for k8s deployments
k8s_namespace: dev # Specifies whether the client should be limited to a specific namespace. Useful for some RBAC configurations.
restart_policy: always # Docker restart policy
domain_req_max_objects_limit: 500 # maximum number of objects to return in GET domain request with use_cache

The AWS info is set to null here, but we do have the secrets set as mentioned. This also has not changed recently.

jreadey commented 2 years ago

You have root_dir set to '/'. HSDS sees that as an indication to use posix storage. Try modifying your config to use null.
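In other words, in the config:

root_dir: null # base directory to use for Posix storage; leave null when using S3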

bilalshaikh42 commented 2 years ago

Sorry, that was set just to debug the error. It was originally null when the error appeared.

jreadey commented 2 years ago

Try setting LOG_LEVEL to DEBUG and re-deploying HSDS. Watch the dn log (kubectl logs -f <pod-id> -c dn). At the first storage access you should see one of the _getStorageClient logs here: https://github.com/HDFGroup/hsds/blob/master/hsds/util/storUtil.py#L81. If you see "...getting FileClient" it would indicate the aws_s3_gateway is not getting set somehow.

bilalshaikh42 commented 2 years ago

Here is a more specific log of when the error occurs. I believe this indicates that the S3 bucket is not being set properly. Line 17 indicates that there is no bucket, but then line 15 does show the current bucket.

REQ> GET: /domains [biosimdev/results/6192cd87c6677ba77df31bc7]
21 INFO> get_metadata_obj: biosimdev/results/6192cd87c6677ba77df31bc7 bucket: None
20 ERROR> FileClient init: root_dir config not set
19 WARN> HTTPInternalServerError error for biosimdev/results/6192cd87c6677ba77df31bc7 bucket:biosimdev s3key: results/6192cd87c6677ba77df31bc7/.domain.json
18 REQ> GET: /domains [biosimdev/results/6192cd87c6677ba77df31bc7]
17 INFO> get_metadata_obj: biosimdev/results/6192cd87c6677ba77df31bc7 bucket: None
16 ERROR> FileClient init: root_dir config not set
15 WARN> HTTPInternalServerError error for biosimdev/results/6192cd87c6677ba77df31bc7 bucket:biosimdev s3key: results/6192cd87c6677ba77df31bc7/.domain.json

Here is the domain file that the server seems to be trying to access

{"root": "g-b040f90c-8fa6dbcd-38c8-718407-2e5345", "owner": "biosimulations", "acls": {"biosimulations": {"create": true, "read": true, "update": true, "delete": true, "readACL": true, "updateACL": true}, "default": {"create": false, "read": true, "update": false, "delete": false, "readACL": false, "updateACL": false}}, "created": 1637010979.9793882, "lastModified": 1637010979.9793882}

bilalshaikh42 commented 2 years ago

DEBUG

Should I set the log_level as an env variable on the container, or in the configuration?

In case it helps, here is the deployment that we are working on: https://github.com/biosimulations/deployment/blob/main/base/apps/hsds-service.yaml

And here is the config: https://github.com/biosimulations/deployment/blob/main/config/dev/hsds/config.yml

jreadey commented 2 years ago

Either way will be fine.
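For example, either of these would do (sketches):

# as a container env var in the deployment spec...
env:
  - name: LOG_LEVEL
    value: DEBUG

# ...or in the mounted config/override yml
log_level: DEBUG # one of ERROR, WARNING, INFO, DEBUG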

bilalshaikh42 commented 2 years ago

hsds entrypoint
node type:  dn
running hsds-datanode
INFO> Data node initializing
INFO> data node initializing
Error applying command line override value for key: head_port: invalid literal for int() with base 10: ''
DEBUG> Using metadata memory cache size of: 134217728
DEBUG> Setting metadata cache expire time to: 3600
DEBUG> Using chunk memory cache size of: 134217728
DEBUG> Setting chunk cache expire time to: 3600
DEBUG> Setting blosc nthreads to: 2
INFO> Application baseInit
INFO> using node port: 6101
INFO> setting node_id to: dn-0042e
INFO> using bucket: hsdstest
INFO> running in kubernetes
INFO> using node port: 6101
INFO> aws_iam_role set to: hsds_role
INFO> aws_secret_access_key set
INFO> aws_access_key_id set
INFO> aws_region set to: us-east-1
INFO> run_app on port: 6101
INFO> s3sync - clusterstate is not ready, sleeping
INFO> bucketScan start
INFO> scan_wait_time: 60
INFO> bucketScan waiting for Node state to be READY
INFO> bucketGC start
INFO> async_sleep_time: 10
INFO> bucketGC - waiting for Node state to be READY
======== Running on http://0.0.0.0:6101 ========
(Press CTRL+C to quit)
INFO> health check start
INFO> healthCheck - node_state: INITIALIZING
INFO> k8s_update_dn_info
INFO> s3sync - clusterstate is not ready, sleeping
INFO> http_get status for k8s pods: 200 for req: https://kubernetes.default.svc/api/v1/namespaces/dev/pods
WARN> _k8sGetPodIPs - no app label
WARN> _k8sGetPodIPs - no app label
WARN> _k8sGetPodIPs - no app label
INFO> gotPodIps: ['10.20.0.8', '10.20.3.54', '10.20.5.165', '10.20.4.178', '10.20.2.175', '10.20.5.166', '10.20.2.176', '10.20.0.9', '10.20.4.179', '10.20.3.55']
INFO> http_get('http://10.20.0.8:6101/info')
INFO> Initiating TCPConnector for http://10.20.0.8:6101/info with limit 100 connections
INFO> http_get status: 200 for req: http://10.20.0.8:6101/info
INFO> http_get('http://10.20.0.9:6101/info')
INFO> http_get status: 200 for req: http://10.20.0.9:6101/info
INFO> http_get('http://10.20.2.175:6101/info')
INFO> http_get status: 200 for req: http://10.20.2.175:6101/info
INFO> http_get('http://10.20.2.176:6101/info')
INFO> http_get status: 200 for req: http://10.20.2.176:6101/info
INFO> http_get('http://10.20.3.54:6101/info')
INFO> http_get status: 200 for req: http://10.20.3.54:6101/info
INFO> http_get('http://10.20.3.55:6101/info')
INFO> http_get status: 200 for req: http://10.20.3.55:6101/info
INFO> http_get('http://10.20.4.178:6101/info')
INFO> http_get status: 200 for req: http://10.20.4.178:6101/info
INFO> http_get('http://10.20.4.179:6101/info')
INFO> http_get status: 200 for req: http://10.20.4.179:6101/info
INFO> http_get('http://10.20.5.165:6101/info')
INFO> http_get status: 200 for req: http://10.20.5.165:6101/info
INFO> http_get('http://10.20.5.166:6101/info')
REQ> GET: /info [10.20.5.166:6101]
INFO RSP> <200> (OK): /info
INFO> http_get status: 200 for req: http://10.20.5.166:6101/info
INFO> node_info check dn_ids: ['dn-fee43', 'dn-5a98e', 'dn-8a9ea', 'dn-9efae', 'dn-b213d', 'dn-08c4f', 'dn-2fac9', 'dn-b42db', 'dn-5cc42', 'dn-0042e']
INFO> update_dn_info - dn_nodes: {'dn-8a9ea', 'dn-2fac9', 'dn-5a98e', 'dn-08c4f', 'dn-9efae', 'dn-b42db', 'dn-5cc42', 'dn-0042e', 'dn-fee43', 'dn-b213d'} are now active
INFO> node_number has changed - old value was -1 new number is 9
INFO> setting node_number to: 9, node_state to READY

jreadey commented 2 years ago

The way the code works is that the request handlers call hsds/util/storUtil.py methods for reading and writing to storage. storUtil provides a common interface to the different storage classes. When first called, storUtil will instantiate one of S3Client, AzureBlobClient, or FileClient (defined in s3Client.py, azureBlobClient.py, or fileClient.py) based on the config settings. I.e. if AWS_S3_GATEWAY is set, it uses the S3Client.

Anyway, if you see log entries like "FileClient..." it indicates that the wrong driver is being used, not a problem with the interaction with S3.
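Summarized in config terms (a simplified view of the selection described above; the example values come from the config posted earlier):

aws_s3_gateway: http://s3low.scality.uchc.edu # set -> S3Client is used
azure_connection_string: null                 # set -> AzureBlobClient is used
root_dir: null                                # set -> FileClient (posix) is used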

jreadey commented 2 years ago

Try something that needs to read/write to storage to get the StorUtil initialization...

bilalshaikh42 commented 2 years ago

It seems that the log_level is not being loaded from the configuration: it was previously set to WARNING and is now DEBUG, but we are still getting INFO messages as well. Is this correct?

jreadey commented 2 years ago

Yes. DEBUG will trigger all higher log levels

jreadey commented 2 years ago

BTW, was your HSDS setup working before? Did something change?

bilalshaikh42 commented 2 years ago

BTW, was your HSDS setup working before? Did something change?

This is the puzzling part. It was working just fine. No changes to the setup, but everything was freshly deployed

bilalshaikh42 commented 2 years ago

Here are logs from attempting to run hsls.

SN:

REQ> GET: / [/results/6195414bbd45a4fd3bacabf5]
DEBUG> num tasks: 6 active tasks: 6
DEBUG> using basic authorization
DEBUG> validateUserPassword username: biosimulations
DEBUG> looking up username: biosimulations
DEBUG> user password validated
DEBUG> GET_Domain domain: biosimdev/results/6195414bbd45a4fd3bacabf5 bucket: biosimdev
INFO> got domain: biosimdev/results/6195414bbd45a4fd3bacabf5
INFO> getDomainJson(biosimdev/results/6195414bbd45a4fd3bacabf5, reload=True)
DEBUG> got dn_url: http://10.20.4.179:6101 for obj_id: biosimdev/results/6195414bbd45a4fd3bacabf5
DEBUG> sending dn req: http://10.20.4.179:6101/domains params: {'domain': 'biosimdev/results/6195414bbd45a4fd3bacabf5'}
INFO> http_get('http://10.20.4.179:6101/domains')
DEBUG> get_http_client, url: http://10.20.4.179:6101/domains
INFO> http_get status: 500 for req: http://10.20.4.179:6101/domains
ERROR> request to http://10.20.4.179:6101/domains failed with code: 500

DN:

INFO> gotPodIps: ['10.20.4.179']
INFO> http_get('http://10.20.4.179:6101/info')
REQ> GET: /info [10.20.4.179:6101]
INFO RSP> <200> (OK): /info
INFO> http_get status: 200 for req: http://10.20.4.179:6101/info
INFO> node_info check dn_ids: ['dn-b42db']
REQ> GET: /info [10.20.4.179:6101]
INFO RSP> <200> (OK): /info
INFO> s3sync nothing to update
INFO> s3syncCheck no objects to write, sleeping for 1
INFO> s3sync nothing to update
INFO> s3syncCheck no objects to write, sleeping for 1

bilalshaikh42 commented 2 years ago

And the result of the command itself:

hsls -v -H -e https://data.biosimulations.dev -r -u biosimulations -p XXXX
error getting domain: 
HTTPSConnectionPool(host='data.biosimulations.dev', port=443): Max retries exceeded with url: /domains (Caused by ResponseError('too many 500 error responses',))

jreadey commented 2 years ago

I don't see any errors in the DN output. Are you running with more than 1 pod?

bilalshaikh42 commented 2 years ago

This is scaled down to just one pod. I am able to replicate the issue. The SN claims that the DN returned a 500, but the DN does not log any errors.

bilalshaikh42 commented 2 years ago

Sorry, that is incorrect. The logs were just scrolling too fast. Here is the DN log:

INFO> s3sync nothing to update
INFO> s3syncCheck no objects to write, sleeping for 1
REQ> GET: /domains [biosimdev/results/619541465b88db270f417fbe]
INFO> get_metadata_obj: biosimdev/results/619541465b88db270f417fbe bucket: None
ERROR> FileClient init: root_dir config not set
WARN> HTTPInternalServerError error for biosimdev/results/619541465b88db270f417fbe bucket:biosimdev s3key: results/619541465b88db270f417fbe/.domain.json
REQ> GET: /domains [biosimdev/results/619541465b88db270f417fbe]
INFO> get_metadata_obj: biosimdev/results/619541465b88db270f417fbe bucket: None
ERROR> FileClient init: root_dir config not set
WARN> HTTPInternalServerError error for biosimdev/results/619541465b88db270f417fbe bucket:biosimdev s3key: results/619541465b88db270f417fbe/.domain.json
INFO> s3sync nothing to update
INFO> s3syncCheck no objects to write, sleeping for 1
REQ> GET: /domains [biosimdev/results/619541465b88db270f417fbe]
INFO> get_metadata_obj: biosimdev/results/619541465b88db270f417fbe bucket: None
ERROR> FileClient init: root_dir config not set
WARN> HTTPInternalServerError error for biosimdev/results/619541465b88db270f417fbe bucket:biosimdev s3key: results/619541465b88db270f417fbe/.domain.json
REQ> GET: /domains [biosimdev/results/619541465b88db270f417fbe]
INFO> get_metadata_obj: biosimdev/results/619541465b88db270f417fbe bucket: None
ERROR> FileClient init: root_dir config not set
WARN> HTTPInternalServerError error for biosimdev/results/619541465b88db270f417fbe bucket:biosimdev s3key: results/619541465b88db270f417fbe/.domain.json
INFO> s3sync nothing to update
INFO> s3syncCheck no objects to write, sleeping for 1
INFO> s3sync nothing to update
INFO> s3syncCheck no objects to write, sleeping for 1

jreadey commented 2 years ago

It certainly looks like HSDS is not seeing the AWS_S3_GATEWAY config. Try this... exec into the pod:

$ kubectl exec -it <pod_id> -c dn -- bash

Then check the AWS settings:

# grep -i aws /config/*.yml

I'm expecting aws_s3_gateway to be null in config.yml but set to the Scality endpoint in override.yml. While you are there, check the environment variable settings:

# env | grep -i AWS
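For comparison, a working Scality setup would be expected to show something like this in the override (a sketch based on the config posted earlier, not necessarily your actual file):

# /config/override.yml (expected)
aws_s3_gateway: http://s3low.scality.uchc.edu
bucket_name: biosimdev
hsds_endpoint: https://data.biosimulations.dev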

bilalshaikh42 commented 2 years ago

Thank you so much for your time on this!

That last command revealed the issue, and I feel quite silly. Our config file was being mounted as config.yaml instead of config.yml 🤦🏾‍♂️. It seems only the latter works. I created a quick PR to prevent this in the future if you feel it's helpful (#108). Of course, feel free to close it if you prefer only one option.

We are also manually mounting in an override.yml file. This is probably because I copied it from some template at some point. Should that be mounted in, or is it generated automatically?

jreadey commented 2 years ago

Glad we were able to figure out the problem!

There are a few different ways to mount the yml (or yaml) files. I think at one point the deployment examples used the subPath syntax, but currently it's just:

volumeMounts:
  - name: config
    mountPath: "/config"

where the volume points to the config map:

- name: config
  configMap:
    name: hsds-config

And the config map is created with multiple --from-file args:

kubectl create configmap hsds-config --from-file=admin/config/config.yml --from-file=admin/config/override.yml

Hope that helps!