HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0
127 stars 52 forks source link

Security Vulnerabilities #147

Closed bilalshaikh42 closed 1 year ago

bilalshaikh42 commented 2 years ago

The following security issues should be adressed:

bilalshaikh42 commented 2 years ago

@jreadey https://github.com/HDFGroup/hsds/security/code-scanning/1, https://github.com/HDFGroup/hsds/security/code-scanning/4 and https://github.com/HDFGroup/hsds/security/code-scanning/6 seem like they may be serious security issues. Thoughts?

For https://github.com/HDFGroup/hsds/security/code-scanning/5 this is hashing to determine which node needs to handle a request correct? in that case its okay to use a weak hashing algorithm there and we can mark it as a false positive

jreadey commented 2 years ago

@bilalshaikh42, The https://github.com/HDFGroup/hsds/security/code-scanning/1 issue comes from using eval to do sql-like queries over datasets. The user input runs through the checkQuery function (https://github.com/HDFGroup/hsds/blob/master/hsds/util/chunkUtil.py#L1032) before it gets executed. Do you know if there's an approved way for the input to be sanitized? Example query would be like: https://github.com/HDFGroup/hsds/blob/master/tests/integ/query_test.py#L192: "(open > 3000) & (open < 3100)". Other than parens, >, <, ==, !=, the only text should be field names (like "open"), numeric values ("3000"), or string constants (e.g. "hello world").

A fix for this issue: https://github.com/HDFGroup/hsds/security/code-scanning/4, would be problematic since it would break existing username/passwords. This path is only invoked when the password_salt config is set. If we document that this should only be used within secure environments (say a private network), will that allow it to be marked as false positive?

This issue: https://github.com/HDFGroup/hsds/security/code-scanning/6, come up in returning a list of domain names matching a given pattern. The vulnerability is that some patterns can be computationally expensive to compute and be used in a ddos attack. I'd be fine with using python glob matching, but I think the glob package also uses re.compile, so I'm not sure if that would address the vulnerability or not.

You are correct that https://github.com/HDFGroup/hsds/security/code-scanning/5 is used to determine which DN node should be used for a given uuid. Since the uuid is not user defined, that should be safe.

bilalshaikh42 commented 2 years ago

@bilalshaikh42, The Code injection issue comes from using eval to do sql-like queries over datasets. The user input runs through the checkQuery function (https://github.com/HDFGroup/hsds/blob/master/hsds/util/chunkUtil.py#L1032) before it gets executed. Do you know if there's an approved way for the input to be sanitized? Example query would be like: https://github.com/HDFGroup/hsds/blob/master/tests/integ/query_test.py#L192: "(open > 3000) & (open < 3100)". Other than parens, >, <, ==, !=, the only text should be field names (like "open"), numeric values ("3000"), or string constants (e.g. "hello world").

I am not that familiar with this area / a security expert, so I'm not sure. The recommendation that GitHub gave was to find a way to parse the input in some method directly rather than calling eval. Maybe if the input is well defined enough that would be an option? Or perhaps some custom sanitization function with a good number of tests might be the best approach?

A fix for this issue: Use of a broken or weak cryptographic hashing algorithm on sensitive data, would be problematic since it would break existing username/passwords. This path is only invoked when the password_salt config is set. If we document that this should only be used within secure environments (say a private network), will that allow it to be marked as false positive?

In general, you can simply dismiss the warnings in the security tab, without any particular checks/criteria.

I wouldn't say that is a false positive since it is indeed a vulnerability. How to address it is up to you, either documenting that using alternate methods of authentication such as AWS/keycloak is highly recommended since the default method is unsecured or having some backward compatible flag available for the next major release (breaking change).

I do think at the very least there should be a way to opt into a fixed method in future versions since there is a possibility this setup will be used and exposed to the internet by at least some users

This issue: Regular expression injection, come up in returning a list of domain names matching a given pattern. The vulnerability is that some patterns can be computationally expensive to compute and be used in a ddos attack. I'd be fine with using python glob matching, but I think the glob package also uses re.compile, so I'm not sure if that would address the vulnerability or not.

Simply passing the query through re.escape before the match should be the only thing needed

You are correct that Use of a broken or weak cryptographic hashing algorithm on sensitive data is used to determine which DN node should be used for a given uuid. Since the uuid is not user defined, that should be safe.

Marked this done

jreadey commented 2 years ago

In this commit: https://github.com/HDFGroup/hsds/commit/76519798b2b6beda87e010367de52629a0ab1e01, the BooleanParser ( https://github.com/HDFGroup/hsds/blob/master/hsds/util/boolparser.py) is used to evaluate the SQL-like expression.

So rather than taking any user input and directly running eval on it, it's fed to the BooleanParser class that creates an expression tree for a given input. e.g.: "(x > 42) & (y < 99)", it will parse the expression to insure it's composed of just dtype field names, constants, logical operators, and parens and that it's syntactically correct (e.g. balanced parens). Then the output of BooleanParser is used in the numpy.where method to query the chunk.

jreadey commented 2 years ago

The https://github.com/HDFGroup/hsds/security/code-scanning/6 issue should be resolved in this commit: https://github.com/HDFGroup/hsds/commit/c2c3e7e6ef8cbf03339f7bb474599af1da9cdb0d. Rather than using regualr expression, there's a custom globparser matcher here: https://github.com/HDFGroup/hsds/blob/master/hsds/util/globparser.py. This will match domain names using a subset of the linux bash syntax: https://linuxhint.com/bash_globbing_tutorial/. Currently globparser supports: , ?, and [] pattern matchers. One more restriction: only one can be used in a query. This is to limit the computational complexity in matching the pattern over large strings.

jreadey commented 2 years ago

These should all be resolved now.

jreadey commented 1 year ago

Closing issue