JanusGraph / janusgraph

JanusGraph: an open-source, distributed graph database
https://janusgraph.org

Define and support vertex and edge visibility via native HBase visibility #493

Open jerryjch opened 6 years ago

jerryjch commented 6 years ago

JanusGraph already has a preliminary definition of 'visibility':

    /**
     * Returns true if this storage backend supports entry-level visibility by attaching a visibility or authentication
     * token to each column-value entry in the data store and limited retrievals to "visible" entries.
     *
     * @return
     */
    boolean hasVisibility();

    public enum EntryMetaData {

        TTL(Integer.class, false, data -> data instanceof Integer && ((Integer) data) >= 0L),
        VISIBILITY(String.class, true, data -> data instanceof String && StringEncoding.isAsciiString((String) data)),
        TIMESTAMP(Long.class, false, data -> data instanceof Long);

But nothing further has been implemented or supported yet.

Here we will try to define and support vertex and edge visibility via native HBase cell-level visibility.

jahhulbert-ccri commented 6 years ago

Hi @jerryjch I'd love to test it out when something is available...let me know how I can help.

jerryjch commented 6 years ago

I am thinking of starting the work. The following is a preliminary, high-level specification.

More?

jahhulbert-ccri commented 6 years ago

Hi @jerryjch, this is awesome. One initial thought is to have an "authorization provider" interface that provides a hook to inject the current scan authorizations for HBase (or another system like Accumulo) by calling Scan.setAuthorizations() or Get.setAuthorizations().

A common setup I have encountered is:

In that use case you'd need some method of injecting the user's visibilities per request into the HBase scan (historically I've seen some Spring Security mechanisms in use here).

It would be nice if user identities were mapped all the way to HBase, but I often don't see that. However, we often have requirements that augment the visibilities based on the geography of requests, or that require intersecting visibilities for "proxied" requests (e.g. servers making queries on behalf of users, or users transiting certain proxy servers). Similarly, it seems everyone always wants to be able to turn off visibilities on the fly in the user interface.

What have you seen in your experience? If there were a default implementation that is a pass-through, it would provide a hook for a runtime implementation in some manner via client config. Hopefully those examples help motivate what I'm thinking about. I can share some examples if needed, and I'd definitely love to kick the tires a little and help out with any review! I'll share this with some coworkers to see if there are other use cases or input.
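To make the idea concrete, here is a minimal sketch of such a hook in plain Java. All names (`AuthorizationProvider`, `PassThroughAuthorizationProvider`, `RequestScopedAuthorizationProvider`) are hypothetical and are not part of JanusGraph or HBase. The sketch only assumes that the storage layer would ask the provider for labels before each read.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical hook: supplies the visibility labels to attach to each backend read. */
interface AuthorizationProvider {
    /** Labels for the current request; an empty list means "set nothing on the scan". */
    List<String> currentAuthorizations();
}

/** Default pass-through: never sets labels, so the backend applies its own defaults. */
final class PassThroughAuthorizationProvider implements AuthorizationProvider {
    @Override
    public List<String> currentAuthorizations() {
        return Collections.emptyList();
    }
}

/** Example runtime implementation that reads labels bound to the current thread. */
final class RequestScopedAuthorizationProvider implements AuthorizationProvider {
    private static final ThreadLocal<List<String>> CURRENT =
            ThreadLocal.withInitial(Collections::emptyList);

    /** Called by the (hypothetical) request filter once the caller is authenticated. */
    static void bind(List<String> labels) {
        CURRENT.set(Collections.unmodifiableList(new ArrayList<>(labels)));
    }

    /** Called when the request finishes so labels do not leak to the next request. */
    static void clear() {
        CURRENT.remove();
    }

    @Override
    public List<String> currentAuthorizations() {
        // With HBase, a non-empty result would be handed to
        // Scan.setAuthorizations(new Authorizations(labels)) by the storage layer.
        return CURRENT.get();
    }
}
```

A Spring Security filter or similar mechanism could call `bind(...)` at the start of a request and `clear()` at the end. Whether a thread-local is appropriate depends on how the server schedules work, so treat that part as an assumption too.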

jerryjch commented 6 years ago

Hi, @jahhulbert-ccri

Thanks for the feedback! All good points.

We will take the JanusGraph server (Gremlin server) as an example (a proxy server example and an application example) to illustrate the points you raised. (This does not mean the Gremlin server currently supports any of these.)

Case 1:

When a user (user1) sends a request to the Gremlin server, the Gremlin server authenticates 'user1'. Then, when the Gremlin server sends requests to the backend, it uses/forwards 'user1' as the authid, basically proxying/impersonating 'user1'. Since 'user1' has a predefined 'authorization' (one or more visibility labels which 'user1' is allowed to see), the result returned will be the subset of graph assets within that 'authorization'.

Note that in this case, the authorization mappings/policies are stored in HBase and can be manipulated via the HBase shell or API. This is a straightforward case where user identities are mapped all the way to HBase, and a user gets what they are authorized to see.

Case 2:

When a user (user1) sends a request to the Gremlin server, the Gremlin server authenticates 'user1'. Then, based on some authorization mappings/policies from an 'external source', the Gremlin server sets the 'authorization' on the individual query before sending it to the backend. In this case, the 'authid' of the Gremlin server is used against the backend, but the returned result is limited by the 'authorization' set on the query. The 'authid' of the Gremlin server is more like a 'superuser'.

Case 1 can be supported by the preliminary specification I outlined in the earlier comment. Case 2 can be supported as well with some additional API in JanusGraph, but an external source of authorization mappings/policies needs to be able to plug into JanusGraph.

I hope I have understood your points and requirements correctly.

ammaroveold commented 5 years ago

Hi all. Is there any update on this issue? I would like to help, but I still need more time to understand the framework.

jahhulbert-ccri commented 5 years ago

It looks like Case 1 is fairly straightforward, since the 'user identity' is essentially mapped all the way through from the front end (like a proxy server) to the backend.

Case 2 is definitely more complex, but also more common in the RDBMS and big-table (Accumulo, HBase) systems that I use. In this case the Gremlin server (which is basically an application server) usually has some sort of read-only 'privileges' on the database and/or tables, and then has a set of authorizations. As you mentioned, the end-user authorizations from the third-party 'authorization service' must be provided, and usually they are intersected with the application's authorizations. This intersection is important to note, and it's the reason for supporting "scan-time" authorizations. I have actually worked on a few of these and can provide examples.

In the GeoMesa system we delegate all this authorization work to HBase and Accumulo via visibilities. Accumulo has a trivial example in their docs. As you can see, each "cell" of data in HBase or "key-value" in Accumulo has a visibility expression that is set. The [HBase API](http://archive.cloudera.com/cdh5/cdh/5/hbase-0.98.6-cdh5.2.1/book/hbase.visibility.labels.html) basically allows you to set the visibilities at scan time:

Get#setAuthorizations(new Authorizations(String,...));
Scan#setAuthorizations(new Authorizations(String,...));

Note that HBase servers throw an exception if you try to set an auth you aren't assigned on the back end, so you can't cheat. What this does allow is selectively downgrading your visibilities per scan, which is essentially what Case 2 is.
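The intersection step can be sketched in a few lines of plain Java. This is only an illustration: `ScanAuthorizations` and `effective` are hypothetical names, and the assumption is that the application already knows its own labels and has fetched the user's labels from the external authorization service.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper that computes the labels to request for one scan. */
final class ScanAuthorizations {
    private ScanAuthorizations() {}

    /**
     * Returns the labels that are both granted to the end user (by the external
     * service) and held by the application account on the backend.
     */
    static Set<String> effective(Set<String> applicationAuths, Set<String> userAuths) {
        Set<String> result = new LinkedHashSet<>(userAuths);
        // Drop anything the application itself does not hold, so the scan
        // never asks the backend for a label it would reject.
        result.retainAll(applicationAuths);
        return Collections.unmodifiableSet(result);
    }
}
```

The resulting set is what would be passed to `Scan#setAuthorizations` or `Get#setAuthorizations` for that request, so each scan runs with a downgraded subset of the application's own authorizations.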

Most of the use cases are for enforcement of ABAC solutions. There are some good examples in health care, financial, defense, and network security applications.

  1. Policies based on the source destination of queries. Enforce visibility filtering based on traffic coming from a certain proxy server source.

A health care example in JanusGraph might be a system that contains patient data: say, PII (social security number, address, name) mixed with medical data (patient has lab result X) and insurance data (patient has insurance Y). You can have a node in the graph with something like:

[patient 1 name]               Visibility Expression: (billing | audit | doctor | nurse)
[patient 1 SSN]                Visibility Expression: (billing | audit)
[patient 1 test ordered]       Visibility Expression: (billing | audit) | ((doctor | nurse) & departmentX)
[patient 1 test result node]   Visibility Expression: (doctor | nurse) & departmentX

[patient 2 name]               Visibility Expression: (billing | audit | doctor | nurse)
[patient 2 SSN]                Visibility Expression: (billing | audit)
[patient 2 test ordered]       Visibility Expression: (billing | audit) | ((doctor | nurse) & departmentY)
[patient 2 test result node]   Visibility Expression: (doctor | nurse) & departmentY

Note that the expressions here enforce a need-to-know principle for department X vs. Y, as well as for billing vs. doctor/nurse. The SSN is information that the doctor doesn't need to know. The sensitive test result is something billing never needs to know; they only need to know that the test was ordered so they can bill for it. Similarly, a nurse from the dermatology department does not necessarily need to know the result of a paternity test, which is sensitive information.
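To show how such expressions resolve against a set of authorizations, here is a small self-contained evaluator in Java. It is a sketch, not the parser HBase or Accumulo uses: `VisibilityExpression` is a hypothetical name, labels are assumed to be alphanumeric, only `|`, `&`, and parentheses are handled, and `&` is assumed to bind tighter than `|` when parentheses are omitted.

```java
import java.util.Set;

/** Minimal recursive-descent evaluator for expressions like "(doctor | nurse) & departmentX". */
final class VisibilityExpression {
    private final String text;
    private final Set<String> auths;
    private int pos;

    private VisibilityExpression(String text, Set<String> auths) {
        this.text = text;
        this.auths = auths;
    }

    /** Returns true if a reader holding auths may see data labelled with expression. */
    static boolean isVisible(String expression, Set<String> auths) {
        VisibilityExpression parser = new VisibilityExpression(expression, auths);
        boolean result = parser.parseOr();
        if (parser.peek() != '\0') {
            throw new IllegalArgumentException("Unexpected input at position " + parser.pos);
        }
        return result;
    }

    // or := and ('|' and)*
    private boolean parseOr() {
        boolean value = parseAnd();
        while (peek() == '|') {
            pos++;
            value = parseAnd() | value; // non-short-circuit: the right side is always parsed
        }
        return value;
    }

    // and := atom ('&' atom)*
    private boolean parseAnd() {
        boolean value = parseAtom();
        while (peek() == '&') {
            pos++;
            value = parseAtom() & value;
        }
        return value;
    }

    // atom := LABEL | '(' or ')'
    private boolean parseAtom() {
        if (peek() == '(') {
            pos++;
            boolean value = parseOr();
            if (peek() != ')') {
                throw new IllegalArgumentException("Expected ')' at position " + pos);
            }
            pos++;
            return value;
        }
        int start = pos; // peek() has already skipped leading whitespace
        while (pos < text.length() && Character.isLetterOrDigit(text.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("Expected label at position " + pos);
        }
        return auths.contains(text.substring(start, pos));
    }

    /** Skips whitespace and returns the next character, or '\0' at end of input. */
    private char peek() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        return pos < text.length() ? text.charAt(pos) : '\0';
    }
}
```

With this sketch, a reader holding `nurse` and `departmentX` can see the "patient 1 test ordered" entry, while a reader holding `nurse` and `departmentY` cannot see the "patient 1 test result node" entry.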

Financial institutions have similar issues. They often need to firewall financial information between departments. A label or visibility expression in a graph allows certain elements of the graph to be excluded to meet regulations.

The way this fits in with Case 2 is that the HBase or RDBMS username that logs in is an application name like "billing software" or "Nursing station Y". The application has a maximum set of authorizations to access data, which is then intersected at the application level with the user's authorizations obtained by querying the third-party service. It's a somewhat easier model than pushing the identity all the way to the storage level, though pushing the identity all the way through is arguably a stronger security model. In practice, though, that makes implementation much harder because you have to provision user accounts in multiple places.

You sometimes want geo-based filtering too, which can't be captured in database logic very easily but can be done with scan-level filtering.

Hoping that helps!