fabric8io / openshift-elasticsearch-plugin

Apache License 2.0

bug 1705589. Reduce number of calls handling a request and cache cont… #179

Closed. jcantrill closed this 5 years ago.

jcantrill commented 5 years ago

…exts

Mitigates https://bugzilla.redhat.com/show_bug.cgi?id=1705589 by moving to bulk requests and introducing caching for syncing.

jcantrill commented 5 years ago

where is cache load called?

Here on cache miss: https://github.com/fabric8io/openshift-elasticsearch-plugin/pull/179/files#diff-bb75f44a950c64b663055d3c78b729c5R109

richm commented 5 years ago

where is cache load called?

Here on cache miss: https://github.com/fabric8io/openshift-elasticsearch-plugin/pull/179/files#diff-bb75f44a950c64b663055d3c78b729c5R109

That's the definition of the load method, but I do not see where it is called. I'm mainly trying to figure out whether it is called in a thread-safe way.

jcantrill commented 5 years ago

where is cache load called?

Here on cache miss: https://github.com/fabric8io/openshift-elasticsearch-plugin/pull/179/files#diff-bb75f44a950c64b663055d3c78b729c5R109

That's the definition of the load method, but I do not see where it is called. I'm mainly trying to figure out whether it is called in a thread-safe way.

I defer to the Guava documentation, which claims it is thread-safe: https://google.github.io/guava/releases/22.0/api/docs/com/google/common/cache/LoadingCache.html

This cache would live on each node, so presumably there is a chance of a race condition if your requests get bounced between nodes. The actual write, however, uses backoff logic; I would expect the permissions to coalesce toward correctness.
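
For reference, a minimal sketch of the Guava LoadingCache pattern under discussion; the class, the ProjectSource interface, and the expiry interval below are illustrative stand-ins, not the plugin's actual code.

```java
import java.util.Set;
import java.util.concurrent.TimeUnit;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

public class UserProjectCacheSketch {

    /** Hypothetical stand-in for the call out to the OpenShift API. */
    interface ProjectSource {
        Set<String> projectsFor(String username);
    }

    private final LoadingCache<String, Set<String>> cache;

    UserProjectCacheSketch(final ProjectSource source) {
        this.cache = CacheBuilder.newBuilder()
                // expire entries so changes to a user's project list are eventually picked up
                .expireAfterWrite(2, TimeUnit.MINUTES)
                .build(new CacheLoader<String, Set<String>>() {
                    @Override
                    public Set<String> load(String username) {
                        // Guava runs at most one load per key at a time; other threads
                        // asking for the same key block until this returns.
                        return source.projectsFor(username);
                    }
                });
    }

    Set<String> projects(String username) {
        return cache.getUnchecked(username); // load() is invoked here on a cache miss
    }
}
```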

richm commented 5 years ago

Other than the aforementioned NPE problem, I don't see anything obviously wrong. My main concern is that this is a big change and an architectural change, introduced into a stable product branch, which could cause regressions due to unforeseen consequences.

On the other hand, this is really the only way to solve this problem, by using a cache.

How can we test the heck out of this thing, in a high traffic, multi-node deployment, so we can at least have some confidence that it will actually work in production without causing more problems/security issues?

jcantrill commented 5 years ago

Other than the aforementioned NPE problem, I don't see anything obviously wrong. My main concern is that this is a big change and an architectural change, introduced into a stable product branch, which could cause regressions due to unforeseen consequences.

On the other hand, this is really the only way to solve this problem, by using a cache.

How can we test the heck out of this thing, in a high traffic, multi-node deployment, so we can at least have some confidence that it will actually work in production without causing more problems/security issues?

I believe the only way is to have QE test an early version of the image before merge, using an arrangement similar to the one that exposed the issue, and then run regression tests.

Gallardot commented 5 years ago

@jcantrill

https://bugzilla.redhat.com/show_bug.cgi?id=1644008

In the non-ops Kibana, they either get "Courier Fetch: unhandled courier request error: [security_exception] no permissions for indices:data/read/mget" or a gateway timeout with a similar/same message.

Is the problem solved? If not, I think I can offer some clues; we have hit the same problem.

[screenshot attached: 18_46_25__06_05_2019]

I don't think reducing the number of apiserver calls while handling a request can solve this problem. I think the problem is how and when to sync the OpenShift ACL to ES SearchGuard.

#173 is my solution.

jcantrill commented 5 years ago

@jcantrill

https://bugzilla.redhat.com/show_bug.cgi?id=1644008

In the non-ops Kibana, they either get "Courier Fetch: unhandled courier request error: [security_exception] no permissions for indices:data/read/mget" or a gateway timeout with a similar/same message.

The permissions issue may simply be a matter of expanding the set given to non-ops users.

I don't think reducing the number of apiserver calls while handling a request can solve this problem. I think the problem is how and when to sync the OpenShift ACL to ES SearchGuard.

This change does not really do much to reduce apiserver calls, but uses mgets, searches, and bulk inserts to reduce the number of calls to and from ES; this is where we experienced and noted issues. For instance, the users who exposed this problem each had access to 300+ projects, and we would probably make that many calls plus another 20 to set up a single user. I hope you can imagine the problems this causes when more than one user tries to access Kibana on an already loaded ES instance.

#173 is my solution.

I'm not sure what part of #173 you are suggesting here or where the change is. Is it to sync the ACLs after we formulate the context? Is it to put in a bogus password?
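
To illustrate the batching idea described above (this is not the PR's actual code): with the ES 5.x transport client, a single mget and a single bulk call replace hundreds of per-document round trips. The index and type names below are placeholders.

```java
import java.util.Map;

import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.get.MultiGetRequest;
import org.elasticsearch.action.get.MultiGetResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Client;

public class BatchedCallsSketch {

    /** Fetch many documents in one round trip instead of one GET per document. */
    MultiGetResponse readAll(Client client, Iterable<String> docIds) {
        MultiGetRequest mget = new MultiGetRequest();
        for (String id : docIds) {
            mget.add(".kibana", "doc", id); // placeholder index/type
        }
        return client.multiGet(mget).actionGet();
    }

    /** Write many documents in one bulk call instead of one index request each. */
    void writeAll(Client client, Map<String, Map<String, Object>> docsById) {
        BulkRequest bulk = new BulkRequest();
        for (Map.Entry<String, Map<String, Object>> entry : docsById.entrySet()) {
            bulk.add(new IndexRequest(".kibana", "doc", entry.getKey()).source(entry.getValue()));
        }
        client.bulk(bulk).actionGet();
    }
}
```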

Gallardot commented 5 years ago

@jcantrill

I have tested this PR in my test environment.

Version
OpenShift Master: v3.9.0+ba7faec-1
Kubernetes Master: v1.9.1+a0ce1bc657
OpenShift Web Console: v3.9.0+b600d46-dirty

Image based on openshift/origin-logging-elasticsearch5:v3.11.0

There is nothing wrong with this PR.

I have also reviewed this PR, but I have some questions:

If we cache the user's projects, why do we need to sync the ACL to SearchGuard on every ES query? I think it's because our process needs to do that:

1.  load ALL ACLs from SearchGuard.
2.  load one user's ACL from the cache or from OpenShift.
3.  update that user's ACL in memory.
4.  delete ALL expired ACLs from memory.
5.  write the ACLs from memory back to SearchGuard.

After this, SearchGuard will read the ACLs from ES to perform permission validation. All of these steps are executed on a single ES node.

Imagine a concurrent scenario:

User XX accesses ES node X.
User YY accesses ES node Y.

The request of user XX completes steps 1-5, and then the request of user YY completes steps 1-5. User YY's operation may replace user XX's, and user XX will then get an error when its permissions are checked.

I think ACL synchronization on a single ES master node is a better choice.
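
A schematic of the five steps listed above, to make the read-everything / write-everything shape of the cycle explicit; every name here is a hypothetical stand-in, not the plugin's or SearchGuard's API.

```java
import java.util.Map;
import java.util.Set;

public class AclSyncCycleSketch {

    interface SearchGuardStore {
        Map<String, Set<String>> loadAllAcls();            // step 1
        void writeAllAcls(Map<String, Set<String>> acls);  // step 5
    }

    interface ProjectSource {
        Set<String> projectsFor(String user);              // step 2 (cache or OpenShift)
    }

    void syncFor(String user, SearchGuardStore store, ProjectSource projects) {
        Map<String, Set<String>> acls = store.loadAllAcls();    // 1. read the whole ACL set
        Set<String> userProjects = projects.projectsFor(user);  // 2. this user's projects
        acls.put(user, userProjects);                           // 3. update this user in memory
        acls.keySet().removeIf(this::isExpired);                // 4. drop expired entries
        store.writeAllAcls(acls);                               // 5. write everything back
        // Because steps 1 and 5 read and write the entire ACL set, two nodes running
        // this concurrently can overwrite each other's updates, which is the race
        // described above.
    }

    private boolean isExpired(String user) {
        return false; // expiry bookkeeping omitted in this sketch
    }
}
```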

jcantrill commented 5 years ago

If we cache the user's projects, why do we need to sync the ACL to SearchGuard on every ES query? I think it's because our process needs to do that.

This PR introduces caching, which only loads on a cache miss and then causes a sync. The cache will expire after a given time, so it will not sync on every ES query for a given user. We expire the cache because a user's project list can change.

1.  load ALL ACLs from SearchGuard.
2.  load one user's ACL from the cache or from OpenShift.
3.  update that user's ACL in memory.
4.  delete ALL expired ACLs from memory.
5.  write the ACLs from memory back to SearchGuard.

This is basically the workflow, with subtle differences: the ACLs are only loaded if we determine it is necessary to update them for a given user.

After this, SearchGuard will read the ACLs from ES to perform permission validation.

SearchGuard only loads the ACLs from ES when we execute a 'ConfigUpdateRequest', which happens when we determine it is necessary to write permissions for a given user.

All of these steps are executed on a single ES node.

Imagine a concurrent scenario:

User XX accesses ES node X.
User YY accesses ES node Y.

The request of user XX completes steps 1-5, and then the request of user YY completes steps 1-5. User YY's operation may replace user XX's, and user XX will then get an error when its permissions are checked.

I recognize there is a potential race condition here, but the permissions are reloaded and merged any time we may need to re-write them. There is logic in place to address concurrent updates that relies on the document version, though to be fair, there does need to be multi-node, multi-user testing. I cannot recall if the version is only relevant when doing an 'update' or if it is also significant with a 'replace'. Assuming that works correctly, I expect updates from multiple nodes would be no different than in the current implementation. The difference in this PR is the number of 'write' calls and the expiration logic.

I think ACL synchronization on a single ES master node is a better choice.

I agree, but we have no way of enforcing that a single node performs the synchronization unless we had some kind of clustered cache of user contexts. We are not interested in adding that complicated a dependency, as ES is already a clustered technology.
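
As a rough illustration of the document-version logic mentioned above (not the PR's actual code): an optimistic read-merge-write against ES 5.x that retries when another node has written the document in between.

```java
import java.util.Map;

import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.engine.VersionConflictEngineException;

public class VersionedAclWriteSketch {

    // Read-modify-write with a version check so a concurrent writer on another
    // node is detected instead of being silently overwritten.
    void writeMerged(Client client, String index, String type, String id, Map<String, Object> changes) {
        for (int attempt = 0; attempt < 3; attempt++) {
            GetResponse current = client.get(new GetRequest(index, type, id)).actionGet();
            Map<String, Object> merged = current.getSourceAsMap(); // assumes the document exists
            merged.putAll(changes); // naive merge; real logic would reconcile individual ACL entries

            try {
                client.index(new IndexRequest(index, type, id)
                        .source(merged)
                        .version(current.getVersion())) // rejected if the version changed meanwhile
                      .actionGet();
                return;
            } catch (VersionConflictEngineException e) {
                // another node updated the document first; reload, merge, and try again
            }
        }
    }
}
```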

jcantrill commented 5 years ago

[merge]