
Broken or Flaky test: PermissionsIT #2739

Closed dlmarion closed 2 years ago

dlmarion commented 2 years ago

PermissionsIT test methods are failing after merging #2707

dlmarion commented 2 years ago

This may not be related to #2707 after all; it may instead be related to the ZooKeeper property changes. The test sets some arbitrary table properties and then confirms that the counts match. The test does not fail every time. The stack trace is:

org.opentest4j.AssertionFailedError: expected: <0> but was: <1>
    at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:55)
    at org.junit.jupiter.api.AssertionUtils.failNotEqual(AssertionUtils.java:62)
    at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:166)
    at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:161)
    at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:628)
    at org.apache.accumulo.test.functional.PermissionsIT.testArbitraryProperty(PermissionsIT.java:877)
    at org.apache.accumulo.test.functional.PermissionsIT.testGrantedTablePermission(PermissionsIT.java:804)
    at org.apache.accumulo.test.functional.PermissionsIT.tablePermissionTest(PermissionsIT.java:641)
...
ctubbsii commented 2 years ago

I looked at several recently flaky tests... and they all seem to be the same type: do something that changes a property in ZK, then check that it was set to verify. I think what's happening is that the new prop store cache lookup is much faster now than with ZooCache previously, which looked up one property at a time. Because the cache lookup is faster, it's likely much faster than the watcher that triggers an update to the cache, and we're more likely to see stale values.
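The racy pattern in these tests looks roughly like this (an illustrative sketch, not the actual PermissionsIT code; assumes an AccumuloClient "client", a table name "table", and a static import of JUnit's assertEquals):

// Write succeeds in ZooKeeper immediately...
client.tableOperations().setProperty(table, Property.TABLE_BLOOM_ENABLED.getKey(), "true");
// ...but this read goes through the prop store cache. If the ZooKeeper watcher
// has not fired yet, the cache can still hold the old value.
for (java.util.Map.Entry<String,String> e : client.tableOperations().getProperties(table)) {
  if (e.getKey().equals(Property.TABLE_BLOOM_ENABLED.getKey())) {
    assertEquals("true", e.getValue()); // flaky: may still observe the stale value
  }
}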

These tests could be improved by looping, waiting for the expected change (relying on timing out to fail the test if the asserted condition never happens, rather than an assert statement), or we can force a sync in the server-side request handling when a client asks to view properties, so the client sees recent edits. I'd probably prefer the latter, since it improves the user experience and we don't have to chase these tests one by one.
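As a sketch of the first option (waitForProperty is an illustrative helper, not an existing test utility):

// Poll until the property change becomes visible, letting the test's timeout
// produce the failure instead of asserting on a possibly-stale read.
private static void waitForProperty(AccumuloClient client, String table,
    String key, String expected) throws Exception {
  while (true) {
    for (java.util.Map.Entry<String,String> e : client.tableOperations().getProperties(table)) {
      if (e.getKey().equals(key) && expected.equals(e.getValue())) {
        return; // the change is now visible
      }
    }
    Thread.sleep(250); // not visible yet; poll again
  }
}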

dlmarion commented 2 years ago

These tests could be improved by looping, waiting for the expected change (relying on timing out to fail the test if the asserted condition never happens, rather than an assert statement)

This could be made easier if we put the version number into the configuration. Then, the client could loop until the properties come back with a version number greater than the one it had when it changed the configuration. For example:

// Property.CONFIGURATION_VERSION is hypothetical here (no such property exists
// yet); property values are strings, so the version is parsed as a long, and
// getConfiguration returns the table's properties as a Map.
long version = Long.parseLong(
    client.tableOperations().getConfiguration(tableName).get(Property.CONFIGURATION_VERSION.getKey()));
// set iterator on table
while (Long.parseLong(client.tableOperations().getConfiguration(tableName)
    .get(Property.CONFIGURATION_VERSION.getKey())) <= version) {
  Thread.sleep(250);
}
// move on
ctubbsii commented 2 years ago

That's one way to do it, without any architecture changes. However, it really only helps for tests, and we'd have to modify each test to do this. This solution could still have problems with collisions, and there are also situations where you see that the value is updated on the one server you asked... but then you perform an operation (like a scan), and the tserver doing that operation still sees the old version.

EdColeman commented 2 years ago

If the tests are setting multiple properties, it would help if they were bundled into an atomic set - that reduces the number of watch notifications that potentially need to be handled.

With the way ZooKeeper events are propagated, clients receive notifications asynchronously - they will eventually be consistent (barring additional changes).

Possible architectural changes for mitigation:

  • send the properties with the scan request.
  • send a version id of the properties to use (this might require keeping multiple, tagged versions so that a specified version is always available)
  • send a timestamp that requests the properties have the same (or newer) timestamp. You could still end up with conflicts, but they would be more explicitly bound.

ctubbsii commented 2 years ago

If the tests are setting multiple properties, it would help if they were bundled into an atomic set - that reduces the number of watch notifications that potentially need to be handled.

That will be easier after #2717 is done
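For illustration, the kind of atomic bulk update #2717 is after might look something like this (the eventual API shape may differ):

// One versioned update for the whole batch, so watchers fire once instead of
// once per property (sketch only).
client.tableOperations().modifyProperties(tableName, props -> {
  props.put("table.custom.key1", "value1");
  props.put("table.custom.key2", "value2");
});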

With the way ZooKeeper events are propagated, clients receive notifications asynchronously - they will eventually be consistent (barring additional changes).

Possible architectural changes for mitigation:

  • send the properties with the scan request.
  • send a version id of the properties to use (this might require keeping multiple, tagged versions so that a specified version is always available)
  • send a timestamp that requests the properties have the same (or newer) timestamp. You could still end up with conflicts, but they would be more explicitly bound.

It might be nice to pass a generic metadata structure over every RPC call, so we can pass arbitrary state information to servers. We currently only send trace information that way, but we could expand that to include information about the caller's view of the configuration(s). The server could check to make sure that it is always at least as new as that version, and expire any cached items that are older, before proceeding. I wouldn't want to keep multiple versions, though. It should be fine as long as the server's view is at least as up-to-date as the client's. We don't need fully isolated configuration... we just need to order client requests that use properties after processing any previous client requests that update properties.

We would also need to have the property setter RPC methods return a version once they are set, so the client knows what version they just set.
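A minimal sketch of that metadata idea (RequestMetadata and the server-side hooks are hypothetical, not existing Accumulo classes):

// Hypothetical per-RPC metadata carrying the caller's view of the config version.
class RequestMetadata {
  final long configVersion; // highest configuration version the caller has seen or set
  RequestMetadata(long configVersion) { this.configVersion = configVersion; }
}

// Server side, before handling a request: if the cached configuration is older
// than what the caller has already seen, expire it so the next read reloads from
// ZooKeeper (cachedConfigVersion/expireCachedConfig are placeholders).
void ensureConfigAtLeast(RequestMetadata meta) {
  if (cachedConfigVersion() < meta.configVersion) {
    expireCachedConfig();
  }
}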
dlmarion commented 2 years ago

If I'm remembering the problem correctly from last week, the client modifies some properties and the server watcher has not fired before the client fetches the property to verify that it's been set. Have we considered having ZooPropStore.get(PropCacheKey) call ZooPropStore.readPropsFromZK(PropCacheKey) to confirm that the cached version matches the version in ZK?
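Something like this, perhaps (a rough sketch; the actual ZooPropStore internals and signatures may differ):

// Always consult ZooKeeper and replace the cached copy if it is stale.
public VersionedProperties get(PropCacheKey key) {
  VersionedProperties cached = cache.get(key);
  VersionedProperties fromZk = readPropsFromZK(key);
  if (cached == null || cached.getDataVersion() < fromZk.getDataVersion()) {
    cache.put(key, fromZk); // cached copy was stale; replace it
    return fromZk;
  }
  return cached;
}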

ctubbsii commented 2 years ago

If I'm remembering the problem correctly from last week, the client modifies some properties and the server watcher has not fired before the client fetches the property to verify that it's been set. Have we considered having ZooPropStore.get(PropCacheKey) call ZooPropStore.readPropsFromZK(PropCacheKey) to confirm that the cached version matches the version in ZK?

That's basically equivalent to expiring the cache, which forces it to reload from ZK, which is what PR #2740 (and #2741) do.

dlmarion commented 2 years ago

That's basically equivalent to expiring the cache, which forces it to reload from ZK, which is what PR https://github.com/apache/accumulo/pull/2740 (and https://github.com/apache/accumulo/pull/2741) do.

I was just wondering if we should attempt to make the ZK clients (Accumulo server processes) more consistent. I know there is a cost to it, but if the server processes were more consistent then we would not need to invalidate the cache on a client call.

dlmarion commented 2 years ago

I went back and re-read the ZooKeeper Consistency Guarantees. I think the note at the end means that my suggestion won't work: without a sync(), a re-read can still be served by a ZooKeeper server that has not yet applied the write. I think it also means that #2740 and #2741 won't work either in the case where there is more than one ZK node. #2740 and #2741 may make the tests work, but it will be because of a side effect of the number of ZK servers that MAC starts.

EdColeman commented 2 years ago

Does it matter that configuration / properties are relatively stable? Once set, they generally do not change often. The tests are rather synthetic in that they set properties and then immediately expect to see the changes. In a real system, those changes are going to take time to propagate, and the design should recognize that ambiguity.

dlmarion commented 2 years ago

I agree, I think this is an issue that likely only affects the tests. Can we specify a different ZooPropStore for MAC? Maybe a subclass of ZooPropStore that removes the cached value and reloads it as part of the get() call?

ctubbsii commented 2 years ago

I agree, I think this is an issue that likely only affects the tests. Can we specify a different ZooPropStore for MAC? Maybe a subclass of ZooPropStore that removes the cached value and reloads it as part of the get() call?

I don't think it's a good idea to run tests with a different prop store. That will cause us to lose testing coverage for the code that users will use.

I went back and re-read the ZooKeeper Consistency Guarantees. I think the note at the end means that my suggestion won't work. I think it also means that #2740 and #2741 won't work either in the case where there is more than one ZK node. #2740 and #2741 may make the tests work, but it will be because of a side effect of the number of ZK servers that MAC starts.

Those aren't just for fixing tests... they apply any time the user asks our API for the configuration. Tests do this, but users can too. #2740 isn't trying to solve any kind of global consistency across clients. It's a much more modest solution: just do a better job of returning something at least as recent as any previous setProperty() operations that client might have issued. If we wanted an extra guarantee, we'd also zk.sync before re-reading, but clearing the cache to force it to read from ZK is much more stable than what's happening now.
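For reference, that extra guarantee would look roughly like this against the plain ZooKeeper API (propPath stands in for the property node's path):

// sync() is asynchronous, so wait for its callback before re-reading; the read
// then reflects everything committed before the sync point.
CountDownLatch synced = new CountDownLatch(1); // java.util.concurrent
zk.sync(propPath, (rc, path, ctx) -> synced.countDown(), null);
synced.await();
byte[] data = zk.getData(propPath, false, null); // re-read after the sync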

dlmarion commented 2 years ago

https://github.com/apache/accumulo/pull/2740 isn't trying to solve any kind of global consistency across clients.

Understood.

It's a much more modest solution: just do a better job of returning something at least as recent as any previous setProperty() operations that client might have issued.

I think that will only happen if:

If we wanted an extra guarantee, we'd also zk.sync before re-reading, but clearing the cache to force it to read from ZK is much more stable than what's happening now.

I agree. It should definitely help for the tests, and should help in production too, as config changes are likely rare occurrences.

ctubbsii commented 2 years ago

While not a 100% fix for the underlying issue, #2740 should fix the test instabilities we've seen.