hacky workaround: use locking to quiet race detector problems

seebs commented 1 year ago

So we have a problem which is triggered in part by the race detector, but which is actually deeper, but also possibly rare enough to be politely ignored.

The real underlying issue is that sometimes when we have multiple tests running in CI, multiple instances of the CLI test end up using the same postgres database backing for some of their DAX stuff. We have workarounds for this in some places, but not others.

But the observed symptom of this is that it can cause a trivial race detector issue where we have one call to (*Resource).Lock() and another call to (*Resource).IsLocked() which aren't synchronized in any way, so if the race detector spots this, it complains.

We can suppress that very easily by synchronizing these. That does not solve the other possibly-weird problems, so this may not actually address the issue, but I think it might reduce the rate of sporadic failures significantly, which would give us some time to think about solving the deeper problem.

The underlying design issue is that we're reusing the database name in postgres for testing. This lets us have bounded growth (one database) while leaving the database contents up after a failed test (so we can examine them), then truncating the database during startup if it already exists. Which works fine if only one thing runs at once, which would be true on a laptop, but in CI, it's sometimes not true. A real fix for that is complex and requires some rethinking of how we approach the test stuff, as we don't want unbounded growth, but we also don't want two copies of the test running at once to see each other, and ensuring cleanup after a test failure is surprisingly hard.

sonarcloud[bot] commented 1 year ago

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
0.0% Duplication

seebs commented 1 year ago

I'm still not sure whether this will actually help or whether we'll just get different failures now in the same circumstances.

seebs commented 1 year ago

So my thought is, if this passes CI, it might make things better, or if it doesn't, will at least mean that we get different error messages that are more directly on point for "something horribly corrupted in the postgres database", and if that happens, well, we can ticket that and pursue it. But I've seen a couple of failures of this in the race detector tests, and none in anything else, so I'm suspecting this will help.

FeatureBaseDB / featurebase

hacky workaround: use locking to quiet race detector problems #2352