Spanner: Refactor the test fixture.

The existing test fixture is getting unmanageable in both the amount of time it takes to initialize, and the amount of unrelated code it's holding. Also, some decisions that were made at the beginning os Spanner test rewritting might have not been the best, in hindsight, and may be reconsider now.

The general proposal is to have several specialized fixtures, maybe all deriving from a base class.

Let's use this issue to discuss some options before starting any work.

@olavloite FYI (I can't add you as an assignee)

Some aspects that we might want to discuss (that I can think off right now)

Resource deletion approach.
Fixed resources (that we reuse for all tests run) vs. temporary resources.
Test execution ordering.
Frontier in which to split the current fixture.

My thoughts on these:

Resource deletion: Keeping resources after tests are done allows using them for diagnosing purposes when tests fail. But it also means that we need to delete stale resources before starting a new test run, and that if we are not very good at that, we hit quotas. Initializing test runs also takes longer because of all the deletions. I'm happy to reconsider this approach and delete temporary resources on test teardown, or maybe on fixture teardown.

Fixed vs. temporary resources: Thinking particularly of backups, but also of some DB operations. They take (or at least used to) a long time (some are still being skipped, see #1300 ), so if we can avoid creating temporary resources in those cases, that's preferable. The aim should be to keep running times as low as possible.

Test execution ordering: I'm still opposed to executing tests in a certain order, not just because of testing best practices, but also because there's no guarantee that users will execute samples in a given order. We need to make sure that samples work on their own.

Splitting the current fixture: Seems like the instance is a good place to do that, as in, each fixture works with a different DB instance? But this is just intuition.

In addition to the above, I would also like to add the following to the list of discussion points:

Parallel execution of tests

Resource deletion

While it can be useful to keep a resource around after a test run for diagnostic purposes, many of the sample tests are so simple that a failure will normally be easy to diagnose without the resource as well, especially if the single test is quick to run from a development machine. I would say that an approach that prefers deletion of resources after a run is preferable, unless the resource in question has some attributes that make it different from the most common resources, or if we know that it takes a long time to create. In the latter case, it might be more efficient to diagnose a test failure by looking at the resource that was left behind, than by rerunning the test.

Also, the deletion should normally be the responsibility of the code that creates is. So my suggestion would be something like:

A test class that only contains one test, and the test needs a simple resource (e.g. a database with one or two tables), should create a database, test the feature, and then drop the database.
A test class that contains more than one test method, and all the tests need a simple resource (e.g. a database with one or two tables), should create one database for all the test methods. Each test method should use the shared resource and bring it back to the state that it was before the test started. The resource should be deleted when all the tests in the class have executed.
A test that needs a resource that takes a long time to create (e.g. a backup), and that cannot be a fixed resource, should prefer delaying the deletion of the resource until the next test run. The initialization of the test/fixture that creates such a resource, should start by deleting stale resources of the type that it is creating. However, this approach should only be used if a fixed resource is not viable for whatever reason.

Fixed resources

Most types of resources are now quickly created by Cloud Spanner when the resource is small/empty, but backups can still be slow. Tests that need one or more backups should therefore prefer using fixed resources over creating temporary ones. Other resources can be created quickly. Some example times (the below tests use uniquely generated ids for the resources, as Cloud Spanner will otherwise throttle requests to create a new resource with the same name as a recently dropped resource):

Create and drop a single node regional instance: 2 seconds
Create and drop an empty database on an existing single node regional instance: 4-5 seconds
Create and drop a database with two sample tables (Singers, Albums): 4-5 seconds
Create and drop a database with two sample tables and a secondary index: 5-6 seconds
Create and drop a database with two sample tables, a secondary index and 10 test rows (5 per table): 7-9 seconds
Creating and dropping a backup of a database with two sample tables, a secondary index and 10 test rows: 03:29 minutes

Based on the above, I would suggest that test cases should prefer temporary resources for all resource types, except for backups.

Test execution order

I completely agree with the above, the tests should whenever possible be independent of execution order, and independent of all other tests.

Splitting the fixture

Based on all the above, I would suggest that:

Each base fixture should use its own Instance (not database). That instance could be fixed or created/dropped by the fixture.
Each test or test class should create and drop its own temporary database.

The above would make the tests completely independent of each other, which makes it easier to execute them individually (and also to execute them on the emulator).

Parallel execution

The above execution times are (except for extreme circumstances) not affected very much if multiple operations are executed in parallel on different databases. That means that an op-in option for executing (simple) tests in parallel would further reduce the total execution times of the sample tests. Tests that are set up to always create/drop their own temporary database can safely be executed in parallel with other tests that use the same strategy. This parallel strategy also works when executing tests on the emulator, as long as the individual tests do not use features that are not supported on the emulator. Executing parallel operations and transactions in different databases on the emulator is supported.

+1 to chatting about Parallel execution and dropping Test exection order since we are in agreement there.

Fixed resources (shared across test runs):

Agree to keep only backups fixed for now.
Be prepared to consider instances fixed as well (mostly because of quotas and not because of running times). There's a limit of 5 instance creates per minute. Instances might not be fixed, but we might just create one or two per test run.

Resource deletion and Splitting the fixture (mixing these two as there is much overlap)

Agree with not keeping resources after test execution just for diagnostic purposes only (except for very special cases).
On who should delete resources:
- On points 1 and 2: More than looking at test grouping in classes to determine who should delete the resources, I'd look at:
- Possible overlap of resource usage, naming, etc., specially if we are going to make tests run in parallel.
- Possible hits on quota.
- Avoiding test failure because of cleanup failure.
- Avoiding cleanup code duplication.
- In general, avoiding tests that need to set resources to its previous state. If that fails, subsequent tests that relied on it succeeding will fail as well. Also, won't play well parallelization.
- So, and given that we really have one test per class anyway:
- Instances are created on fixture startup and deleted on fixture teardown. (Except for maybe tests that actually test instance creation and deletion).
- Fixtures define how resources are shared, in particular databases. Each test uses the appropiate fixture. For instance:
  - We can have a fixture for queries, that initializes a database with all tables and data that will be expected by query tests. Query tests don't need to create or delete resources. The fixture will delete the database on teardown.
  - We can have a fixture for data modification, which allows tests to create a randomly named database and mark it for deletion. Deletion of all such database will hapen on teardown, with instance deletion.
- Deletion always happens on fixture teardown, and not on test teardown (except if we need to start deleting resources faster because of quotas, but let's get there first).
- Deletion always happens on a best effort basis, if deleting a resource fails, we move on to the next. Tests never fail.
- Either on teardown or startup (possibly on teardown so that we don't delay startup), we also delete stale resources that may exist because of cleanup failed on a previous run. Stale resources are at least a day old, so as not to accidentally delete resources being used right now by other test runs.
- On point 3, I don't understand the rationale here. If the creation takes a long time, but we need to create it anyway, why would that makes us deffer deletion? Did you mean if deletion takes a long time? Then I'd agree that deleting the resource in the test should be avoided, but as per my point before, we would delete it on fixture teardown.
Agree on having an instance (not database, that was a typo I think) per fixture.
As per my points above, I think that, when you see a class holding different tests, I just see a different fixture (see my proposal for query tests for instance). In general, I'd like to keep one test per file (or at the most, tests for one sample per file). This is for repo uniformity, and mostly becuase it makes it easier to delete one sample and its tests.

Parallel execution

I'm fine with allowing parallel execution. If I remember correctly we had some resource overlapped, which should be taken care of once we've agreed on a new approach here, and also because we were hitting some quotas (requests per minute kind of quotas). We might need to fine tune it but that's fine. For reference: https://xunit.net/docs/running-tests-in-parallel

(@olavloite now I could add you on the assignee list as well, weird why I couldn't at first)

@amanda-tarafa Thanks for the elaborate points. I think we agree on most (all?) of the above. I'll take a detailed look Thursday or Friday this week. The first half of this week is a little bit hectic.

Great! No rush at all on my part. This is a nice to have (do) but tests are running pretty consistently so it's fine that we take time to do all this.

I totally agree with almost all of the above, so I'll just focus on a couple of worries that I have regarding a couple of points (plus one point where my rationale was indeed not really clear):

... I think that, when you see a class holding different tests, I just see a different fixture ...

Yeah, I'm a little bit too used to the Java test environment, where there is no separate fixture, and the setup is part of the test class itself. I think in most cases we mean the same, but my terminology is a little off. The biggest difference is your idea on a shared query fixture, which I really like. I have a couple of small worries/exceptions (see below).

We can have a fixture for queries, that initializes a database with all tables and data that will be expected by query tests. Query tests don't need to create or delete resources. The fixture will delete the database on teardown.

I think that this will work very well for many of the query samples, but probably not all. There are two (maybe three) categories that I worry about:

Some query samples might need to change some data to show a certain effect. One example could for example show that a read-only transaction would continue to read a certain version of the data, even though a separate transaction has updated the data.
Some query samples will show new features that are not yet supported on the emulator. One common issue has for example been the introduction of new data types. The emulator will normally only start supporting that new data type after 3-6 months. If we have one query fixture for all samples, the query sample tests would not be runnable on the emulator when at least one of the tests use a data type (or possibly also other features) that is not supported on the emulator.
(This one's theorectial, I don't have any specific examples in mind at the moment): Some DDL statements can be slow to execute. It would defy some of the purpose of this refactor if we were to include such statements in the standard query fixture, as it would make all query sample tests slow to execute. Previously, creating a unique index could take a long time, even for an empty table. That is no longer the case, but there might be other features that have the same problem.

All the above can easily be worked around by using the 'data modification' fixture instead of the query fixture for the specific sample test that would include any of the above. So my suggestion would be:

Use the standard query fixture whenever possible.
Use a fixture that can create a temporary database for the specific test.

Deletion always happens on fixture teardown, and not on test teardown (except if we need to start deleting resources faster because of quotas, but let's get there first).

I agree that we should try to stick with the strategy that the deletion happens on fixture teardown (i.e. on instance deletion) instead of after each test, as long as that is possible. We might however be closer to that point than we think. The current samples directory contains 79 sample files, and there is a hard limit of max 100 databases per instance on Cloud Spanner. Assuming that many of the samples can use the shared query fixture, we should be good for now, but it is something that we should keep in mind if the number of (non-query) samples continues to grow.

On point 3, I don't understand the rationale here. If the creation takes a long time, but we need to create it anyway, why would that makes us deffer deletion? Did you mean if deletion takes a long time? Then I'd agree that deleting the resource in the test should be avoided, but as per my point before, we would delete it on fixture teardown.

Sorry, I should have included my rationale in that point. The reasoning is that we generally will not keep resources around for inspection in case of test failure, as most resources are quick and easy to create anyways. That means that debugging a failed test is normally easy to do by just running it again. That is not necessarily true for tests that create for example backups, and for those cases it might be useful to keep the resource around after the test has executed, and only delete it at a later time. This is however not something that I feel very strongly about either way, so I'm also perfectly happy to try to keep the strategy as clean as possible, and use the same deletion strategy for all types of resources.

One additional point regarding the deletion of backups: Cloud Spanner does not allow deleting instances that contain one or more backups. The fixture teardown must therefore first delete all backups that are stored on an instance before it can delete the instance.

Use the standard query fixture whenever possible.

Use a fixture that can create a temporary database for the specific test.

Yes, totally agree. We'll always need to support exceptions for the general case.

I agree that we should try to stick with the strategy that the deletion happens on fixture teardown ... Assuming that many of the samples can use the shared query fixture, we should be good for now, but it is something that we should keep in mind if the number of (non-query) samples continues to grow.

Yes, I'm also worried that we might be almost there. I wouldn't like to have to rewrite things again in a few months. Since the limit is per instance, I was wondering if we could just group some other tests in a similar way as will do queries, even if it's not for reusing one database, but just for avoiding the limit. For instance, we can group all the admin tests in one fixture, etc. But I'm ultimately fine with deleting databas on a per test-basis if we need to, or if we thinkg we'll need to in the short/medium term. It'll make tests slightly slower probably, but hey...

The reasoning is that we generally will not keep resources around for inspection in case of test failure, as most resources are quick and easy to create anyways.

Ah yes, I see. I think I'm now leaning towards a cleaner test strategy here and just try and delete everything at teardown. But also, I think that at teardown, we should continue to attempt to delete stale resources, which will allow us to always delete on a best effort basis, and not fail the tests if cleanup failed. So, I think it'll be easier to switch from one strategy to the other if we need to, basically, we will start by doing DeleteOwnBackups(); DeleteStaleBackups(); and we can just remove this firs call if we need to.

The fixture teardown must therefore first delete all backups that are stored on an instance before it can delete the instance.

Yes, I think we do this right now.

GoogleCloudPlatform / dotnet-docs-samples