Unify DAG creation/database cleaning fixtures for testing

AetherUnbound commented 1 year ago

Description

We have a few places in our catalog tests with very similar logic for creating the DAG necessary for running some tests. Examples:

https://github.com/WordPress/openverse/blob/9056029e6c354fe2dbf5b0ad49d60648bd380d27/catalog/tests/dags/common/sensors/test_single_run_external_dags_sensor.py#L23-L34

https://github.com/WordPress/openverse/blob/9056029e6c354fe2dbf5b0ad49d60648bd380d27/catalog/tests/dags/data_refresh/test_dag_factory.py#L22-L25

https://github.com/WordPress/openverse/blob/9056029e6c354fe2dbf5b0ad49d60648bd380d27/catalog/tests/dags/common/test_ingestion_server.py#L21-L31

https://github.com/WordPress/openverse/blob/9056029e6c354fe2dbf5b0ad49d60648bd380d27/catalog/tests/dags/common/sensors/test_utils.py#L18-L21

https://github.com/WordPress/openverse/blob/9056029e6c354fe2dbf5b0ad49d60648bd380d27/catalog/tests/dags/providers/test_provider_dag_factory.py#L27-L37

Although there are some slight differences, it would be ideal if we could find a way to combine these into a single fixture that's used within all the affected tests, rather than redefining the fixture for every test.

Additional context

Came up in discussing https://github.com/WordPress/openverse/pull/2209#discussion_r1232622950

ngken0995 commented 10 months ago

@AetherUnbound Can I be assigned to this issue?

ngken0995 commented 10 months ago

@AetherUnbound @sarayourfriend Here is a code I tried to implement in catalog/test/conftest/py.

@pytest.fixture()
def get_test_dag_id():
    return ""

@pytest.fixture()
def get_test_pool():
    return ""
@pytest.fixture()
def isTaskInstance():
    return False

@pytest.fixture()
def isPool():
    return False

@pytest.fixture(autouse=True)
def clean_db(get_test_dag_id, get_test_pool, isTaskInstance, isPool):
    with create_session() as session:
        # synchronize_session='fetch' required here to refresh models
        # https://stackoverflow.com/a/51222378 CC BY-SA 4.0
        session.query(DagRun).filter(DagRun.dag_id.startswith(get_test_dag_id)).delete(
            synchronize_session="fetch"
        )
        if isTaskInstance:
            session.query(TaskInstance).filter(
                TaskInstance.dag_id.startswith(get_test_dag_id)
            ).delete(synchronize_session="fetch")
        if isPool:
            session.query(Pool).filter(id == get_test_pool).delete()

The goal was to create pytest.fixture on each individual test to override get_test_dag_id, get_test_pool, isTaskInstance, and isPool global values. I thought clean_db will re-run again each time another test ran which I'm not sure if it did. It fails when I run just catalog/test. Am I going on the right direction? How do I make sure clean_db will trigger every time a new test is being ran?

sarayourfriend commented 10 months ago

As far as getting clean_db to run for each test, what you've got make sense to me. The only thing is if you want it to run after the test, you can yield at the start of the function. Pytest runs fixture functions before the test, and if they yield, it'll call .next on the generator after the test is down, sort of like a tear-down step. Not sure if that is the issue though.

I also don't know for sure if it's possible to create local overrides for fixtures the way you're proposing. I think the right approach for your idea (which sounds good to me), is to use the request fixture to get information from the test's module:

https://docs.pytest.org/en/7.1.x/how-to/fixtures.html#fixtures-can-introspect-the-requesting-test-context

The idea would be to define test_dag_id, test_pool, etc, as variables in the module. They can be functions too, if you want. Then retrieve them using something similar to the code in the example in the docs:

    test_dag_id = getattr(request.module, "test_dag_id", None)
    # probably raise an exception if `test_dag_id` is None?

Also, if clean_db only needs to run for a subset of tests, like a group of tests that are all in a subdirectory (even if nested), then it should go into a conftest.py in that directory, so that it doesn't run for tests that don't need it.

ngken0995 commented 10 months ago

    def do_execute(self, cursor, statement, parameters, context=None):
>       cursor.execute(statement, parameters)
E       sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) can't adapt type 'function'
E       [SQL: DELETE FROM dag_run WHERE dag_run.dag_id = %(dag_id_1)s]
E       [parameters: {'dag_id_1': <function test_dag_id at 0x7f94651632e0>}]
E       (Background on this error at: https://sqlalche.me/e/14/f405)

/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/default.py:736: ProgrammingError

I tried using the request fixture and there is an error when calling getattr. I have created a pull request using an override method.(link).

WordPress / openverse

Unify DAG creation/database cleaning fixtures for testing #2622

Description

Additional context