getmoto / moto

A library that allows you to easily mock out tests based on AWS infrastructure.
http://docs.getmoto.org/en/latest/
Apache License 2.0
7.49k stars 1.99k forks source link

Unable to easily mock Athena result CSV #7702

Open dmarra opened 1 month ago

dmarra commented 1 month ago

We are able to mock Athena query results via:

   resp = requests.post(
        "http://motoapi.amazonaws.com/moto-api/static/athena/query-results",
        json=query_results,
    )
    assert resp.status_code == 201

But there is no way to mock the output of get_execution_results. this poses a few problems:

  1. unable to have any control over request status (in cases where you want to test that retries are happening for instance)
  2. unable to have any control over the output location

I am more interested in the latter. This would be a lot more useful if we could somehow assign results via a CSV, and have the query-results mocked from that. But in the simplest case, being able to control that datapoint to point to a file we have put to a mock bucket would go a long way in the case where the code under test needs to access the file directly.

Here is an example of how I had to hack this into my tests:

from moto.athena import athena_backends
from moto.athena.models import Execution
from moto.core import DEFAULT_ACCOUNT_ID

@pytest.fixture
@mock_aws
def mock_athena_results():
    # place the mocked athena results in "s3"
    s3 = boto3.client("s3", region_name=os.getenv("AWS_DEFAULT_REGION"))
    bucket_name = 'test-bucket'
    key = f"query_results/output.csv"

    with open('some/csv/file.csv', "r") as f:
        contents = f.read()
        s3.put_object(Bucket=bucket_name, Key=key, Body=contents)

    # Hack moto to mock the execution results to have the correct path to our csv results
    exec = Execution(
        query="SELECT * FROM notused",
        config={
            "OutputLocation": f"s3://test-bucket/query_results/output.csv"
        },
        context={
            'Database': 'default'
        },
        workgroup={},
        execution_parameters={}
    )
    # moto tries to mock the file location to _look_ correct, but we need it to be an actual result
    exec.config['OutputLocation'] = f"s3://test-bucket/query_results/output.csv"

    # This is a hack to get moto to return the correct execution object by having a factory that
    # pretends to be a list of executions
    class MockExecutionFactory:
        def __get__(self, instance, owner):
            return [
                exec
            ]

        def __set__(self, instance, value):
            pass

        def __setitem__(self, key, value):
            pass

        def __getitem__(self, key):
            return exec

    athena_backends[DEFAULT_ACCOUNT_ID]["us-east-1"].executions = MockExecutionFactory()
    return

The above works, but its pretty janky. I suppose this is more of a feature request than anything. Or even better yet, having a way to control execution id (not with seeding, thats still too unpredictable and fragile) such that it can be explicitly set.

bblommers commented 1 month ago

Hi @dmarra, thanks for raising this.

My assumption is that the application takes whatever OutputLocation is given, downloads the result from S3 and does whatever it needs to do with the result.

Would it work if Moto would actually write the configured query results to S3? Then the actual OutputLocation doesn't matter, and the only test setup is to configure the expected results via the static/athena/query-results endpoint.

dmarra commented 1 month ago

That could work! The only issue with that is that the Athena output JSON is.... perhaps a bit overengineered. It's a pain to mock many rows of data in that format. But that being said, if this is the easiest way to get the enhancement in, then I think its way better than nothing. Makes sense too, since it works with the current design.

dmarra commented 1 month ago

To clarify:

  1. It would have to respect the bucket given as output (but the file name could be the auto-generated execution ID)
  2. The output format is CSV. I don't think Athena has a way to change that, so it should be fine to just always have the output in CSV format in the bucket