ONSdigital / dp-data-pipelines

Pipeline specific python scripts and tooling for automated website data ingress.

implement upload client in pipeline #111

Open mikeAdamss opened 5 months ago

mikeAdamss commented 5 months ago

What is this

We need to add the block of logic to the pipeline that uses the upload client to upload the csv and any supplementary distributions to the upload service.

This will not work yet, as platform and auth still need to be sorted out. The task is to get the logic in place so that it should work once those things are ready, and to bolt down the behaviour with acceptance tests.

What to do

There is a client for this in dp-python-tools: https://github.com/ONSdigital/dp-python-tools/tree/develop/dpytools/http.

You'll need to make some assumptions to do this:

That should be enough to put the logic in place. You're aiming for something (super roughly) like this:


# note - all fallible steps in their own try/except please

import os

upload_client: UploadClient = get_upload_client()
upload_bucket = os.environ["UPLOAD_SERVICE_S3_BUCKET"]
florence_token = get_florence_token()

# for the csv
upload_client.upload_csv(
    <path to csv>,
    upload_bucket,
    florence_token
)

# pseudo code loop
for supplementary_distribution in supplementary_distributions:

    # look at the file extension
    # call the upload client with the appropriate method (see the sketch below)
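
For the supplementary distributions loop, a minimal sketch of the extension dispatch might look like the below. This is assumption-heavy: upload_xml and upload_sdmx are placeholder method names (only upload_csv appears above), so check what the dpytools client actually exposes, and supplementary_distributions is taken to be a list of file paths.

from pathlib import Path

for supplementary_distribution in supplementary_distributions:
    # the extension decides which (assumed) upload method we call
    extension = Path(supplementary_distribution).suffix.lower()

    try:
        if extension == ".xml":
            # hypothetical method name - confirm against the dpytools client
            upload_client.upload_xml(supplementary_distribution, upload_bucket, florence_token)
        elif extension == ".sdmx":
            # hypothetical method name - confirm against the dpytools client
            upload_client.upload_sdmx(supplementary_distribution, upload_bucket, florence_token)
        else:
            raise NotImplementedError(f"No upload method mapped for extension {extension}")
    except Exception as err:
        raise Exception(
            f"Failed to upload supplementary distribution {supplementary_distribution}"
        ) from err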

...but... but... but, how do I confirm it's working?

We wrote some acceptance test steps that allow you to capture the outgoing http requests from the pipeline. See these steps here: https://github.com/ONSdigital/dp-data-pipelines/blob/sandbox/features/temporary.feature
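
For context on how request capture along those lines can work (this is a sketch, not the contents of temporary.feature - the real step wording and internals will differ), the idea is to intercept outgoing requests made via the requests library and record them on the behave context for later assertions:

import re

import responses
from behave import given

@given("outgoing http requests are being captured")
def capture_outgoing_requests(context):
    # Intercept anything sent via requests so the pipeline never hits a real backend,
    # while keeping a record of every call for later Then steps to inspect.
    context.captured = responses.RequestsMock(assert_all_requests_are_fired=False)
    context.captured.start()
    context.captured.add(responses.POST, re.compile(".*"), status=200)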

We need to update the acceptance tests in dataset_ingress_v1.feature to confirm the upload client is making the expected http posts.

So, roughly (you'll likely need to finagle the logic a little), it becomes something like:

    Given a temporary source directory of files
        | file     | fixture               |
        | data.xml | esa2010_test_data.xml |
    And a dataset id of 'valid'
    And v1_data_ingress starts using the temporary source directory
    Then the pipeline should generate no errors
    And I read the csv output 'data.csv'
    And the csv output should have '9744' rows
    And the csv output has the columns
        | ID | Test | Name xml:lang |
    And I read the metadata output 'metadata.json'
    And the metadata should match 'fixtures/correct_metadata.json'
    And the backend receives a request to "/upload-new"
    And the json payload received should match "fixtures/whatever-this-needs-to-be.json"
    And the headers received should match
        | key            | value    |
        | something      | expected |
        | something else | expected |
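
Behind those three new steps, hypothetical behave step definitions could look something like the below, assuming the captured requests are available on the context (for example via a capture step like the sketch above); the real wording should follow whatever the existing steps use.

import json

from behave import then

@then('the backend receives a request to "{endpoint}"')
def assert_request_received(context, endpoint):
    urls = [call.request.url for call in context.captured.calls]
    assert any(url.endswith(endpoint) for url in urls), f"No request to {endpoint}, got: {urls}"

@then('the json payload received should match "{fixture_path}"')
def assert_payload_matches(context, fixture_path):
    with open(fixture_path) as f:
        expected = json.load(f)
    received = json.loads(context.captured.calls[-1].request.body)
    assert received == expected

@then("the headers received should match")
def assert_headers_match(context):
    # context.table holds the | key | value | rows from the feature file
    received_headers = context.captured.calls[-1].request.headers
    for row in context.table:
        assert received_headers.get(row["key"]) == row["value"]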

The overarching point is that we're not looking to test that an upload service exists and is working (not our problem); we're looking to test that the pipeline is making the required outgoing requests.

Note - don't add a feature flag for this; it's the last thing that happens, and I'm entirely fine with the pipelines erroring on the last step until we get them into a real env.

Acceptance Criteria

mikeAdamss commented 5 months ago

Also rename temporary.py to something else.

Also remove temporary.feature.