KarenJewell commented 2 years ago

See conversation in context: #151

from @gavbarnett

generate_new_mock_data.py could certainly be improved. You'd likely want it to be more selective in what it does, rather than doing a blanket update of everything. But that's a separate issue/feature.

and from me (granted, may or may not be related):

But because we are using exact copies of data files (including content), tests seem to fail wherever the content of the expected output (static) and the test output (generated) differs significantly; in the cases I've seen, it is the order of the content that differs. This makes the tests really easy to fail, since we don't consistently order the content post-retrieval. I also have much bigger questions about whether we should be using exact copies of data at all, rather than making a dummy set of data for test purposes only. Exact copies of data will age really quickly.
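
To illustrate the ordering point, here is a minimal sketch of an order-insensitive CSV comparison. It is not what the tests currently do; the function names are hypothetical, and it assumes the first row of each file is a header and that row order carries no meaning.

```python
import csv
import io


def csv_rows_sorted(text):
    """Parse CSV text and return (header, data rows sorted)."""
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return [], []
    # Assumption: first row is the header, remaining rows are unordered data.
    return rows[0], sorted(rows[1:])


def same_csv_content(expected_text, actual_text):
    """True when both CSVs share a header and contain the same rows in any order."""
    return csv_rows_sorted(expected_text) == csv_rows_sorted(actual_text)
```

A test could then assert `same_csv_content(expected, generated)` instead of comparing the files byte for byte. Sorting the rows once after retrieval in the scrapers themselves would be an alternative that keeps the exact-match tests.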

gavbarnett commented 2 years ago

To expand further:

Currently this script gets the JSON output from the URLs listed in sources.csv and stores these as mock API data for future tests.

It does this so pytest doesn't ever need to call the real URL to get data (as that data changes all the time and we need static tests). (Some changes were made to the API scrapers to accommodate this shim/mock redirection for testing.)
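
For anyone unfamiliar with the idea, the redirection can be pictured roughly like this. This is only a sketch, not the code in the scrapers: `MOCK_DIR`, `mock_path_for`, `get_json` and the file naming scheme are all hypothetical.

```python
import json
from pathlib import Path
from urllib.parse import urlparse

MOCK_DIR = Path("tests/mock_data")  # hypothetical location of the stored JSON


def mock_path_for(url, mock_dir=MOCK_DIR):
    """Map a source URL to a deterministic local file name (hypothetical scheme)."""
    parsed = urlparse(url)
    safe = (parsed.netloc + parsed.path).strip("/").replace("/", "_")
    return Path(mock_dir) / f"{safe}.json"


def get_json(url, mock_dir=None, fetch=None):
    """Return parsed JSON from the mock file when mock_dir is set, else call fetch(url)."""
    if mock_dir is not None:
        # Test mode: never touch the network, read the stored response instead.
        return json.loads(mock_path_for(url, mock_dir).read_text(encoding="utf-8"))
    if fetch is None:
        raise ValueError("no fetch function supplied for live requests")
    return fetch(url)
```

In tests the scraper would be called with `mock_dir` set, so the same parsing code runs against the stored JSON rather than the live API.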

The script then also generates the API scrapers' CSV files from the mocked JSON output above. This is done to create an expected result for future tests.

When the script is run it deletes all existing mock data (JSON & CSV output) and regenerates them.

The intention is that this script is run infrequently, when any of the following applies:

the JSON response format changes from one of the APIs
the CSV handling format changes
we feel like more up-to-date test data is appropriate

Suggested Improvements

Make the following possible via command-line flags (or similar) when calling the script:

Separate the CSV regeneration from the JSON collection. This would allow us to change the CSV formatting while keeping the existing JSON data.
Allow for only updating certain sources.
Allow for duplicate sources across time. (I think this is simply a matter of adding a timestamp to the file names.)
Work out a way to handle linked-list-style JSON responses. My current workaround is to pretend they don't exist 😂. This might only be possible once the project folder structure is better.
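
The first three items could be sketched with `argparse` along these lines. None of these flags exist in the script today; the flag names, `output_name` and `utc_stamp` are hypothetical and only meant to show the shape of the idea.

```python
import argparse
from datetime import datetime, timezone


def build_parser():
    """Command-line options for a more selective regeneration (hypothetical flags)."""
    parser = argparse.ArgumentParser(description="Regenerate mock test data (sketch).")
    parser.add_argument("--skip-json", action="store_true",
                        help="reuse the existing mock JSON instead of calling the live URLs")
    parser.add_argument("--skip-csv", action="store_true",
                        help="do not regenerate the expected CSV output")
    parser.add_argument("--source", action="append", default=[], metavar="NAME",
                        help="only update this source; may be given more than once")
    parser.add_argument("--timestamp", action="store_true",
                        help="add a UTC timestamp to the generated file names")
    return parser


def utc_stamp(now=None):
    """Compact UTC timestamp that is safe to use in a file name."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("%Y%m%dT%H%M%SZ")


def output_name(source, extension, stamp=None):
    """File name for one source, with an optional timestamp suffix."""
    suffix = f"_{stamp}" if stamp else ""
    return f"{source}{suffix}.{extension}"
```

With something like this, `python generate_new_mock_data.py --skip-json` would rebuild only the CSV files from the JSON already on disk, and `--source NAME` would limit the run to the named sources. An empty `--source` list would mean "all sources", which keeps the current behaviour as the default.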

gavbarnett commented 2 years ago

Seems like there is still a newline character issue that needs resolved here too.🪲

It's got something to do with how Git automatically converts line endings behind the scenes, and it's now messing up the mock CSV files on Windows.
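
I haven't pinned down the exact cause, so treat this as a sketch of one way to make the files and the comparison independent of platform line endings. `write_mock_csv` and `read_normalised` are hypothetical helpers, not functions in the repo.

```python
import csv


def write_mock_csv(path, rows):
    """Write rows with LF line endings on every platform."""
    # newline="" stops Python translating line endings on Windows;
    # lineterminator overrides the csv module's default of CRLF.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle, lineterminator="\n").writerows(rows)


def read_normalised(path):
    """Read a file as text with CRLF and lone CR converted to LF, for comparisons."""
    with open(path, "rb") as handle:
        data = handle.read()
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n").decode("utf-8")
```

Comparing `read_normalised(expected)` with `read_normalised(generated)` should then give the same result whether or not the checkout converted the expected files to CRLF.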

OpenDataScotland / the_od_bods

Improve generate_new_mock_data.py #159