In #541 I introduced a test suite based on VCR.py cassettes. Over time, however, I've learned that the Google Trends API returns different data for what should be already consolidated information (e.g. search terms for 2021). For example, if we try to update the cassette of the test test_interest_over_time_ok right now, we get this error:
E AssertionError: DataFrame.iloc[:, 0] (column name="pizza") are different
E
E DataFrame.iloc[:, 0] (column name="pizza") values are different (60.0 %)
E [index]: [2021-01-01T00:00:00.000000000, 2021-01-02T00:00:00.000000000, 2021-01-03T00:00:00.000000000, 2021-01-04T00:00:00.000000000, 2021-01-05T00:00:00.000000000]
E [left]: [100, 81, 78, 48, 51]
E [right]: [100, 84, 78, 50, 52]
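To make the "60.0 %" figure concrete, here is a small sketch that rebuilds the two columns from the values printed in the error and counts how many rows differ. The variable names are only illustrative, and I'm not assuming which side is the recorded value and which is the fresh one.

```python
import pandas as pd

# Values copied from the [left] and [right] lines of the error above.
index = pd.date_range("2021-01-01", periods=5, freq="D")
left = pd.Series([100, 81, 78, 48, 51], index=index, name="pizza")
right = pd.Series([100, 84, 78, 50, 52], index=index, name="pizza")

# Element-wise comparison: True where the two responses disagree.
changed = left != right

# Share of rows that differ, as a percentage.
percent_changed = changed.mean() * 100

print(changed.sum(), "of", len(changed), "rows differ")
print(f"{percent_changed:.1f} % of the values are different")
```

Three of the five days changed, and only by a few points each, which is why the failure looks like API drift rather than a bug in the library.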
My approach to this problem was to document how to replace these results in the repository's contributing guidelines. However, while trying to fix #566 I found that I have to update all the cassettes one by one; it's a big, time-consuming chore that not many contributors may be (reasonably) willing to do.
Ideally we should have a system that automates the update of the expected DataFrames while letting the user inspect each new result to decide whether it's valid (e.g. if a bad implementation produces all zeroes, we should know before replacing all the expected DataFrames).
I propose a system to update the cassettes almost automatically by leveraging the management of the DataFrame responses in a pytest fixture:
1. Instead of being a visible pd.DataFrame in the code, make the expected DataFrame of every test a serialized JSON file at a path derived from the test name. I chose JSON because it preserves the column dtypes and is human readable.
2. Inject the expected DataFrame at runtime through the fixture.
3. Create a custom assert_frame_equal that raises a specific exception type containing both the expected DataFrame and the DataFrame generated by the test.
4. Add a custom pytest flag --rewrite-dataframes that:
   - catches our custom exception;
   - shows both DataFrames to the user and asks them to evaluate whether the new result is valid;
   - if the user answers yes, replaces the expected DataFrame with the new result and runs the test again.
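A rough sketch of how these pieces could fit together, to make the idea easier to discuss. Every name here (`DataFrameMismatch`, `expected_path`, `save_expected`, `load_expected`, `assert_frame_matches`, `review_and_rewrite`) is hypothetical, and so is the directory layout. I'm also assuming `orient="table"` for the JSON files, since that orient stores a schema next to the data; other orients do not record dtypes. The fixture wiring and the automatic re-run are left out.

```python
from pathlib import Path

import pandas as pd
from pandas.testing import assert_frame_equal


class DataFrameMismatch(AssertionError):
    """Hypothetical exception that carries both frames for later review."""

    def __init__(self, expected, actual, message):
        super().__init__(message)
        self.expected = expected
        self.actual = actual


def expected_path(base_dir, test_name):
    # Hypothetical layout: one JSON file per test, named after the test.
    return Path(base_dir) / f"{test_name}.json"


def save_expected(df, path):
    # Assumption: orient="table" writes a schema so dtypes can be restored.
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_json(path, orient="table", indent=2)


def load_expected(path):
    return pd.read_json(path, orient="table")


def assert_frame_matches(expected, actual):
    # Same check as pandas' assert_frame_equal, but the failure keeps both frames.
    try:
        assert_frame_equal(expected, actual)
    except AssertionError as exc:
        raise DataFrameMismatch(expected, actual, str(exc)) from exc


def review_and_rewrite(path, mismatch, confirm=input):
    # Show both frames and overwrite the stored one only on an explicit "y".
    print("Expected:", mismatch.expected, "New result:", mismatch.actual, sep="\n")
    answer = confirm("Replace the expected DataFrame with the new result? [y/N] ")
    if answer.strip().lower() == "y":
        save_expected(mismatch.actual, path)
        return True
    return False


def pytest_addoption(parser):
    # Would live in conftest.py; registers the proposed command-line flag.
    parser.addoption(
        "--rewrite-dataframes",
        action="store_true",
        default=False,
        help="Review mismatching DataFrames and optionally replace the stored ones.",
    )
```

A fixture could then read the flag with `request.config.getoption("--rewrite-dataframes")` and call `review_and_rewrite` when a `DataFrameMismatch` is raised. Note that prompting through `input` only works when pytest's output capturing is disabled (`-s`), so the final version may need a different way to ask the user.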
There may be some details I'm missing right now, but that's the main idea.
Please @emlazzarin tell me if it looks good to you and I'll implement it.