@galamit86 I got coverage up to 18%, but from here on it is mostly testing how the code interacts with GCP, and I'm not sure what the best way to go about this is. It feels a bit overkill to test the wrapper functions we've written around the GCP Python clients.
The parts where logic is processed, e.g. looking up the latest version, seem worth writing tests for, but I'm not sure how to do that. I think that no longer qualifies as a unit test; it actually is an integration test. One thing I can think of is to set up (and tear down) a whole GCP environment for these types of integration tests, but that's just too much work ... ;-)
Perhaps I was a bit too optimistic. Reading up on Martin Fowler's Test Pyramid approach, going for 100% coverage doesn't make sense. The key thing is to test the public interface of the functions.
So perhaps it's good that, while you are refactoring, you also explicitly mark which methods are public and which ones aren't, using the _function_name convention:
_single_leading_underscore: weak "internal use" indicator. E.g. from M import * does not import objects whose names start with an underscore.
What do you think?
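For illustration, a minimal sketch of what that split could look like (the names below are hypothetical, not taken from the current codebase):

```python
# Hypothetical names, just to illustrate the public/_private split; not the actual statline-bq API.

def get_latest_version(dataset_id: str) -> str:
    """Public: part of the interface that gets tested."""
    return max(_list_versions(dataset_id))


def _list_versions(dataset_id):
    """Internal helper: the leading underscore marks it as "internal use" per PEP 8."""
    return ["v3"]
```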
For GCS we could use an emulator: fake-gcs-server, written in Go.
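A possible way to point the Python client at such an emulator (a sketch, assuming fake-gcs-server is already running locally, e.g. via Docker, on port 4443; the project and bucket names are placeholders):

```python
import os

from google.auth.credentials import AnonymousCredentials
from google.cloud import storage

# Assumption: fake-gcs-server is listening on localhost:4443.
os.environ["STORAGE_EMULATOR_HOST"] = "http://localhost:4443"

# The emulator does not check credentials, so anonymous ones are enough.
client = storage.Client(project="test-project", credentials=AnonymousCredentials())
bucket = client.bucket("statline-bq-test")  # hypothetical bucket name
print([blob.name for blob in client.list_blobs(bucket)])
```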
@dkapitan We could set up a GCP environment for testing (dev, test and prod division), with a couple of folders on GCS and a couple of datasets on BQ.
Sounds like a plan: separate public from private methods and then set up a GCP environment for testing. Let's use dataverbinders-test for that. Although it feels like a bit of overkill, I think it's best to keep them separated. And since our whole setup is portable anyway, we could always copy the GCS datalake from dev/test to prod.
@galamit86 Now that the refactoring is done, could you take it up from here in the dataverbinders-test project?
@dkapitan This is a good opportunity to be explicit about our git/github workflow, and maybe fix some things I might be doing wrong. As it stands, to integrate our work and continue working, I would:
1. Check out the remote branch (origin/issue-73-implement-testing) to a new local branch on my machine.
2. Rebase it onto master (which already includes the refactoring, and is of course the same as origin/master).
3. Push it back to origin/issue-73-implement-testing, using push --force.
4. Merge origin/issue-73-implement-testing onto origin/master.
5. Pull origin/master locally.
6. Delete issue-73-implement-testing from origin and locally.
That last part feels strange (having to delete and recreate branches), but I haven't figured out a better way yet - as far as I understand it, rebasing creates a different history, requiring push --force and making the deletion necessary. Do you have a better way in mind?
@galamit86 Sounds good to me. Deleting it seems OK. As I understand it, strictly speaking a ticket/issue should have a non-changing scope. Because we don't adhere to that, we re-create branches as we go along. No tension in any case.
@dkapitan Great.
One more question - in the interest of keeping a tidy history, are you ok with using "fixup" to remove your WIP commit, effectively merging it with the commit that came before it?
As seen here
@galamit86 More than ok, it is definitely a lot cleaner.
@dkapitan What do you think of this implementation for an integration test?
There are a couple of specific issues (below) I'm not certain of, but I'm also wondering if the general setup looks proper to you.
- The datasets I use as truth are stored under tests/data/SOME_ID. I've only pushed one of them (83585NED), and all 4 datasets together are about 20MB. Is there a way to place them elsewhere? Should we look at Git large file storage for this?
- If CBS updates a dataset (still within v3), the stored truth files will no longer match and the test will fail. Maybe I can do the check on last_modified that we do in _skip_dataset beforehand, and indicate that somehow?
- A dataset could also be upgraded from v3 to v4. main will automatically recognise this, and process the v4 dataset. This will also cause a failure, although in a less straightforward way. What will happen in this case is that GCS_FOLDER = f"{SOURCE}/{ODATA_VERSION}/{ID}/{datetime.today().date().strftime('%Y%m%d')}" will become a folder that does not exist (there should be no v3 with today's date). The blob generator will produce no items, and the assertion_paths dictionary will be empty - failing on the first assert (see the sketch below). Again, I can use _check_v4 to check the version matches, and indicate that on failure.
Also added this implementation to test upload_to_gcp.
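To make that v3-to-v4 failure mode concrete, a rough sketch of the kind of check described above (the constants, bucket name and test body are placeholders, not the actual test code):

```python
from datetime import datetime

from google.cloud import storage

# Placeholder values; in the real test these would come from the dataset and config.
SOURCE = "cbs"
ODATA_VERSION = "v3"
ID = "83585NED"
GCS_FOLDER = f"{SOURCE}/{ODATA_VERSION}/{ID}/{datetime.today().date().strftime('%Y%m%d')}"


def test_upload_to_gcs_sketch():
    client = storage.Client()
    blobs = list(client.list_blobs("some-test-bucket", prefix=GCS_FOLDER))

    # If CBS has moved this dataset from v3 to v4, main() uploads under v4 instead,
    # this v3 prefix stays empty, and we want a clear message here rather than an
    # empty assertion_paths dict failing further down.
    assert blobs, f"No blobs under {GCS_FOLDER}; has the dataset moved to a new odata version?"
```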
I use config.get_config() to get the config items. Do you think that's ok? I saw you create a mock config file to test get_config itself, but once that is tested, I guess it's okay to actually use the function in a test and expect the config file to exist?
@galamit86 Storing 20MB of files is fine, we don't need LFS for now. I would elaborate the test a bit more, indeed to check a new version within v3, and also with _check_v4.
Finally, I would put the configuration of the actual GCP test project in config.toml instead of hardcoding it. Or a separate toml in tests is fine by me, too.
@dkapitan Updated code, adding two assertions:
- new_modified == test_modified, checking the metadata modified field
- odata_version == ODATA_VERSION, checking whether the dataset was updated from v3 to v4
I'm also reading the config from file: CONFIG = config.get_config("statline_bq/config.toml") - meaning I use the actual config file that we use for the library, and then I dot into the test part: CONFIG.gcp.test.project_id.
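In the test module that boils down to something like this (a sketch; the import path is an assumption based on the snippet above):

```python
from statline_bq import config  # assumed import path for the config helper

# Use the library's own config file, then dot into the test section,
# so the integration test never touches the prod project.
CONFIG = config.get_config("statline_bq/config.toml")
GCP_PROJECT_ID = CONFIG.gcp.test.project_id
```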
Finally, something strange happened when running the tests just now. One of the datasets, 84799NED, failed, and the reason is that the metadata file I downloaded 3 days ago, which I use as truth, is slightly different than the one I get when downloading it now. The older one has "ID": 936, while the new one has "ID": 937. No other fields are different, including "Updated", "Modified" or "MetaDataModified".
I am not sure how to address this. My feeling is that it's not worth spending more time digging into this trying to understand what exactly happened, so I want to update the truth files and move on. What do you think? Maybe this also means we should introduce this "skip" into the actual test, and not compare the full metadata file?
@galamit86 agree: let's just update the truth files for now
@dkapitan Same issue happened again - the ID field updated from 937 to 939 in the metadata.
The information is taken from the CBS catalog here. I have not found official documentation saying so, but it seems clear to me that the ID is not relevant and can be ignored for the purpose of validating the data. I plan to add a specific skip in the test, to keep the tests from failing unnecessarily all the time.
Let me know if you have an objection or a different perspective.
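For concreteness, the skip could look roughly like this when comparing the downloaded metadata against the truth file (a sketch; the helper name is made up):

```python
IGNORED_METADATA_FIELDS = {"ID"}  # churns in the CBS catalog without a real content change


def assert_metadata_equal(truth: dict, fresh: dict) -> None:
    """Compare two metadata dicts, ignoring fields known to change spuriously."""
    filtered_truth = {k: v for k, v in truth.items() if k not in IGNORED_METADATA_FIELDS}
    filtered_fresh = {k: v for k, v in fresh.items() if k not in IGNORED_METADATA_FIELDS}
    assert filtered_truth == filtered_fresh
```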
No objection
I propose we aim for a first 1.0 version which is released on pypi and aims for users to use the CLI, resulting either in just parquet files (using the pagination feature) or full-on GBQ.
As good practice (and good exercise), implement unit testing using pytest, following the guidelines from Real Python.
Scope:
- pytest-cov, aim for 90%+ code coverage
- master gets pushed to pypi
Starting point: