dataverbinders / statline-bq

Library to fetch CBS open datasets into parquet and optionally load into Google Cloud Storage and BigQuery
MIT License

Implement basic unit testing #72

Closed dkapitan closed 3 years ago

dkapitan commented 3 years ago

I propose we aim for a first 1.0 version, released on PyPI, targeting users who run the CLI to produce either just parquet files (using the pagination feature) or a full GBQ load.

As good practice (and good exercise), implement unit testing using pytest following the guidelines from Real Python:

Scope:

Starting point:

pytest --cov=statline_bq tests/
================================================================= test session starts ==================================================================
platform darwin -- Python 3.8.2, pytest-5.4.3, py-1.10.0, pluggy-0.13.1
rootdir: /Users/dkapitan/git/dataverbinders/statline-bq
plugins: cov-2.11.1
collected 1 item

tests/test_statline_bq.py .                                                                                                                      [100%]

---------- coverage: platform darwin, python 3.8.2-final-0 -----------
Name                      Stmts   Miss  Cover
---------------------------------------------
statline_bq/__init__.py       1      0   100%
statline_bq/cli.py           33     33     0%
statline_bq/config.py        19     19     0%
statline_bq/utils.py        388    388     0%
---------------------------------------------
TOTAL                       441    440     1%
dkapitan commented 3 years ago

@galamit86 I got up to 18%, but from here on it is mostly testing how the code interacts with GCP. Not sure what the best way to go about this is. It feels a bit like overkill to test the wrapper functions that we've written around the GCP Python clients.

The parts where logic is processed, e.g. looking up the latest version, seem worth writing tests for, but I'm not sure how to do that. I think that no longer qualifies as a unit test; it is actually an integration test. One thing I can think of is to set up (and tear down) a whole GCP environment for these types of integration tests, but that's just too much work ... ;-)

Perhaps I was a bit too optimistic. Reading up on Martin Fowler's Test Pyramid approach, going for 100% coverage doesn't make sense. The key thing is to test the public interface of the functions.

So perhaps it's good that when you are refactoring, you also explicitly mark which methods are public and which ones aren't, using the _function_name convention.

_single_leading_underscore: weak "internal use" indicator. E.g. from M import * does not import objects whose names start with an underscore.
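
As a minimal illustration (hypothetical function names, just to show the convention; not the actual statline_bq code):

def get_latest_version(versions):
    """Public: part of the interface that unit tests should target."""
    return max(versions, key=_version_number)


def _version_number(version):
    """Private helper: the leading underscore marks it as internal,
    and `from module import *` will not export it."""
    return int(version.lstrip("v"))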

What do you think?

dkapitan commented 3 years ago

For GCS we could use an emulator:
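
For example, with something like fake-gcs-server running locally (the port, bucket and file names below are placeholders), the Python storage client can be pointed at the emulator through the STORAGE_EMULATOR_HOST environment variable:

import os

from google.auth.credentials import AnonymousCredentials
from google.cloud import storage

# Point the client at the locally running emulator instead of real GCP.
os.environ["STORAGE_EMULATOR_HOST"] = "http://localhost:4443"

client = storage.Client(credentials=AnonymousCredentials(), project="test-project")
bucket = client.create_bucket("statline-bq-test")
bucket.blob("sample.parquet").upload_from_string(b"dummy bytes")

assert [b.name for b in client.list_blobs("statline-bq-test")] == ["sample.parquet"]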

galamit86 commented 3 years ago

@dkapitan

  1. I'm happy to add a separation between "user-API" and non-user functions to #73; after that is done we can come back here and discuss the best test strategy.
  2. I also agree that most of our more relevant tests seem to be integration tests rather than unit tests - I think that is because we mostly curate and assemble existing functions and methods - native Python, Google APIs, etc.
  3. For GCP, we could use an emulator - but if we use an actual GCP environment, why would we need to tear it down each time? What's wrong with creating, populating and maintaining an environment that exists for testing purposes only? I thought about having a single project (so we won't have the dev, test and prod division), with a couple of folders on GCS and a couple of datasets on BQ.
dkapitan commented 3 years ago

Sounds like a plan: separate public from private methods and then set up a GCP environment for testing. Let's use dataverbinders-test for that. Although it feels like a bit of overkill, I think it's best to keep them separated.

And since our whole setup is portable anyway, we could always copy the GCS datalake from dev/test to prod.

dkapitan commented 3 years ago

@galamit86 now that the refactoring is done, could you take it up from here with the following:

galamit86 commented 3 years ago

@dkapitan This is a good opportunity to be explicit about our git/GitHub workflow, and maybe fix some things I might be doing wrong. As it stands, to integrate our work and continue working, I would:

  1. Pull your work (branch origin/issue-73-implement-testing) to a new local branch on my machine.
  2. Rebase it onto my local master (which already includes the refactoring, and is of course the same as origin/master)
  3. Make any changes needed to make sure it runs correctly (none needed in this instance).
  4. Push it back to origin (origin/issue-73-implement-testing), using push --force
  5. Create a PR in GitHub, and use "rebase & merge" to merge origin/issue-73-implement-testing onto origin/master.
  6. Pull origin/master locally
  7. Delete issue-73-implement-testing from origin and locally.
  8. Create a new branch locally to continue work.

That last part feels strange (having to delete and recreate branches), but I haven't figured out a better way yet - as far as I understand it, rebasing creates a different history, requiring push --force and making the deletion necessary. Do you have a better way in mind?

dkapitan commented 3 years ago

@galamit86 Sounds good to me. Deleting it seems OK. As I understand it, strictly speaking a ticket/issue should have a non-changing scope. Because we don't adhere to that, we re-create branches as we go along. No tension in any case.

galamit86 commented 3 years ago

@dkapitan Great.

One more question - in the interest of keeping a tidy history, are you ok with using "fixup" to remove your WIP commit, effectively merging it with the commit that came before it?

As seen here

dkapitan commented 3 years ago

@galamit86 More than OK, it's definitely a lot cleaner.

galamit86 commented 3 years ago

@dkapitan What do you think of this implementation for an integration test?

There are a couple of specific issues (below) I'm not certain of, but I'm also wondering whether the general setup looks right to you.

Also added this implementation to test upload_to_gcp.
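
In outline, the setup looks something like the sketch below (a sketch only, not the linked implementation: the bucket name, fixture, and the way upload_to_gcp would be called are placeholders):

import pytest
from google.cloud import storage

TEST_PROJECT = "dataverbinders-test"      # dedicated test project
TEST_BUCKET = "statline-bq-integration"   # placeholder bucket name


@pytest.fixture
def test_bucket():
    """Hand the test a bucket in the test project and remove whatever it uploads."""
    client = storage.Client(project=TEST_PROJECT)
    bucket = client.bucket(TEST_BUCKET)
    uploaded = []
    yield bucket, uploaded
    for name in uploaded:
        bucket.blob(name).delete()


def test_upload_to_gcp(test_bucket, tmp_path):
    bucket, uploaded = test_bucket
    local_file = tmp_path / "dummy.parquet"
    local_file.write_bytes(b"placeholder bytes")

    # The real test calls statline_bq's upload_to_gcp here; this sketch uploads
    # directly with the GCS client just to show the assert-and-clean-up pattern.
    blob = bucket.blob(f"integration-test/{local_file.name}")
    blob.upload_from_filename(str(local_file))
    uploaded.append(blob.name)

    assert blob.exists()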

dkapitan commented 3 years ago

@galamit86 Storing 20MB of files is fine, we don't need LFS for now. I would elaborate the test a bit more, indeed to check a new version within v3, and also with _check_v4.

Finally, I would put the configuration of the actual GCP test project in config.toml instead of hardcoding it. Or a separate toml in tests is fine by me, too.

galamit86 commented 3 years ago

@dkapitan Updated code, adding two assertions:

I'm also reading the config from file: CONFIG = config.get_config("statline_bq/config.toml") - meaning I use the actual config file that we use for the library, and then I dot into the test part: CONFIG.gcp.test.project_id.
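
Spelled out, the top of the test module looks roughly like this (the config path and the gcp.test.project_id attribute chain are from the sentence above; the import is an assumption):

from statline_bq import config

# Read the same config file the library uses and dot into the test project settings.
CONFIG = config.get_config("statline_bq/config.toml")
GCP_PROJECT = CONFIG.gcp.test.project_id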

Finally, something strange happened when running the tests just now. One of the datasets, 84799NED, failed, and the reason is that the metadata file I downloaded 3 days ago and use as truth is slightly different from the one I get when downloading it now. The older one has "ID": 936, while the new one has "ID": 937. No other fields are different, including "Updated", "Modified" or "MetaDataModified".

I am not sure how to address this. My feeling is that it's not worth spending more time digging into this trying to understand what exactly happened, so I want to update the truth files and move on. What do you think? Maybe this also means we should introduce this "skip" into the actual test, and not compare the full metadata file?

dkapitan commented 3 years ago

@galamit86 agree: let's just update the truth files for now

galamit86 commented 3 years ago

@dkapitan Same issue happened again - the ID field in the metadata was updated from 937 to 939. The information is taken from the CBS catalog here. I have not found official documentation saying so, but it seems clear to me that the ID is not relevant and can be ignored for the purpose of validating the data. I plan to add a specific skip in the test, to keep it from failing unnecessarily.
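
A minimal sketch of that skip (the helper name and the set of ignored fields are mine, based on the discussion above):

# Compare downloaded metadata against the stored "truth" file, ignoring the
# volatile catalog ID that changes between downloads.
VOLATILE_FIELDS = {"ID"}


def assert_metadata_matches(new_meta: dict, truth_meta: dict):
    def strip(d):
        return {k: v for k, v in d.items() if k not in VOLATILE_FIELDS}

    assert strip(new_meta) == strip(truth_meta)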

Let me know if you have an objection or a different perspective.

dkapitan commented 3 years ago

No objection