@galamit86 I got coverage up to 18%, but from here on it is mostly testing how the code interacts with GCP, and I'm not sure what the best way to go about this is. It feels a bit overkill to test the wrapper functions we've written around the GCP Python clients.
The parts where logic is processed, e.g. looking up the latest version, seem worth writing tests for, but I'm not sure how to do that. I think that no longer qualifies as a unit test; it actually is an integration test. One thing I can think of is to set up (and tear down) a whole GCP environment for these types of integration tests, but that's just too much work ... ;-)
Perhaps I was a bit too optimistic. Reading up on Martin Fowler's Test Pyramid approach, going for 100% coverage doesn't make sense. The key thing is to test the public interface of the functions.
So perhaps it's good that, while you are refactoring, you also explicitly mark which methods are public and which ones aren't, using the _function_name convention:
_single_leading_underscore: weak "internal use" indicator. E.g. from M import * does not import objects whose names start with an underscore.
What do you think?
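For illustration, a minimal sketch of what that split could look like (the names below are hypothetical, not taken from the current codebase):

```python
# Hypothetical names, just to illustrate the public/_private split; not the actual statline-bq API.

def get_latest_version(dataset_id: str) -> str:
    """Public: part of the interface that gets tested."""
    return max(_list_versions(dataset_id))


def _list_versions(dataset_id):
    """Internal helper: the leading underscore marks it as "internal use" per PEP 8."""
    return ["v3"]
```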
For GCS we could use an emulator: fake-gcs-server, written in Go.
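A possible way to point the Python client at such an emulator (a sketch, assuming fake-gcs-server is already running locally, e.g. via Docker, on port 4443; the project and bucket names are placeholders):

```python
import os

from google.auth.credentials import AnonymousCredentials
from google.cloud import storage

# Assumption: fake-gcs-server is listening on localhost:4443.
os.environ["STORAGE_EMULATOR_HOST"] = "http://localhost:4443"

# The emulator does not check credentials, so anonymous ones are enough.
client = storage.Client(project="test-project", credentials=AnonymousCredentials())
bucket = client.bucket("statline-bq-test")  # hypothetical bucket name
print([blob.name for blob in client.list_blobs(bucket)])
```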
@dkapitan We could set up a GCP environment for testing (dev, test and prod division), with a couple of folders on GCS and a couple of datasets on BQ.
Sounds like a plan: separate public from private methods and then set up a GCP environment for testing. Let's use dataverbinders-test for that. Although it feels like a bit of overkill, I think it's best to keep them separated. And since our whole setup is portable anyway, we could always copy the GCS datalake from dev/test to prod.
@galamit86 Now that the refactoring is done, could you take it up from here in the dataverbinders-test project?
@dkapitan This is a good opportunity to be explicit about our git/github workflow, and maybe fix some things I might be doing wrong. As it stands, to integrate our work and continue working, I would:
1. Check out the remote branch (origin/issue-73-implement-testing) to a new local branch on my machine.
2. Rebase it onto master (which already includes the refactoring, and is of course the same as origin/master).
3. Push it back to origin/issue-73-implement-testing, using push --force.
4. Merge origin/issue-73-implement-testing onto origin/master.
5. Pull origin/master locally.
6. Delete issue-73-implement-testing from origin and locally.
That last part feels strange (having to delete and recreate branches), but I haven't figured out a better way yet - as far as I understand it, rebasing creates a different history, requiring push --force and making the deletion necessary. Do you have a better way in mind?
@galamit86 Sounds good to me. Deleting it seems OK. As I understand it, strictly speaking a ticket/issue should have a non-changing scope. Because we don't adhere to that, we re-create branches as we go along. No tension in any case.
@dkapitan Great.
One more question - in the interest of keeping a tidy history, are you ok with using "fixup" to remove your WIP commit, effectively merging it with the commit that came before it?
As seen here
@galamit86 More than ok, it is definitely a lot cleaner.
@dkapitan What do you think of this implementation for an integration test?
There are a couple of specific issues (below) I'm not certain of, but I'm also wondering if the general setup looks proper to you.
- The datasets I use as truth are stored under tests/data/SOME_ID. I've only pushed one of them (83585NED), and all 4 datasets together are about 20MB. Is there a way to place them elsewhere? Should we look at Git large file storage for this?
- If CBS updates a dataset (still within v3), the stored truth files will no longer match and the test will fail. Maybe I can do the check on last_modified that we do in _skip_dataset beforehand, and indicate that somehow?
- A dataset could also be upgraded from v3 to v4. main will automatically recognise this, and process the v4 dataset. This will also cause a failure, although in a less straightforward way. What will happen in this case is that GCS_FOLDER = f"{SOURCE}/{ODATA_VERSION}/{ID}/{datetime.today().date().strftime('%Y%m%d')}" will become a folder that does not exist (there should be no v3 with today's date). The blob generator will produce no items, and the assertion_paths dictionary will be empty - failing on the first assert (see the sketch below). Again, I can use _check_v4 to check the version matches, and indicate that on failure.
Also added this implementation to test upload_to_gcp.
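To make that v3-to-v4 failure mode concrete, a rough sketch of the kind of check described above (the constants, bucket name and test body are placeholders, not the actual test code):

```python
from datetime import datetime

from google.cloud import storage

# Placeholder values; in the real test these would come from the dataset and config.
SOURCE = "cbs"
ODATA_VERSION = "v3"
ID = "83585NED"
GCS_FOLDER = f"{SOURCE}/{ODATA_VERSION}/{ID}/{datetime.today().date().strftime('%Y%m%d')}"


def test_upload_to_gcs_sketch():
    client = storage.Client()
    blobs = list(client.list_blobs("some-test-bucket", prefix=GCS_FOLDER))

    # If CBS has moved this dataset from v3 to v4, main() uploads under v4 instead,
    # this v3 prefix stays empty, and we want a clear message here rather than an
    # empty assertion_paths dict failing further down.
    assert blobs, f"No blobs under {GCS_FOLDER}; has the dataset moved to a new odata version?"
```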
I use config.get_config() to get the config items. Do you think that's ok? I saw you create a mock config file to test get_config itself, but once that is tested, I guess it's okay to actually use the function in a test and expect the config file to exist?
@galamit86 Storing 20MB of files is fine, we don't need LFS for now. I would elaborate the test a bit more, indeed to check a new version within v3, and also with _check_v4.
Finally, I would put the configuration of the actual GCP test project in config.toml instead of hardcoding it. Or a separate toml in tests is fine by me, too.
@dkapitan Updated code, adding two assertions:
- new_modified == test_modified, checking the metadata modified field
- odata_version == ODATA_VERSION, checking whether the dataset was updated from v3 to v4
I'm also reading the config from file: CONFIG = config.get_config("statline_bq/config.toml") - meaning I use the actual config file that we use for the library, and then I dot into the test part: CONFIG.gcp.test.project_id.
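In the test module that boils down to something like this (a sketch; the import path is an assumption based on the snippet above):

```python
from statline_bq import config  # assumed import path for the config helper

# Use the library's own config file, then dot into the test section,
# so the integration test never touches the prod project.
CONFIG = config.get_config("statline_bq/config.toml")
GCP_PROJECT_ID = CONFIG.gcp.test.project_id
```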
Finally, something strange happened when running the tests just now. One of the datasets, 84799NED, failed, and the reason is that the metadata file I downloaded 3 days ago, which I use as truth, is slightly different than the one I get when downloading it now. The older one has "ID": 936, while the new one has "ID": 937. No other fields are different, including "Updated", "Modified" or "MetaDataModified".
I am not sure how to address this. My feeling is that it's not worth spending more time digging into this trying to understand what exactly happened, so I want to update the truth files and move on. What do you think? Maybe this also means we should introduce this "skip" into the actual test, and not compare the full metadata file?
@galamit86 agree: let's just update the truth files for now
@dkapitan Same issue happened again - the ID field updated from 937 to 939 in the metadata.
The information is taken from the CBS catalog here. I have not found official documentation saying so, but it seems clear to me that the ID is not relevant and can be ignored for the purpose of validating the data. I plan to add a specific skip in the test, to keep the tests from failing unnecessarily all the time.
Let me know if you have an objection or a different perspective.
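For concreteness, the skip could look roughly like this when comparing the downloaded metadata against the truth file (a sketch; the helper name is made up):

```python
IGNORED_METADATA_FIELDS = {"ID"}  # churns in the CBS catalog without a real content change


def assert_metadata_equal(truth: dict, fresh: dict) -> None:
    """Compare two metadata dicts, ignoring fields known to change spuriously."""
    filtered_truth = {k: v for k, v in truth.items() if k not in IGNORED_METADATA_FIELDS}
    filtered_fresh = {k: v for k, v in fresh.items() if k not in IGNORED_METADATA_FIELDS}
    assert filtered_truth == filtered_fresh
```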
No objection
I propose we aim for a first 1.0 version which is released on pypi and aims for users to use the CLI, resulting either in just parquet files (using the pagination feature) or full-on GBQ.
As good practice (and good exercise), implement unit testing using pytest, following the guidelines from Real Python.
Scope:
- pytest-cov, aim for 90%+ code coverage
- master gets pushed to pypi
Starting point: