Quantco / datajudge

Assessing whether data from database complies with reference information.
https://datajudge.readthedocs.io
BSD 3-Clause "New" or "Revised" License
42 stars 3 forks source link

Way to provide custom name for `get_description()` #6

Closed jonashaag closed 1 year ago

jonashaag commented 2 years ago

Currently it's not obvious how to select a single constraint when using pytest -k. The pytest test names are taken from get_description() which may be something like ...table1|table2. Problems with that:

Idea: Provide a way to add your own name to each of the requirements/constraints?

ivergara commented 2 years ago

It's great to know that the -k argument works with the supplied id of a parametrization.

I'm in favor of adding some extra fields to Requirement and Constraint classes, like a name and a description.

YYYasin19 commented 2 years ago

I think we may need this for two reasons:

  1. filtering with pytest's -k
  2. better readability, especially with a large test/requirement suite

An easy solution would be to provide a metadata attribute to the Requirement object that allows the user to pass some custom identfier / explaning argument. Example:

req = BetweenRequirement.from_tables(...)
req.add_description("test_identifier")
req.add_uniques_equality_constraint(columns1, columns2)

Filtering

Another use case might be to group tests together like this

req = BetweenRequirement.from_tables(...)
req.add_description("Customer Validation") # what happens to whitespaces?
req.add_uniques_equality_constraint(columns1, columns2)

req2 = WithinRequirement.from_table(...)
req2.add_description("Customer Validation")
req2.add_uniqueness_constraint(columns)

and then pytest -k "'Customer Validation'"

The latter could maybe also be achieved by adding a mark attribute to the Requirement or Constraint and then adding that mark programmatically to the pytest constraints by using a pytest hook (e.g. pytest_collection_modifyitems) (Haven't tried this yet). Then, the user can just go

# still don't know about case or whitespace
pytest -m customer_validation 

Better Understandability

This might be just constraint-specific stuff. Example: For the Uniqueness Constraint, I would love to see in the output which columns were tested.

kklein commented 2 years ago

better readability, especially with a large test/requirement suite

I'm not exactly sure I understand what you mean. Especially with a very large test suite - with say >1000 tests - how do custom names, help with readability?

The latter could maybe also be achieved by adding a mark attribute

Exactly! This is done at QuantCo on several projects. This is also elaborated on in this article[0] of our documentation.

This might be just constraint-specific stuff. Example: For the Uniqueness Constraint, I would love to see in the output which columns were tested.

I would think that this kind of information could be conveyed through the assertion text. In the particular example you give I'm surprised that the columns aren't already part of the assertion text, see this piece of code[1]. If you have found a counterexample/bug, please feel free to open a separate issue!

[0] https://datajudge.readthedocs.io/en/latest/testing.html [1] https://github.com/Quantco/datajudge/blob/main/src/datajudge/constraints/miscs.py#L92-L95

YYYasin19 commented 2 years ago

Especially with a very large test suite - with say >1000 tests - how do custom names, help with readability?

Custom names might help because I might be using the same constraint for different cases. An example: I have 5 dumps with ~30 tables each and there are 10 distinct tests for each table pair from old -> new dump (for the BetweenRequirements) I might be using Uniqueness to validate something and then also validate primary keys for all of my tables as well while I'm there. This will cause my list to have 300 (= 5 30 2) test entries that all start with Uniqueness::{dump_name}.... If I could have something like a metadata field that is printed along, half of those might be called UniqueColumn and the other half maybe PrimaryKeyValidation etc.

I think the main points here are:

already part of the assertion text

Good point! I think we rather focused on having such information in the test name, but that might be an edge case for datajudge.