UDST / orca_test

Data assertions for the Orca task orchestrator
BSD 3-Clause "New" or "Revised" License

YAML syntax and additional assertions on value consistency? #2

Open semcogli opened 7 years ago

semcogli commented 7 years ago

First, this project is exactly what we have been looking for to help with our model input development. We developed a similar tool in-house during our last forecast, and we had planned to extend it with additional functionality in the coming months. But I can see that this tool is a well-suited and better-structured substitute for our old tool, so I will definitely try to integrate it into our work this time.

Here are my thoughts about the features so far. I would like to see the YAML syntax implemented at your earliest convenience :) Our old database checklist of tables and columns is quite extensive, so I used a CSV file to store the tables, columns and expected 'assertions'. I can see that converting that file to YAML will be much less time-consuming (and probably less error-prone) than converting it to the current dictionary-based syntax.
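To make the CSV idea concrete, here is a rough sketch of how such a checklist could be grouped into a nested structure that maps naturally onto YAML. The CSV layout (the `table`, `column`, `assertion` and `value` headers) and the `csv_to_spec` helper are hypothetical, not part of orca_test, and the values are kept as plain strings.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV layout: one row per (table, column, assertion) triple.
SAMPLE_CSV = """table,column,assertion,value
parcels,parcel_id,primary_key,True
parcels,large_area_id,foreign_key,large_areas.large_area_id
buildings,building_id,primary_key,True
"""


def csv_to_spec(csv_text):
    """Group CSV rows into {table: {column: {assertion: value}}}."""
    spec = defaultdict(lambda: defaultdict(dict))
    for row in csv.DictReader(io.StringIO(csv_text)):
        spec[row["table"]][row["column"]][row["assertion"]] = row["value"]
    # Convert to plain dicts so the result can be serialized directly.
    return {t: {c: dict(a) for c, a in cols.items()} for t, cols in spec.items()}


spec = csv_to_spec(SAMPLE_CSV)
```

A nested dict like this could then be written out with a YAML library (for example PyYAML's `yaml.safe_dump`), which is why the conversion looks cheaper than hand-writing the dictionary syntax.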

I would also like to see more assertion tests provided. A couple of things: 'foreign key' works well in my test, but what if the target table is multi-indexed? Will the tool automatically pick the proper index level? Also, what if the target 'foreign key' column is not indexed? We do find situations where we just need to verify the consistency of two columns, regardless of whether they are indexed or not. So, to put it simply, an expansion of the 'foreign key' test to unindexed columns would be useful.
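A minimal sketch of the kind of check I mean, written against plain pandas DataFrames rather than Orca tables. The `find_orphans` helper and the sample tables are hypothetical; the target key may be an ordinary column, a single index, or one level of a MultiIndex.

```python
import pandas as pd


def find_orphans(child, child_col, parent, parent_key):
    """Return non-missing values of child[child_col] absent from the parent key.

    parent_key may name an ordinary column, a single index, or one level
    of a MultiIndex on the parent DataFrame.
    """
    if parent_key in parent.columns:
        targets = parent[parent_key]
    elif parent_key in parent.index.names:
        targets = parent.index.get_level_values(parent_key)
    else:
        raise KeyError("'%s' is neither a column nor an index level" % parent_key)
    values = child[child_col].dropna()
    return sorted(set(values[~values.isin(targets)]))


# Hypothetical data: the target key is one level of a MultiIndex ...
controls = pd.DataFrame(
    {"total": [10, 20, 30]},
    index=pd.MultiIndex.from_tuples(
        [(2020, 3, 1), (2020, 5, 1), (2021, 3, 2)],
        names=["year", "large_area_id", "sector_id"],
    ),
)
# ... or an ordinary, unindexed column.
zones = pd.DataFrame({"large_area_id": [3, 5]})
parcels = pd.DataFrame({"large_area_id": [3, 5, 7]})

orphans_vs_level = find_orphans(parcels, "large_area_id", controls, "large_area_id")
orphans_vs_column = find_orphans(parcels, "large_area_id", zones, "large_area_id")
```

Looking the key up by name means the caller does not need to know whether the target is indexed, which is the situation described above.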

Also, are there any assertions on an expected value or list of values? For example, we have large_area_ids in our parcels table. Can I test that those ids match a predefined list?
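Something along these lines would cover our case. This is only a sketch: `compare_to_expected` and the id values are made up for illustration, and it assumes the values are non-missing and mutually comparable.

```python
def compare_to_expected(values, expected):
    """Return (unexpected, missing) ids relative to a predefined list."""
    observed = set(values)
    allowed = set(expected)
    unexpected = sorted(observed - allowed)  # present in the data, not in the list
    missing = sorted(allowed - observed)     # in the list, never seen in the data
    return unexpected, missing


EXPECTED_LARGE_AREAS = [3, 5, 93, 99]        # hypothetical ids
parcel_large_area_ids = [3, 3, 5, 99, 125]   # e.g. the values of a parcels column

unexpected, missing = compare_to_expected(parcel_large_area_ids, EXPECTED_LARGE_AREAS)
```

Reporting both directions lets the user decide whether "matching" means "only values from the list" or "exactly the values in the list".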

Thanks.

semcogli commented 7 years ago

I am adding a request for a MultiIndex test. I think this is definitely needed.

Our database has tables like "annual_employment_control_totals", which have a multi-level index of "year", "large_area_id" and "sector_id". The primary key test does not work in this situation. A simple solution is to add a multi-index assertion. Some sample code follows (it does not check the order of the index levels, but is probably good enough).

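Code snippets for the test:

    # Fragment from the spec-processing code: collect the columns that are
    # flagged as part of the multi-index, then test them together.
    multiindex_cols = []
    if (k, v) == ('multiindex', True):
        # NOTE: k is the property name here; the column's own name is
        # presumably what should be appended.
        multiindex_cols.append(k)

    if len(multiindex_cols) > 0:
        assert_columns_are_multiindex(table_name, multiindex_cols)


    import pandas as pd  # needed for pd.isnull below

    def assert_columns_are_multiindex(table_name, multiindex_cols):
        """ doc string here """

        # The index level names must match the expected columns (order ignored)
        try:
            idx = orca.get_table(table_name).index
            assert set(idx.names) == set(multiindex_cols)
        except Exception:
            msg = "Columns %s are not set as the index of table '%s'" \
                    % (multiindex_cols, table_name)
            raise OrcaAssertionError(msg)

        # Each combination of index values must be unique
        try:
            assert len(idx.unique()) == len(idx)
        except AssertionError:
            msg = "Columns %s are the index of table '%s' but its values are not unique" \
                    % (multiindex_cols, table_name)
            raise OrcaAssertionError(msg)

        # pd.isnull() is not defined for a whole MultiIndex, so check each level
        try:
            for level in range(idx.nlevels):
                assert not pd.isnull(idx.get_level_values(level)).any()
        except AssertionError:
            msg = "Columns %s are the index of table '%s' but it contains missing values" \
                    % (multiindex_cols, table_name)
            raise OrcaAssertionError(msg)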

smmaurer commented 7 years ago

Thank you, this is good feedback! I agree that the YAML syntax will be helpful. I don't think we have any code for that yet, but it should be fairly straightforward to implement. We might be able to borrow code from the UrbanSim functions for working with yaml-based settings and model specs.

Regarding the indexes, we should probably come up with a unified approach for how Orca_test treats them. Here are some potential cases, to get us started:

  1. Column is an index of underlying DataFrame
  2. Column is an index, plus its values are unique and non-missing
  3. Column's values correspond to index of another table
  4. Column's values correspond to index of another table, and are non-missing
  5. Columns are a multi-index of underlying DataFrame
  6. Additional multi-index uniqueness and missing-ness cases?
  7. Others?

Currently, the primary_key spec represents the 2nd case, and the foreign_key spec represents the 4th case. How many permutations do we want to handle with Orca_test?

Some criteria might be: (a) it's a plausible and intended use of Orca, and (b) any missing piece would potentially break model step logic.

What do you think? I'll have to read up a bit on Orca and on DataFrame indexes to get a better idea of what the plausible and intended use cases are. Let's leave this issue open and use it for discussion of how we want to handle this.

semcogli commented 7 years ago

I strongly agree with the two criteria you proposed. Since the test is intended to work with UrbanSim code, it should follow the standards and expectations of the model.

An additional index check could be that column values correspond to the multi-index of another table, plus the uniqueness and missing-value checks, and so on. So we may end up with many more tests. I am wondering whether we can simplify things by focusing on index and value-index combinations only, while letting the user choose the additional uniqueness and non-missing tests as options of the index test. What do you think?
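To illustrate the "one index test with options" idea, here is a rough sketch on plain pandas DataFrames. The `index_problems` function and its `unique` / `no_missing` flags are hypothetical names, not existing orca_test features, and unlike my earlier snippet this version also checks the order of the index levels.

```python
import pandas as pd


def index_problems(df, index_names, unique=False, no_missing=False):
    """Return a list of problems found with df's index (empty if none)."""
    problems = []
    idx = df.index
    if list(idx.names) != list(index_names):
        problems.append("index is %s, expected %s"
                        % (list(idx.names), list(index_names)))
        return problems  # the remaining checks assume the right index
    if unique and idx.has_duplicates:
        problems.append("index values are not unique")
    if no_missing:
        # Works for a single index (one level) and for a MultiIndex alike.
        for level in range(idx.nlevels):
            if pd.isnull(idx.get_level_values(level)).any():
                problems.append("missing values in index level '%s'"
                                % idx.names[level])
    return problems


# Hypothetical control-totals table with a duplicated index entry.
names = ["year", "large_area_id", "sector_id"]
controls = pd.DataFrame(
    {"total": [10, 20, 30]},
    index=pd.MultiIndex.from_tuples(
        [(2020, 3, 1), (2020, 3, 1), (2021, 5, 2)], names=names
    ),
)

basic = index_problems(controls, names)
strict = index_problems(controls, names, unique=True, no_missing=True)
```

With the options off, only the index itself is checked; turning them on adds the uniqueness and non-missing tests without needing a separate assertion for every permutation.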