Features
-- validating attr types in get_features_for_matching, get_features_for_blocking
-- multiple cross validation metrics
-- down sample seed
Detailed changes:
The select_matcher function can now report multiple metrics from a single run. Two arguments replace the old 'metric' argument. The first, 'metric_to_select_matcher', is the metric used to evaluate the matchers and is essentially the same as the old 'metric' argument. The second, 'metrics_to_display', is a list of metric names specifying which metric statistics are reported to the user. By default all three metrics are displayed: precision, recall, and f1. A check was added to verify that each of these arguments is a string or a list of strings drawn from 'precision', 'recall', and 'f1'. The function now returns a dictionary with three keys: 'selected_matcher' is the selected matcher, 'cv_stats' is a DataFrame with the average cross validation scores for each matcher and each metric, and 'drill_down_cv_stats' is a dictionary keyed by metric that holds the cross validation statistics for each fold.
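The argument check described above could look roughly like the following. This is a minimal sketch, not the library's code: the helper name check_metric_args, the use of ValueError, and the normalization of a single string into a list are all assumptions made for illustration.

```python
VALID_METRICS = ('precision', 'recall', 'f1')


def check_metric_args(metric_to_select_matcher, metrics_to_display=None):
    """Hypothetical helper: validate the two metric arguments.

    Returns the selection metric and the display metrics as a list.
    """
    # Assumed default: display all three metrics when none are given.
    if metrics_to_display is None:
        metrics_to_display = list(VALID_METRICS)
    # Accept a single metric name as well as a list of names.
    if isinstance(metrics_to_display, str):
        metrics_to_display = [metrics_to_display]

    if (not isinstance(metric_to_select_matcher, str)
            or metric_to_select_matcher not in VALID_METRICS):
        raise ValueError('metric_to_select_matcher must be one of %s'
                         % (VALID_METRICS,))

    for name in metrics_to_display:
        if not isinstance(name, str) or name not in VALID_METRICS:
            raise ValueError('metrics_to_display entries must be one of %s'
                             % (VALID_METRICS,))

    return metric_to_select_matcher, list(metrics_to_display)
```

Validating both arguments before any cross validation runs means a misspelled metric name fails immediately rather than after the matchers have been trained.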
A seed argument has been added to down_sample so that sampling can be made reproducible. It defaults to None. If seed is not None, the function creates a new RandomState with the user's seed; otherwise it creates a new RandomState without a seed. The choice function is now called on this RandomState object instead of on np.random, so the tuples from table A are selected according to the seed. A seed argument, also defaulting to None, was added to _probe_index to provide the same functionality when selecting tuples from table B. There, if seed is not None, the function calls random.seed with the user's seed, which sets the seed for the subsequent random.randint calls that select tuples from table B.
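The two seeding patterns described above can be sketched in isolation as follows. The function names sample_positions and pick_random_positions are hypothetical stand-ins, not the bodies of down_sample or _probe_index, and sampling without replacement is an assumption.

```python
import random

import numpy as np


def sample_positions(n_rows, size, seed=None):
    """Hypothetical stand-in for the table A step: pick row positions."""
    if seed is not None:
        rng = np.random.RandomState(seed)  # seeded, reproducible
    else:
        rng = np.random.RandomState()      # unseeded
    # choice is called on the RandomState object, not on np.random.
    # replace=False is an assumption for this sketch.
    return rng.choice(n_rows, size, replace=False)


def pick_random_positions(n_rows, count, seed=None):
    """Hypothetical stand-in for the table B step: pick row positions."""
    if seed is not None:
        # Seeds the module-level generator used by random.randint below.
        random.seed(seed)
    return [random.randint(0, n_rows - 1) for _ in range(count)]


# The same seed yields the same selection on repeated calls.
first = sample_positions(100, 5, seed=0)
second = sample_positions(100, 5, seed=0)
print(list(first) == list(second))
```

Note the difference between the two patterns: a RandomState object keeps its state local to the call, while random.seed resets the shared module-level generator, which also affects any other code that uses the random module afterwards.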
A flag was added to the functions 'get_features_for_blocking' and 'get_features_for_matching' that, when true, displays a DataFrame showing the inferred attribute correspondence and the inferred attribute types. This lets the user check the inferred information before moving on. The function then asks the user whether to continue creating the features or quit.
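The show-then-confirm flow described above might be sketched like this. Everything here is illustrative: the helper name confirm_inferred_types, its parameters, the prompt wording, and the type labels are assumptions, and plain printing stands in for the DataFrame display to keep the sketch dependency-free.

```python
def confirm_inferred_types(corres, l_types, r_types, ask=input):
    """Hypothetical helper: show inferred info, then ask whether to go on.

    corres  -- list of (left attribute, right attribute) pairs
    l_types -- dict mapping left attributes to inferred type labels
    r_types -- dict mapping right attributes to inferred type labels
    ask     -- prompt function; injectable so the flow can be tested
    """
    row = '{:<14}{:<14}{:<20}{}'
    print(row.format('left attr', 'right attr', 'left type', 'right type'))
    for l_attr, r_attr in corres:
        print(row.format(l_attr, r_attr,
                         l_types.get(l_attr, '?'),
                         r_types.get(r_attr, '?')))
    answer = ask('Continue creating features? (y/n): ')
    return answer.strip().lower() in ('y', 'yes')


# Example with placeholder type labels and a canned answer.
ok = confirm_inferred_types(
    [('name', 'name'), ('zip', 'zipcode')],
    {'name': 'short string', 'zip': 'numeric'},
    {'name': 'short string', 'zipcode': 'numeric'},
    ask=lambda prompt: 'y',
)
print(ok)
```

A caller would create the features only when the helper returns True and stop otherwise.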
A bug where the show_progress variable was not working for the overlap blocker has been fixed.
Many documentation changes were made in response to suggestions from students in CS 838, covering the user manual, the API documentation, and the Jupyter notebooks. Additionally, the documentation for the changed functions, including the API documentation, the user guide, and the notebooks, was updated to reflect the changes described above.