SenteraLLC / geoml

API to retrieve training data, create X matrix, and perform feature selection, hyperparameter tuning, training, and model testing.
MIT License

Create `Training` class that inherits from FeatureSelection #14

Closed tnigon closed 4 years ago

tnigon commented 4 years ago

Not sure if this is the best way to architect this, but I don't see any glaring reason that it wouldn't work well for our purposes.

Basically, the idea is to have a feature_data class instance for any data subset that we would like to train on. This could be satellite image data, drone imagery, weather data, management data, etc., or any combination thereof (see issue #9). The role of the feature_data instance is to provide functionality to access the desired data, join it into a cohesive dataframe, and create the X and y matrices that will be used by sklearn.

Next, we have to perform hyperparameter tuning for whichever model we decide to use. To do this, I propose we create a tuning class that inherits from feature_data - basically, it needs all of the data we just accessed and organized using feature_data to actually perform the tuning.
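
A rough sketch of the inheritance idea described above, not the actual implementation. The class layout, the method names (build_X_y, tune_regressor) and the constructor arguments are all assumptions made for illustration; only the sklearn and pandas calls are real.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV


class FeatureData:
    """Hypothetical base class: joins data sources and builds X and y."""

    def __init__(self, frames, join_cols, label_col):
        self.frames = frames              # list of DataFrames (imagery, weather, ...)
        self.join_cols = list(join_cols)  # key columns shared by every frame
        self.label_col = label_col        # name of the response variable

    def build_X_y(self):
        # Inner-join all sources into one cohesive dataframe.
        df = self.frames[0]
        for other in self.frames[1:]:
            df = df.merge(other, on=self.join_cols, how="inner")
        self.df = df.dropna()
        # Everything that is not a join key or the label is a feature.
        feature_cols = [c for c in self.df.columns
                        if c not in self.join_cols + [self.label_col]]
        self.X = self.df[feature_cols].to_numpy()
        self.y = self.df[self.label_col].to_numpy()
        return self.X, self.y


class Tuning(FeatureData):
    """Hypothetical subclass: reuses the parent's X and y for tuning."""

    def tune_regressor(self, regressor, param_grid, cv=3):
        X, y = self.build_X_y()
        # Exhaustive grid search scored by cross-validated R2.
        search = GridSearchCV(regressor, param_grid, cv=cv, scoring="r2")
        search.fit(X, y)
        return search.best_params_, search.best_score_
```

Because Tuning is a FeatureData, switching models only means calling tune_regressor again with a different regressor and param_grid; the joined data does not have to be rebuilt by hand.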

To do

  1. Create tuning.py with a class named Tuning.
  2. Determine the variables that must be passed to the tuning class on initialization (must inherit from FeatureSelection).
  3. Add some basic functions from the sip_functions spreadsheet that will be used when we get into the meat and potatoes of the tuning.

Think about how tuning results should be stored, especially with regard to training multiple different models (Lasso, PLSR, random forest, etc.). Also think about how feature selection should be handled. Is this a "built-in" for tuning, or is it perhaps an additional class object (that probably also inherits from feature_data)?

tnigon commented 4 years ago
  1. Tuning inherits from FeatureSelection. The inherited design should allow for a seamless transition among various models - e.g., run Lasso first, then run PLS (resetting regressor, regressor_params, and param_grid). df_tune_filter will include the highest scoring tuning results for each number of features and each model (sorted by "regressor", then by "feat_n").
  2. Testing covers both the Lasso and PLS regression models.
  3. We still have to implement training (I'd say in the same class) - consider renaming to 'Training'.
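
The filtering described in item 1 could be sketched with pandas roughly as follows. Only the "regressor" and "feat_n" column names come from this issue; the "score_val" column, the function name, and the rows themselves are made-up placeholders, not real tuning results.

```python
import pandas as pd

# Placeholder tuning results (values invented purely for illustration).
df_tune = pd.DataFrame({
    "regressor": ["Lasso", "Lasso", "Lasso", "PLS", "PLS", "PLS"],
    "feat_n": [1, 1, 2, 2, 1, 1],
    "score_val": [0.10, 0.12, 0.78, 0.70, 0.05, 0.09],
})


def filter_tuning_results(df_tune, score_col="score_val"):
    """Keep the best-scoring row per (regressor, feat_n) pair."""
    # idxmax returns the index label of the top score within each group.
    best_idx = df_tune.groupby(["regressor", "feat_n"])[score_col].idxmax()
    return (df_tune.loc[best_idx]
            .sort_values(["regressor", "feat_n"])
            .reset_index(drop=True))


df_tune_filter = filter_tuning_results(df_tune)
```

With the placeholder rows above, df_tune_filter ends up with four rows: one per number of features for Lasso, followed by one per number of features for PLS.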

Use

from research_tools import feature_groups
from research_tools import Tuning

my_tune = Tuning(param_dict=feature_groups.param_dict_test)
my_tune.tune_regressor(print_out_tune=True)
must be set to create README file.
Getting feature data...
Performing feature selection...
Executing hyperparameter tuning...
Number of features: 1
Lasso: R2: 0.120
Number of features: 2
Lasso: R2: 0.785
Number of features: 3
Lasso: R2: 0.805
Number of features: 4
Lasso: R2: 0.805
Number of features: 5
Lasso: R2: 0.816
Number of features: 4
Lasso: R2: 0.810
Number of features: 3
Lasso: R2: 0.808

tnigon commented 4 years ago

To do yet:

Add training functionality to the tuning class.

tnigon commented 4 years ago

Because this class performs training, it was renamed to Training. Its only function (as of now) is train(), which first executes hyperparameter tuning and saves results to df_tune, then trains the estimator and creates df_train for each number of features.

To do

Getting closer to closing this issue, and will do so when df_test_preds is added. Then add graphing/plotting as a separate issue.

tnigon commented 4 years ago

Seems to be working as intended, and have unit tests running to get full code coverage. There is not yet the ability to flag if a tuning and/or test has already been performed by this Training instance, so there is a possibility of having [almost] duplicate rows in df_tune and df_test ("uid" and index will be different).
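
Until such a flag exists, one possible workaround is to drop the near-duplicate rows after the fact by comparing every column except "uid". This is only a sketch: the helper name and the example columns are hypothetical, the rows are invented, and it assumes every compared column holds hashable values (a column of parameter dicts would first have to be converted to strings).

```python
import pandas as pd


def drop_repeat_runs(df, ignore_cols=("uid",)):
    """Drop rows that match an earlier row in every column except ignore_cols."""
    compare_cols = [c for c in df.columns if c not in ignore_cols]
    # keep="first" retains the earliest run and discards later repeats.
    return (df.drop_duplicates(subset=compare_cols, keep="first")
            .reset_index(drop=True))


# Invented rows: uid 0 and uid 1 describe the same run performed twice.
df_test_example = pd.DataFrame({
    "uid": [0, 1, 2],
    "regressor": ["Lasso", "Lasso", "PLS"],
    "feat_n": [3, 3, 3],
    "score": [0.80, 0.80, 0.79],
})
df_test_unique = drop_repeat_runs(df_test_example)
```

Note that reset_index discards the original index, which matches the observation above that the index differs between repeated runs anyway.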

Column data from df_pred can be indexed according to the "uid" column in df_test and df_test_filtered.
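
A small sketch of that lookup, assuming df_pred holds one column of predictions per test run and that each column is labeled with the run's "uid". That layout, the helper name, and the values are assumptions for illustration only.

```python
import pandas as pd

# Invented example data: two test runs and their predictions.
df_test = pd.DataFrame({
    "uid": [0, 1],
    "regressor": ["Lasso", "Lasso"],
    "feat_n": [1, 2],
})
df_pred = pd.DataFrame({
    0: [1.0, 2.0, 3.0],   # predictions for uid 0
    1: [1.5, 2.5, 3.5],   # predictions for uid 1
})


def get_predictions(df_test, df_pred, regressor, feat_n):
    """Look up the uid for a given model and feature count, then pull its column."""
    mask = (df_test["regressor"] == regressor) & (df_test["feat_n"] == feat_n)
    uid = df_test.loc[mask, "uid"].iloc[0]
    return df_pred[uid]


preds = get_predictions(df_test, df_pred, "Lasso", 2)
```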

Use:

from research_tools import feature_groups
from research_tools import Training

my_train = Training(param_dict=feature_groups.param_dict_test, print_out=False)
my_train.train()
must be set to create README file.
Getting feature data...
Performing feature selection...
Executing hyperparameter tuning and estimator training...