Dlux804 / McQuade-Chem-ML

Development of easy to use and reproducible ML scripts for chemistry.

Fix param grid to be exported #54

Closed andreshyer closed 4 years ago

andreshyer commented 4 years ago

Describe the bug: In the development branch, the param grid, as well as other dict-type objects, cannot be exported to a JSON object.

To Reproduce: Run lines 173-185 in models.py and follow the code leading to storage.py.

Proposed solution: Export the dict objects (param_grid, params, etc.) as their own files, or use another format to export and save the data. Perhaps pickle objects would be useful?

Dlux804 commented 4 years ago

I recall that dicts can be exported. The issue with the param_grid is that the skopt framework uses its own data types, which are not JSON compatible.

andreshyer commented 4 years ago

If dict objects can be exported, would it be possible to force the param_grid to become a dict? Would that break other parts of the code?

Dlux804 commented 4 years ago

I think it already is, more or less, a dict. I think the issue is that the values inside the dict are unusual types, like Integer() ranges instead of plain int values. For example:

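from skopt.space import Categorical, Integer, Real

# Search space for Bayesian optimization; the values are skopt dimension objects.
bayes_grid = {
    'kernel': Categorical(['rbf', 'poly', 'linear']),
    'C': Real(10 ** -3, 10 ** 2, 'log-uniform'),
    'gamma': Real(10 ** -3, 10 ** 0, 'log-uniform'),
    'epsilon': Real(0.1, 0.6),
    'degree': Integer(1, 5)
}
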
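
One way to make a grid like this exportable, sketched here under assumptions rather than taken from the repository, is to convert each search dimension into a plain dict before handing it to `json`. The helper names `dimension_to_plain` and `grid_to_json` are hypothetical. The conversion relies only on the `low`, `high`, `prior`, and `categories` attributes that skopt dimensions expose, and the stand-in classes exist only so the sketch runs without skopt installed.

```python
import json


def dimension_to_plain(value):
    """Turn a search-space dimension into JSON-friendly data (hypothetical helper)."""
    if hasattr(value, "categories"):  # Categorical-like object
        return {"type": type(value).__name__, "categories": list(value.categories)}
    if hasattr(value, "low") and hasattr(value, "high"):  # Integer- or Real-like object
        return {
            "type": type(value).__name__,
            "low": value.low,
            "high": value.high,
            "prior": getattr(value, "prior", None),
        }
    return value  # already a plain value, leave it untouched


def grid_to_json(grid):
    """Serialize a whole parameter grid after converting every value."""
    return json.dumps({name: dimension_to_plain(dim) for name, dim in grid.items()})


# Stand-in classes (assumptions) so the sketch runs without skopt installed.
class Integer:
    def __init__(self, low, high, prior="uniform"):
        self.low, self.high, self.prior = low, high, prior


class Categorical:
    def __init__(self, categories):
        self.categories = tuple(categories)


example_grid = {"degree": Integer(1, 5), "kernel": Categorical(["rbf", "poly", "linear"])}
exported = grid_to_json(example_grid)
```

Because the result records the dimension type alongside its bounds, the grid could in principle be rebuilt from the JSON file later, which a plain string dump would not allow.
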
andreshyer commented 4 years ago

Yeah, I noticed that the dict was not a normal dict. I added a little bit of code to try to debug it:

```python
# Debugging excerpt: d, objs, dfs, and self come from the surrounding export method.
for k, v in tqdm(d.items(), desc="Export to JSON", position=0):
    print(k, type(v))

    # DataFrames and Series are written with pandas' own JSON exporter.
    if isinstance(v, (pd.DataFrame, pd.Series)):
        objs.append(k)
        dfs.append(k)
        getattr(self, k).to_json(path_or_buf=self.run_name + '_' + k + '.json')

    # Dicts get their own file; report any that json cannot serialize.
    if isinstance(v, dict):
        try:
            print(k, v)
            with open(self.run_name + '_' + k + '.json', 'w') as f:
                json.dump(dict(v), f)  # write to the file rather than discarding the string
        except TypeError:
            print(f'FAIL {k} : {v}')
        objs.append(k)

    if not isinstance(v, (int, float, tuple, list, np.ndarray, bool, str, type(None))):
        objs.append(k)
```

And the following output comes from this:

param_grid {'n_estimators': Integer(low=100, high=2000), 'max_features': Categorical(categories=('auto', 'sqrt'), prior=None), 'max_depth': Integer(low=1, high=30), 'min_samples_split': Integer(low=2, high=30), 'min_samples_leaf': Integer(low=2, high=30), 'bootstrap': Categorical(categories=(True, False), prior=None)}

FAIL param_grid : {'n_estimators': Integer(low=100, high=2000), 'max_features': Categorical(categories=('auto', 'sqrt'), prior=None), 'max_depth': Integer(low=1, high=30), 'min_samples_split': Integer(low=2, high=30), 'min_samples_leaf': Integer(low=2, high=30), 'bootstrap': Categorical(categories=(True, False), prior=None)}

params {'bootstrap': True, 'max_depth': 30, 'max_features': 'auto', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 100}

predictions_stats {'r2_raw': array([0.8920112 , 0.89579143, 0.89603234, 0.89066064, 0.8926571 ]), 'r2_avg': 0.8934305428521252, 'r2_std': 0.002127357798190582, 'mse_raw': array([0.47072868, 0.45425046, 0.45320033, 0.47661584, 0.46791318]), 'mse_avg': 0.4645416995678035, 'mse_std': 0.009273261153887694, 'rmse_raw': array([0.6860967 , 0.67398105, 0.67320155, 0.6903737 , 0.6840418 ]), 'rmse_avg': 0.6815389605477208, 'rmse_std': 0.0068077032350010724, 'time_raw': array([2.16756701, 2.35749483, 2.13454795, 2.11755943, 2.14869523]), 'time_avg': 2.1851728916168214, 'time_std': 0.08771533743409417}

FAIL predictions_stats : {'r2_raw': array([0.8920112 , 0.89579143, 0.89603234, 0.89066064, 0.8926571 ]), 'r2_avg': 0.8934305428521252, 'r2_std': 0.002127357798190582, 'mse_raw': array([0.47072868, 0.45425046, 0.45320033, 0.47661584, 0.46791318]), 'mse_avg': 0.4645416995678035, 'mse_std': 0.009273261153887694, 'rmse_raw': array([0.6860967 , 0.67398105, 0.67320155, 0.6903737 , 0.6840418 ]), 'rmse_avg': 0.6815389605477208, 'rmse_std': 0.0068077032350010724, 'time_raw': array([2.16756701, 2.35749483, 2.13454795, 2.11755943, 2.14869523]), 'time_avg': 2.1851728916168214, 'time_std': 0.08771533743409417}

It is failing on param_grid, params, and predictions_stats, which all have a weird format in the dicts.
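
The predictions_stats failure is consistent with the NumPy arrays among its values, which the standard `json` module cannot encode on its own. A minimal sketch of a fallback, assuming a hypothetical helper `to_jsonable` passed as the `default` hook of `json.dumps`: objects with a `tolist()` method (as NumPy arrays have) become lists, and anything else is stored as its `repr` string. `FakeArray` is a stand-in so the example runs without NumPy.

```python
import json


def to_jsonable(obj):
    """Fallback for json.dumps, called only for objects json cannot encode itself."""
    if hasattr(obj, "tolist"):  # NumPy arrays and scalars provide tolist()
        return obj.tolist()
    return repr(obj)  # last resort: keep a readable string, e.g. for skopt dimensions


class FakeArray:
    """Stand-in for a NumPy array (assumption, avoids the NumPy dependency here)."""

    def __init__(self, values):
        self._values = list(values)

    def tolist(self):
        return list(self._values)


stats = {"r2_raw": FakeArray([0.89, 0.90]), "r2_avg": 0.895}
text = json.dumps(stats, default=to_jsonable)
```

The `repr` fallback is lossy: the file can be read by a person, but the original object cannot be rebuilt from it automatically.
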

andreshyer commented 4 years ago

I do have a question. I see you are using pickle objects as a checkpoint between calculating features and hyperparameter tuning. Could we use pickle objects to store this data?
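
As a rough sketch of the pickle idea: the standard `pickle` module can round-trip a dict whose values `json` rejects, provided the value types are importable when the file is loaded. The file name is an assumption for illustration, and `range` and `set` stand in for the skopt dimension objects.

```python
import os
import pickle
import tempfile

# range and set are not JSON-serializable, much like skopt dimension objects.
param_grid = {"max_depth": range(1, 31), "max_features": {"auto", "sqrt"}}

# Write the dict to a temporary file, then read it back.
path = os.path.join(tempfile.mkdtemp(), "param_grid.pkl")  # hypothetical file name
with open(path, "wb") as f:
    pickle.dump(param_grid, f)

with open(path, "rb") as f:
    restored = pickle.load(f)
```

The trade-off is that pickle files are Python-specific and not human-readable, and they should only be loaded from trusted sources.
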

andreshyer commented 4 years ago

You are right, the other dicts are saving to JSON files just fine after being passed through dict(). Strange that param_grid is misbehaving. How is it being generated?

Dlux804 commented 4 years ago

param_grid is generated in grid.py. Manual entry.