amjams / FeatBoost

Boosted Iterative Input Selection

Errors thrown when executing FeatBoost on 6-feature dataset #1

dunnkers opened this issue 4 years ago

dunnkers commented 4 years ago

The input is a 6-feature dataset, found here. FeatBoost is executed using the following setup:

    # Setup estimator
    xgboost_ensemble = XGBClassifier(
        max_depth=3, learning_rate=0.1, n_estimators=200, silent=True,
        objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None,
        gamma=0, min_child_weight=1, max_delta_step=0, subsample=1,
        colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1,
        scale_pos_weight=1, base_score=0.5, random_state=0, seed=None,
        missing=None)

    # Setup FS method
    fs = FeatBoostClassification(
        estimator=[xgboost_ensemble, xgboost_ensemble, xgboost_ensemble],
        number_of_folds=10,
        siso_ranking_size=8,
        max_number_of_features=100,
        siso_order=4,
        epsilon=1e-18,
        verbose=2)

    # Run Feature Selection
    fs.fit(X, y)

(This is exactly the same setup as in test.py.)

  1. First, with verbose=2, an error is thrown in a print statement.

Full error log

```shell
(venv) ➜ feature-selection git:(master) ✗ env DEBUGPY_LAUNCHER_PORT=53859 /Users/dunnkers/git/feature-selection/venv/bin/python /Users/dunnkers/.vscode/extensions/ms-python.python-2020.4.76186/pythonFiles/lib/python/debugpy/wheels/debugpy/launcher /Users/dunnkers/git/feature-selection/jobs/run-featboost.py /Users/dunnkers/git/feature-selection/data/6_bit_mutliplexer
Ranking pool [FeatBoost_XGBoost]
Running pool... [4 workers, 1 datasets]
Ranking features iteration 01
feature importances of all available feature:
x_001    3.792205
x_003    3.277614
x_004    2.644713
x_002    2.451928
x_006    2.280755
x_005    2.112983
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/Users/dunnkers/git/feature-selection/jobs/ComputePool.py", line 22, in ranking_pool
    ranking = ranking_func(X, y)
  File "/Users/dunnkers/git/feature-selection/jobs/run-featboost.py", line 25, in FeatBoost_XGBoost
    fs.fit(X, y)
  File "/Users/dunnkers/git/feature-selection/jobs/lib/feat_boost.py", line 188, in fit
    return self._fit(X, Y)
  File "/Users/dunnkers/git/feature-selection/jobs/lib/feat_boost.py", line 272, in _fit
    selected_variable,best_acc_t = self._siso(X,Y,iteration_number)
  File "/Users/dunnkers/git/feature-selection/jobs/lib/feat_boost.py", line 396, in _siso
    ranking, self.all_ranking_ = self._input_ranking(X, Y, iteration_number)
  File "/Users/dunnkers/git/feature-selection/jobs/lib/feat_boost.py", line 559, in _input_ranking
    print("%s %05f" % (self._feature_names[feature_rank[i]], feature_importance[feature_rank[i]]))
IndexError: index -7 is out of bounds for axis 0 with size 6
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/dunnkers/.vscode/extensions/ms-python.python-2020.4.76186/pythonFiles/lib/python/debugpy/wheels/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/Users/dunnkers/.vscode/extensions/ms-python.python-2020.4.76186/pythonFiles/lib/python/debugpy/wheels/debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/Users/dunnkers/.vscode/extensions/ms-python.python-2020.4.76186/pythonFiles/lib/python/debugpy/wheels/debugpy/../debugpy/server/cli.py", line 267, in run_file
    runpy.run_path(options.target, run_name=compat.force_str("__main__"))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/dunnkers/git/feature-selection/jobs/run-featboost.py", line 36, in <module>
    run_ranking_pool(FeatBoost_XGBoost)
  File "/Users/dunnkers/git/feature-selection/jobs/ComputePool.py", line 42, in run_ranking_pool
    run_pool(ranking_pool, 'ranking', ranking_func, ranking_method)
  File "/Users/dunnkers/git/feature-selection/jobs/ComputePool.py", line 99, in run_pool
    pool_results = pool.starmap(func, pool_args)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 276, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
IndexError: index -7 is out of bounds for axis 0 with size 6
```

  2. Second, with verbose=1, another error is thrown (a minimal sketch reproducing both errors follows after this log).

Full error log

```shell
(venv) ➜ feature-selection git:(master) ✗ env DEBUGPY_LAUNCHER_PORT=53886 /Users/dunnkers/git/feature-selection/venv/bin/python /Users/dunnkers/.vscode/extensions/ms-python.python-2020.4.76186/pythonFiles/lib/python/debugpy/wheels/debugpy/launcher /Users/dunnkers/git/feature-selection/jobs/run-featboost.py /Users/dunnkers/git/feature-selection/data/6_bit_mutliplexer
Ranking pool [FeatBoost_XGBoost]
Running pool... [4 workers, 1 datasets]
Ranking features iteration 01
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/Users/dunnkers/git/feature-selection/jobs/ComputePool.py", line 22, in ranking_pool
    ranking = ranking_func(X, y)
  File "/Users/dunnkers/git/feature-selection/jobs/run-featboost.py", line 25, in FeatBoost_XGBoost
    fs.fit(X, y)
  File "/Users/dunnkers/git/feature-selection/jobs/lib/feat_boost.py", line 188, in fit
    return self._fit(X, Y)
  File "/Users/dunnkers/git/feature-selection/jobs/lib/feat_boost.py", line 272, in _fit
    selected_variable,best_acc_t = self._siso(X,Y,iteration_number)
  File "/Users/dunnkers/git/feature-selection/jobs/lib/feat_boost.py", line 397, in _siso
    self.siso_ranking_[(iteration_number-1), :] = ranking
ValueError: could not broadcast input array from shape (6) into shape (8)
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/dunnkers/.vscode/extensions/ms-python.python-2020.4.76186/pythonFiles/lib/python/debugpy/wheels/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/Users/dunnkers/.vscode/extensions/ms-python.python-2020.4.76186/pythonFiles/lib/python/debugpy/wheels/debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/Users/dunnkers/.vscode/extensions/ms-python.python-2020.4.76186/pythonFiles/lib/python/debugpy/wheels/debugpy/../debugpy/server/cli.py", line 267, in run_file
    runpy.run_path(options.target, run_name=compat.force_str("__main__"))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/dunnkers/git/feature-selection/jobs/run-featboost.py", line 36, in <module>
    run_ranking_pool(FeatBoost_XGBoost)
  File "/Users/dunnkers/git/feature-selection/jobs/ComputePool.py", line 42, in run_ranking_pool
    run_pool(ranking_pool, 'ranking', ranking_func, ranking_method)
  File "/Users/dunnkers/git/feature-selection/jobs/ComputePool.py", line 99, in run_pool
    pool_results = pool.starmap(func, pool_args)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 276, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
ValueError: could not broadcast input array from shape (6) into shape (8)
```

amjams commented 4 years ago

I don't know if these are the reasons behind the errors, but consider the following. Since it is a 6-feature dataset, siso_ranking_size = 8 should be less than or equal to 6, and max_number_of_features = 100 should be as well.

Reasonable values in this case would be 1 and 6, respectively.
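
For reference, one way to keep both parameters within that constraint is to clamp them to the feature count. A sketch reusing the names from the setup above (the min(...) clamping is just a suggestion, not something FeatBoost does internally):

```python
n_features = X.shape[1]  # 6 for this dataset

fs = FeatBoostClassification(
    estimator=[xgboost_ensemble, xgboost_ensemble, xgboost_ensemble],
    number_of_folds=10,
    siso_ranking_size=min(8, n_features),         # never more than the feature count
    max_number_of_features=min(100, n_features),  # same constraint
    siso_order=4,
    epsilon=1e-18,
    verbose=2)
```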

dunnkers commented 4 years ago

That seems to explain the error: the 6-feature dataset now runs normally. I didn't know that siso_ranking_size should be <= the number of dataset features; an assertion in the code and some documentation would be nice.
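
Something like the following could serve as that guard. This is a hypothetical sketch only; the helper name and messages are made up and not part of the current FeatBoost code:

```python
def _check_parameters(X, siso_ranking_size, max_number_of_features):
    """Hypothetical validation that fit() could run before ranking features."""
    n_features = X.shape[1]
    assert siso_ranking_size <= n_features, (
        "siso_ranking_size (%d) must be <= the number of features (%d)"
        % (siso_ranking_size, n_features))
    assert max_number_of_features <= n_features, (
        "max_number_of_features (%d) must be <= the number of features (%d)"
        % (max_number_of_features, n_features))
```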

What would be reasonable values of siso_ranking_size for my tests? The number of features in the datasets ranges from 6 to 100000, so a value of 8 is probably fine for all the other datasets. I could also use a fixed value of 5, so that the same value works for all tests.

amjams commented 4 years ago

Yes, you're right, some assertions would be helpful. You could use 5, but for larger datasets it might be helpful to increase it a bit; just keep in mind how this affects your runtime. I would say a good rule of thumb is to set it to 10 for datasets with over 100 features.
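
A hypothetical helper capturing that rule of thumb (the name and thresholds are illustrative only, not part of FeatBoost):

```python
def choose_siso_ranking_size(n_features):
    """Pick a siso_ranking_size that never exceeds the feature count."""
    if n_features > 100:
        return 10              # rule of thumb for larger datasets
    return min(5, n_features)  # small datasets: at most 5, never above n_features
```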