ThomasBury / arfs

All Relevant Feature Selection
MIT License

numpy.random.mtrand.RandomState.shuffle ValueError: array is read-only #48

Closed jmrichardson closed 1 month ago

jmrichardson commented 1 month ago

Hi,

I am testing GrootCV and got the following error:

 Cross Validation:   0%|          | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\Anaconda3\envs\mld\lib\site-packages\IPython\core\interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-8-2bb4632a1898>", line 1, in <module>
    feat_selector.fit(X, y, sample_weight=None)
  File "D:\Anaconda3\envs\mld\lib\site-packages\arfs\feature_selection\allrelevant.py", line 2076, in fit
    self.selected_features_, self.cv_df, self.sha_cutoff = _reduce_vars_lgb_cv(
  File "D:\Anaconda3\envs\mld\lib\site-packages\arfs\feature_selection\allrelevant.py", line 2306, in _reduce_vars_lgb_cv
    new_x_tr, shadow_names = _create_shadow(X_train)
  File "D:\Anaconda3\envs\mld\lib\site-packages\arfs\feature_selection\allrelevant.py", line 1695, in _create_shadow
    np.random.shuffle(X_shadow[c].values)
  File "numpy\\random\\mtrand.pyx", line 4594, in numpy.random.mtrand.RandomState.shuffle
ValueError: array is read-only

Here is the respective code:

ts_cv = TimeSeriesSplit(
    n_splits=3,
    gap=5,
)

feat_selector = GrootCV(
    objective="mse",
    n_folds=3,
    folds=ts_cv,
    n_iter=2,
    silent=True,
    fastshap=False,
    n_jobs=4,
)
feat_selector.fit(X, y, sample_weight=None)

In allrelevant.py line 1696, I changed

np.random.shuffle(X_shadow[c].values)

to

X_shadow[c] = np.random.permutation(X_shadow[c].values)

It seems to work now. Hoping you could have a look.

Thanks!

ThomasBury commented 1 month ago

Hello @jmrichardson, could you print out the version of numpy and arfs you are using?

import numpy as np
import arfs
print(f"numpy {np.__version__} and ARFS {arfs.__version__}")

As the error says, the array is read-only. It might be due to how you instantiate X and y. A simple solution is to copy your array or change the numpy writeable flag. Everything should be fine if you use a pandas DataFrame.
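A minimal sketch of the two workarounds mentioned above, using a plain NumPy array made read-only on purpose. The variable names are illustrative only and this is not how ARFS handles its inputs internally:

```python
import numpy as np

# Build a read-only array to mimic the situation in the traceback.
values = np.arange(5.0)
values.flags.writeable = False

# An in-place shuffle needs write access, so this raises ValueError.
try:
    np.random.shuffle(values)
    shuffle_failed = False
except ValueError:
    shuffle_failed = True

# Workaround 1: shuffle a copy. A fresh copy owns its data and is writeable.
writable_copy = values.copy()
np.random.shuffle(writable_copy)

# Workaround 2: flip the writeable flag back on, then shuffle in place.
# This only works when the array owns its memory (as it does here).
values.flags.writeable = True
np.random.shuffle(values)
```

Flipping the flag is cheaper than copying, but it can fail with its own ValueError when the array is a view of memory it does not own, in which case copying is the safer route.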

Are you able to run the time series tutorial? It runs fine with numpy 1.26.4, numpy 2.0.1 and ARFS 2.3.0.

I prefer not to change shuffle to permutation, as permutation creates a copy of the numpy array. The read-only problem can be solved upstream, in how X, y and w are instantiated.
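To illustrate the trade-off described here, the sketch below compares the two calls on a read-only array. It only shows NumPy's behavior and says nothing about which choice is right for ARFS:

```python
import numpy as np

source = np.arange(6)
source.flags.writeable = False  # read-only input, as in the reported error

# permutation() returns a new shuffled array and leaves the input untouched,
# so it works on read-only data at the cost of one extra allocation.
permuted = np.random.permutation(source)
shares_memory = np.shares_memory(permuted, source)

# shuffle() reorders the array in place and allocates nothing new,
# which is why it needs a writeable array.
in_place = np.arange(6)
np.random.shuffle(in_place)
```

For a wide training set the extra allocation from `permutation` happens once per shadow column, which is the memory cost the maintainer wants to avoid.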

Let me know if that works. Thanks for reaching out.

jmrichardson commented 1 month ago

Hi @ThomasBury ,

Thank you for the fast reply!

import arfs
print(f"numpy {np.__version__} and ARFS {arfs.__version__}")
numpy 1.26.4 and ARFS 2.3.0

It fails on the tutorial. I just pasted the tutorial below in my python terminal and got the same error:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_validate
from sklearn.model_selection import TimeSeriesSplit
from arfs.benchmark import highlight_tick
from arfs.feature_selection.allrelevant import GrootCV
bike_sharing = fetch_openml("Bike_Sharing_Demand", version=2, as_frame=True)
df = bike_sharing.frame
y = df["count"] #/ df["count"].max()
X = df.drop("count", axis="columns")
X["weather"] = (
    X["weather"]
    .astype(object)
    .replace(to_replace="heavy_rain", value="rain")
    .astype("category")
)
ts_cv = TimeSeriesSplit(
    n_splits=5,
    gap=48,
    max_train_size=10000,
    test_size=1000,
)
feat_selector = GrootCV(
    objective="poisson",
    cutoff=1,
    n_folds=5,
    folds=ts_cv,
    n_iter=5,
    silent=True,
    fastshap=False,
    n_jobs=0,
)
feat_selector.fit(X, y, sample_weight=None)
Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:27:34) [MSC v.1937 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.20.0 -- An enhanced Interactive Python. Type '?' for help.
PyDev console: using IPython 8.20.0
Cross Validation:   0%|          | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\Anaconda3\envs\mld\lib\site-packages\IPython\core\interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-1-9ec4508f8aff>", line 39, in <module>
    feat_selector.fit(X, y, sample_weight=None)
  File "D:\Anaconda3\envs\mld\lib\site-packages\arfs\feature_selection\allrelevant.py", line 2077, in fit
    self.selected_features_, self.cv_df, self.sha_cutoff = _reduce_vars_lgb_cv(
  File "D:\Anaconda3\envs\mld\lib\site-packages\arfs\feature_selection\allrelevant.py", line 2307, in _reduce_vars_lgb_cv
    new_x_tr, shadow_names = _create_shadow(X_train)
  File "D:\Anaconda3\envs\mld\lib\site-packages\arfs\feature_selection\allrelevant.py", line 1696, in _create_shadow
    np.random.shuffle(X_shadow[c].values)
  File "numpy\\random\\mtrand.pyx", line 4594, in numpy.random.mtrand.RandomState.shuffle
ValueError: array is read-only

My X and y are a pandas DataFrame and Series respectively. I've added a .copy() to both X and y and got the same error:

feat_selector.fit(X.copy(), y.copy(), sample_weight=None)

I'm not sure what difference between our environments could cause the issue.
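One way to compare environments is to check whether the arrays pandas hands out are writeable at all. The helper below, `find_readonly_columns`, is a hypothetical name and not part of ARFS or pandas. It is a diagnostic sketch only. One possible cause, not confirmed anywhere in this thread, is pandas Copy-on-Write mode, under which arrays obtained from a DataFrame can be read-only even after `.copy()`.

```python
import pandas as pd


def find_readonly_columns(df: pd.DataFrame) -> list:
    """Return the columns whose NumPy representation is not writeable.

    Hypothetical helper for diagnosis. It uses to_numpy(), which matches
    .values for plain numeric columns but may differ for categoricals.
    """
    readonly = []
    for col in df.columns:
        arr = df[col].to_numpy()
        if not arr.flags.writeable:
            readonly.append(col)
    return readonly


# Small stand-in frame; in practice pass the X given to fit().
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4, 5, 6]})
report = find_readonly_columns(demo)
print(f"pandas {pd.__version__}, read-only columns: {report}")
```

Running this in both the failing and the working environment, together with the pandas version, would show whether the read-only arrays come from pandas itself rather than from how X and y were built.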

ThomasBury commented 1 month ago

Alright, we can try two things:

First, create a fresh environment, then run the tutorial using that Python kernel.

If it still fails, try changing the numpy flag (see the link in my previous message).

If neither works, I'll need to investigate further. I just tested on two different laptops with fresh environments and it works fine (Linux and Windows, numpy 1.26 and 2.0.1).

🤞

jmrichardson commented 1 month ago

Hi, creating a new environment did work. I tested both numpy 1.26 and 2.0.1 on my Windows PC with no issue. There must be something else in my other environment that is conflicting. No worries, I will just create a fork, make the changes I need, and hopefully have more time later to pinpoint the issue. Thanks for your help :)