IIIS-Li-Group / OpenFE

OpenFE: automated feature generation with expert-level performance
MIT License
781 stars 99 forks

pyarrow.lib.ArrowInvalid: Field named <column_name> is not found #51

Open bencoldham opened 5 months ago

bencoldham commented 5 months ago

The following code produced an error on the transform function. The fit function works correctly. This error is reproduced for every feature in the original dataset.

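from openfe import OpenFE, transform

# df, test, and cat_cols are defined earlier in the reporter's notebook.
X = df.drop(["Target"], axis=1)
y = df["Target"]

ofe = OpenFE()
ofe.fit(data=X, label=y, categorical_features=cat_cols, n_jobs=11)

train_x, test_x = transform(X, test, ofe.new_features_list[:50], n_jobs=11)
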
Traceback (most recent call last):
  File "/home/ben/projects/kaggle/.venv/lib/python3.10/site-packages/openfe/utils.py", line 102, in _cal
    _data = pd.read_feather('./openfe_tmp_data.feather', columns=base_features).set_index('openfe_index')
  File "/home/ben/projects/kaggle/.venv/lib/python3.10/site-packages/pandas/io/feather_format.py", line 124, in read_feather
    return feather.read_feather(
  File "/home/ben/projects/kaggle/.venv/lib/python3.10/site-packages/pyarrow/feather.py", line 226, in read_feather
    return (read_table(
  File "/home/ben/projects/kaggle/.venv/lib/python3.10/site-packages/pyarrow/feather.py", line 262, in read_table
    table = reader.read_names(columns)
  File "pyarrow/_feather.pyx", line 114, in pyarrow._feather.FeatherReader.read_names
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Field named evaluations is not found

ZhangTP1996 commented 5 months ago

Could you please provide a minimal reproducible example with code and data so that I can fix it?

scar-fx commented 4 months ago

> Could you please provide a minimal reproducible example with code and data so that I can fix it?

I have the same issue. Here is how my data looks:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76368 entries, 0 to 76367
Data columns (total 36 columns):
 #   Column                                          Non-Null Count  Dtype
---  ------                                          --------------  -----
 0   Marital status                                  76368 non-null  int64
 1   Application mode                                76368 non-null  int64
 2   Application order                               76368 non-null  int64
 3   Course                                          76368 non-null  int64
 4   Daytime/evening attendance                      76368 non-null  int64
 5   Previous qualification                          76368 non-null  int64
 6   Previous qualification (grade)                  76368 non-null  float64
 7   Nacionality                                     76368 non-null  int64
 8   Mother's qualification                          76368 non-null  int64
 9   Father's qualification                          76368 non-null  int64
10   Mother's occupation                             76368 non-null  int64
11   Father's occupation                             76368 non-null  int64
12   Admission grade                                 76368 non-null  float64
13   Displaced                                       76368 non-null  int64
14   Educational special needs                       76368 non-null  int64
15   Debtor                                          76368 non-null  int64
16   Tuition fees up to date                         76368 non-null  int64
17   Gender                                          76368 non-null  int64
18   Scholarship holder                              76368 non-null  int64
19   Age at enrollment                               76368 non-null  int64
20   International                                   76368 non-null  int64
21   Curricular units 1st sem (credited)             76368 non-null  int64
22   Curricular units 1st sem (enrolled)             76368 non-null  int64
23   Curricular units 1st sem (evaluations)          76368 non-null  int64
24   Curricular units 1st sem (approved)             76368 non-null  int64
25   Curricular units 1st sem (grade)                76368 non-null  float64
26   Curricular units 1st sem (without evaluations)  76368 non-null  int64
27   Curricular units 2nd sem (credited)             76368 non-null  int64
28   Curricular units 2nd sem (enrolled)             76368 non-null  int64
29   Curricular units 2nd sem (evaluations)          76368 non-null  int64
30   Curricular units 2nd sem (approved)             76368 non-null  int64
31   Curricular units 2nd sem (grade)                76368 non-null  float64
32   Curricular units 2nd sem (without evaluations)  76368 non-null  int64
33   Unemployment rate                               76368 non-null  float64
34   Inflation rate                                  76368 non-null  float64
35   GDP                                             76368 non-null  float64
dtypes: float64(7), int64(29)
memory usage: 21.0 MB

There is also a target column, which is kept on its own.

If you need more info, let me know and I will provide it. Thanks.

scar-fx commented 4 months ago

Here is the full error message:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/openfe/utils.py", line 102, in _cal
    _data = pd.read_feather('./openfe_tmp_data.feather', columns=base_features).set_index('openfe_index')
  File "/opt/conda/lib/python3.10/site-packages/pandas/io/feather_format.py", line 124, in read_feather
    return feather.read_feather(
  File "/opt/conda/lib/python3.10/site-packages/pyarrow/feather.py", line 226, in read_feather
    return (read_table(
  File "/opt/conda/lib/python3.10/site-packages/pyarrow/feather.py", line 262, in read_table
    table = reader.read_names(columns)
  File "pyarrow/_feather.pyx", line 114, in pyarrow._feather.FeatherReader.read_names
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Field named credited is not found


_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/openfe/utils.py", line 102, in _cal
    _data = pd.read_feather('./openfe_tmp_data.feather', columns=base_features).set_index('openfe_index')
  File "/opt/conda/lib/python3.10/site-packages/pandas/io/feather_format.py", line 124, in read_feather
    return feather.read_feather(
  File "/opt/conda/lib/python3.10/site-packages/pyarrow/feather.py", line 226, in read_feather
    return (read_table(
  File "/opt/conda/lib/python3.10/site-packages/pyarrow/feather.py", line 262, in read_table
    table = reader.read_names(columns)
  File "pyarrow/_feather.pyx", line 114, in pyarrow._feather.FeatherReader.read_names
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Field named credited is not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/opt/conda/lib/python3.10/site-packages/openfe/utils.py", line 111, in _cal
    exit()
NameError: name 'exit' is not defined
"""

The above exception was the direct cause of the following exception:

NameError                                 Traceback (most recent call last)
Cell In[62], line 2
      1 #transform the train and test data according to generated features.
----> 2 f_x_train, f_x_val = transform(x_train, x_val, features, n_jobs=1)

File /opt/conda/lib/python3.10/site-packages/openfe/utils.py:147, in transform(X_train, X_test, new_features_list, n_jobs, name)
    145 cat_feats = []
    146 for i, res in enumerate(results):
--> 147     is_cat, d1, d2, f = res.result()
    148     names.append('autoFEf%d' % i + name)
    149     names_map['autoFEf%d' % i + name] = f

File /opt/conda/lib/python3.10/concurrent/futures/_base.py:451, in Future.result(self, timeout)
    449     raise CancelledError()
    450 elif self._state == FINISHED:
--> 451     return self.__get_result()
    453 self._condition.wait(timeout)
    455 if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]:

File /opt/conda/lib/python3.10/concurrent/futures/_base.py:403, in Future.__get_result(self)
    401 if self._exception:
    402     try:
--> 403         raise self._exception
    404     finally:
    405         # Break a reference cycle with the exception in self._exception
    406         self = None

NameError: name 'exit' is not defined
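
The secondary NameError in the traceback above is separate from the ArrowInvalid error. `exit` is a helper that Python's `site` module adds for interactive use, so it may be missing in a worker process, and the failed call then hides the original error. A minimal sketch of a more robust pattern is to log the worker-side traceback and re-raise the original exception. The names `_cal_sketch` and `failing_loader` are hypothetical stand-ins and are not OpenFE's actual code:

```python
import traceback


def _cal_sketch(load_columns, base_features):
    """Hypothetical worker: load the requested columns, surfacing failures."""
    try:
        return load_columns(base_features)
    except Exception:
        # Print the worker-side traceback for debugging, then re-raise so the
        # parent process receives the original error instead of a NameError
        # raised by calling the interactive-only `exit()` helper.
        traceback.print_exc()
        raise


def failing_loader(columns):
    # Stand-in for a reader that rejects an unknown column name.
    raise KeyError(f"Field named {columns[0]} is not found")


try:
    _cal_sketch(failing_loader, ["credited"])
except KeyError as err:
    caught = err
```

With this pattern the caller sees the column-lookup failure directly, which is the error that actually needs fixing.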

ZhangTP1996 commented 4 months ago

It seems that the error comes from _data = pd.read_feather('./openfe_tmp_data.feather', columns=base_features).set_index('openfe_index'), but credited is not among the columns. Are you running multiple OpenFE processes on the same machine?

scar-fx commented 4 months ago

Only one OpenFE process. I am running it in a Kaggle notebook.
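
The missing field names in both tracebacks (evaluations, credited) match the text inside the parentheses of column names such as "Curricular units 1st sem (credited)". One possible explanation, not confirmed in this thread, is that column names containing parentheses are split when feature expressions are parsed. Under that assumption, a workaround sketch is to rename the columns to plain identifiers before calling fit and transform. The helper `sanitize_column_names` below is hypothetical and not part of OpenFE:

```python
import re


def sanitize_column_names(names):
    """Map each original column name to an identifier-like, unique name."""
    mapping = {}
    used = set()
    for name in names:
        # Replace every run of non-alphanumeric characters with one underscore.
        clean = re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_")
        candidate = clean
        suffix = 1
        # Append a numeric suffix if two names collapse to the same string.
        while candidate in used:
            suffix += 1
            candidate = f"{clean}_{suffix}"
        used.add(candidate)
        mapping[name] = candidate
    return mapping


columns = [
    "Curricular units 1st sem (credited)",
    "Curricular units 1st sem (without evaluations)",
    "Mother's qualification",
]
mapping = sanitize_column_names(columns)
# With pandas, the same mapping could be applied to both frames, e.g.
# X = X.rename(columns=mapping); test = test.rename(columns=mapping)
```

Any list passed as categorical_features would need to be translated through the same mapping so that the names stay consistent.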