IIIS-Li-Group / OpenFE

OpenFE: automated feature generation with expert-level performance
MIT License

A process in the process pool was terminated abruptly while the future was running or pending #13

Closed Totorosummer closed 1 year ago

Totorosummer commented 1 year ago

Hi, thanks for sharing. When I tried OpenFE, I ran into an error.

My code: openfe_feature = ofe.fit(data=train_df, label=label_df['label'], n_jobs=1)

The error is: BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

The traceback points to line 708 of openfe.py, res.extend(r.result()); it seems the result() call could not retrieve the result and failed.

I tried reducing the size of my data to about 10 samples and 10 potential features, but I still get the same error.

I am using Python 3.9.7 on Linux.

ZhangTP1996 commented 1 year ago

There should be some other error in the worker process that causes the BrokenProcessPool. Are any other errors reported? Also, can you successfully run the example code?
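To illustrate why BrokenProcessPool usually hides a different root cause, here is a minimal sketch that is independent of OpenFE. It uses only the standard library; `provoke_broken_pool` is a hypothetical helper name, and `os._exit` stands in for any abrupt worker death (a native crash, or the OS killing the process). Because the worker dies without raising a Python exception, the parent only sees the generic BrokenProcessPool.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def provoke_broken_pool():
    """Kill the only worker abruptly and return the exception name seen by the parent."""
    with ProcessPoolExecutor(max_workers=1) as executor:
        # os._exit terminates the worker immediately, with no traceback,
        # which is what an abrupt worker death looks like from the parent.
        future = executor.submit(os._exit, 1)
        try:
            future.result()
        except BrokenProcessPool as exc:
            return type(exc).__name__
    return None


if __name__ == "__main__":
    print(provoke_broken_pool())
```

If the worker had raised an ordinary Python exception instead, `future.result()` would re-raise that exception and the pool would stay usable, which is why checking for earlier error output is a useful first step.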

Totorosummer commented 1 year ago

Thanks for responding so quickly!

The California Housing dataset in the example seems to have an issue: when I ran the example code, it showed 'Network is unreachable', and the data website returned 403 Forbidden. I am trying to download the dataset now so I can run the example code again.

As for the process error, I didn't see any other error reported. After res.extend(r.result()) in openfe.py, the traceback continues into Python's standard library:

File "~/python3.9/concurrent/futures/_base.py", line 445, in result(self, timeout)
    return self.__get_result()
File "~/python3.9/concurrent/futures/_base.py", line 390, in __get_result(self)
    raise self._exception
BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Totorosummer commented 1 year ago

I tried the California Housing example code and it ran successfully. I'm not sure why it failed on my own data.

Totorosummer commented 1 year ago

Hi, I think I solved it. Adding if __name__ == '__main__': before openfe_feature = ofe.fit(****) makes it work now.

Some errors still occur, possibly because my label is a binary value (0/1). Are there any parameters that should be set for binary regression?
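The `if __name__ == '__main__':` fix can be sketched without OpenFE, as a rough illustration rather than OpenFE's actual code. `square` and `run_pool` are hypothetical names; the loop mirrors the `res.extend(r.result())` pattern from the traceback. The guard matters because, when worker processes are started by re-importing the main module (the spawn start method), any unguarded top-level code would run again in every worker.

```python
from concurrent.futures import ProcessPoolExecutor


def square(x):
    """Toy stand-in for the work done in a worker process."""
    return x * x


def run_pool(values, n_jobs=2, func=square):
    """Submit one task per value and collect the results in order."""
    results = []
    with ProcessPoolExecutor(max_workers=n_jobs) as executor:
        futures = [executor.submit(func, v) for v in values]
        for future in futures:
            # result() re-raises whatever went wrong in the worker.
            results.append(future.result())
    return results


if __name__ == "__main__":
    # Only the process that was launched as the script reaches this line;
    # workers that import this module skip it.
    print(run_pool([1, 2, 3]))
```

In a user script, the equivalent change is to move the data loading and the `ofe.fit(...)` call under the guard.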

ZhangTP1996 commented 1 year ago

OpenFE should be able to automatically handle binary-classification problems. What is the error and what is binary regression?

Totorosummer commented 1 year ago

Sorry for my typo, I meant binary classification. Please let me know if I should close this issue and open a new one.

For the example (the California Housing problem), if I change the label to a binary-classification label, I get the error: Input contains NaN, infinity or a value too large for dtype('float64').

The change I made was to replace train_Y with pd.DataFrame([0 if x <=2.5 else 1 for x in train_Y['MedHouseVal']])

The new label contains only 0 or 1, and every sample has a label, but I still get the input error.

ZhangTP1996 commented 1 year ago

This is probably because train_Y no longer has the same index as train_X: a DataFrame built from a plain list gets a fresh 0..n-1 index. pd.DataFrame([0 if x <=2.5 else 1 for x in train_Y['MedHouseVal']], index=train_Y.index) should be correct.
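A small pandas-only sketch of the index mismatch, using made-up values and index labels. Whether OpenFE aligns data and label on the index internally is an assumption based on the comment above; the sketch only shows what pandas does when two frames with different indexes are aligned.

```python
import pandas as pd

# Made-up stand-in for train_Y after a shuffled train/test split.
train_y = pd.DataFrame({"MedHouseVal": [1.0, 3.0, 2.0]}, index=[7, 2, 5])
binary = [0 if x <= 2.5 else 1 for x in train_y["MedHouseVal"]]

# Built from a plain list: the index is reset to 0, 1, 2.
reset = pd.DataFrame(binary, columns=["new_label"])

# Built with the original index: rows still line up with the features.
kept = pd.DataFrame(binary, columns=["new_label"], index=train_y.index)

# Aligning the reset labels to the original index leaves gaps as NaN.
misaligned = reset.reindex(train_y.index)
print(misaligned["new_label"].isna().sum())
print(kept.index.equals(train_y.index))
```

Only label 2 exists in both indexes here, so the other two rows become NaN after alignment, which is the kind of input that triggers the "Input contains NaN" check.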

Totorosummer commented 1 year ago

I tried to fix the index problem, but I still get the same error. Here is my code and the error report:

if __name__ == '__main__':
    n_jobs = 4
    data = fetch_california_housing(as_frame=True).frame
    label = data[['MedHouseVal']]
    del data['medHouseVal']
    train_x, test_x, train_y, test_y = train_test_split(data, label, test_size=0.2, random_state=1)
    label_new_df = pd.DataFrame([0 if x <= 2.5 else 1 for x in train_Y['MedHouseVal']], columns=['new_label'])
    label_new_df.index = train_y.index
    ofe = openfe()
    ofe.fit(data=train_x, label=label_new_df, n_jobs=n_jobs)

The error report is the same as before: Input contains NaN, infinity or a value too large for dtype('float64')

ZhangTP1996 commented 1 year ago

I can run the following code successfully. There are two small differences: del data['medHouseVal'] -> del data['MedHouseVal'] and train_Y['MedHouseVal'] -> train_y['MedHouseVal'].

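import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
# `openfe` is assumed to be imported from the openfe package, as in the code above.

n_jobs = 4
data = fetch_california_housing(as_frame=True).frame
label = data[['MedHouseVal']]
del data['MedHouseVal']  # column name is case-sensitive
train_x, test_x, train_y, test_y = train_test_split(data, label, test_size=0.2, random_state=1)
# Binarize the target; note train_y (lowercase), the name returned by the split
label_new_df = pd.DataFrame([0 if x <= 2.5 else 1 for x in train_y['MedHouseVal']], columns=['new_label'])
label_new_df.index = train_y.index  # keep labels aligned with train_x
ofe = openfe()
ofe.fit(data=train_x, label=label_new_df, n_jobs=n_jobs)
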
Totorosummer commented 1 year ago

Hi, I solved the issue by adding one line:

label_new_df['new_label'] = label_new_df['new_label'].astype(np.float64)

Thanks for your time. I will close this issue.
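For reference, a minimal pandas/NumPy sketch of that cast on made-up labels. The thread does not explain why the float cast resolves the error, so this only shows what the line does: the values and the index are unchanged, and only the dtype moves from integer to float64.

```python
import numpy as np
import pandas as pd

# Made-up binary labels with a non-default index, as after a train/test split.
label_new_df = pd.DataFrame({"new_label": [0, 1, 0]}, index=[7, 2, 5])

# The one-line fix from the thread: cast the label column to float64.
label_new_df["new_label"] = label_new_df["new_label"].astype(np.float64)

print(label_new_df["new_label"].dtype)
print(label_new_df.index.tolist())
```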