IIIS-Li-Group / OpenFE

OpenFE: automated feature generation with expert-level performance
MIT License

pyarrow.lib.ArrowInvalid: Field named proBNP is not found #35

Closed: lisp1 closed this issue 1 year ago

lisp1 commented 1 year ago

When running train_x, test_x = transform(train_x, test_x, features, n_jobs=16), the following error occurs. The same error occurs in environments based on both Python 3.9 and Python 3.11. When the value of n_jobs changes, the message "pyarrow.lib.ArrowInvalid: Field named proBNP is not found" may change to "pyarrow.lib.ArrowInvalid: Field named NT is not found". It would be great if you could help solve this, thank you!

_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
  File "D:\anaconda\envs\pytorch_gpu\lib\site-packages\openfe\utils.py", line 102, in _cal
    _data = pd.read_feather('./openfe_tmp_data.feather', columns=base_features).set_index('openfe_index')
  File "D:\anaconda\envs\pytorch_gpu\lib\site-packages\pandas\io\feather_format.py", line 126, in read_feather
    return feather.read_feather(
  File "D:\anaconda\envs\pytorch_gpu\lib\site-packages\pyarrow\feather.py", line 226, in read_feather
    return (read_table(
  File "D:\anaconda\envs\pytorch_gpu\lib\site-packages\pyarrow\feather.py", line 262, in read_table
    table = reader.read_names(columns)
  File "pyarrow\_feather.pyx", line 114, in pyarrow._feather.FeatherReader.read_names
  File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Field named proBNP is not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\anaconda\envs\pytorch_gpu\lib\concurrent\futures\process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "D:\anaconda\envs\pytorch_gpu\lib\site-packages\openfe\utils.py", line 111, in _cal
    exit()
  File "D:\anaconda\envs\pytorch_gpu\lib\_sitebuiltins.py", line 26, in __call__
    raise SystemExit(code)
SystemExit: None
"""

The above exception was the direct cause of the following exception:

SystemExit Traceback (most recent call last)
[... skipping hidden 1 frame]

Cell In[26], line 1
----> 1 train_x, test_x = transform(train_x, test_x, features, n_jobs=4)

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\openfe\utils.py:147, in transform(X_train, X_test, new_features_list, n_jobs, name)
    146 for i, res in enumerate(results):
--> 147     is_cat, d1, d2, f = res.result()
    148     names.append('autoFEf%d' % i + name)

File D:\anaconda\envs\pytorch_gpu\lib\concurrent\futures\_base.py:439, in Future.result(self, timeout)
    438 elif self._state == FINISHED:
--> 439     return self.__get_result()
    441 self._condition.wait(timeout)

File D:\anaconda\envs\pytorch_gpu\lib\concurrent\futures\_base.py:391, in Future.__get_result(self)
    390 try:
--> 391     raise self._exception
    392 finally:
    393     # Break a reference cycle with the exception in self._exception

SystemExit: None

During handling of the above exception, another exception occurred:

AttributeError Traceback (most recent call last)
[... skipping hidden 1 frame]

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\interactiveshell.py:2095, in InteractiveShell.showtraceback(self, exc_tuple, filename, tb_offset, exception_only, running_compiled_code)
   2092 if exception_only:
   2093     stb = ['An exception has occurred, use %tb to see '
   2094            'the full traceback.\n']
-> 2095     stb.extend(self.InteractiveTB.get_exception_only(etype,
   2096                                                      value))
   2097 else:
   2098     try:
   2099         # Exception classes can customise their traceback - we
   2100         # use this in IPython.parallel for exceptions occurring
   2101         # in the engines. This should return a list of strings.

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\ultratb.py:696, in ListTB.get_exception_only(self, etype, value)
    688 def get_exception_only(self, etype, value):
    689     """Only print the exception type and message, without a traceback.
    690
    691     Parameters
   (...)
    694     value : exception value
    695     """
--> 696     return ListTB.structured_traceback(self, etype, value)

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\ultratb.py:559, in ListTB.structured_traceback(self, etype, evalue, etb, tb_offset, context)
    556     chained_exc_ids.add(id(exception[1]))
    557     chained_exceptions_tb_offset = 0
    558     out_list = (
--> 559         self.structured_traceback(
    560             etype,
    561             evalue,
    562             (etb, chained_exc_ids),  # type: ignore
    563             chained_exceptions_tb_offset,
    564             context,
    565         )
    566         + chained_exception_message
    567         + out_list)
    569 return out_list

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\ultratb.py:1396, in AutoFormattedTB.structured_traceback(self, etype, evalue, etb, tb_offset, number_of_lines_of_context)
   1394 else:
   1395     self.tb = etb
-> 1396 return FormattedTB.structured_traceback(
   1397     self, etype, evalue, etb, tb_offset, number_of_lines_of_context
   1398 )

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\ultratb.py:1287, in FormattedTB.structured_traceback(self, etype, value, tb, tb_offset, number_of_lines_of_context)
   1284 mode = self.mode
   1285 if mode in self.verbose_modes:
   1286     # Verbose modes need a full traceback
-> 1287     return VerboseTB.structured_traceback(
   1288         self, etype, value, tb, tb_offset, number_of_lines_of_context
   1289     )
   1290 elif mode == 'Minimal':
   1291     return ListTB.get_exception_only(self, etype, value)

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\ultratb.py:1140, in VerboseTB.structured_traceback(self, etype, evalue, etb, tb_offset, number_of_lines_of_context)
   1131 def structured_traceback(
   1132     self,
   1133     etype: type,
   (...)
   1137     number_of_lines_of_context: int = 5,
   1138 ):
   1139     """Return a nice text document describing the traceback."""
-> 1140     formatted_exception = self.format_exception_as_a_whole(etype, evalue, etb, number_of_lines_of_context,
   1141                                                            tb_offset)
   1143     colors = self.Colors  # just a shorthand + quicker name lookup
   1144     colorsnormal = colors.Normal  # used a lot

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\ultratb.py:1030, in VerboseTB.format_exception_as_a_whole(self, etype, evalue, etb, number_of_lines_of_context, tb_offset)
   1027 assert isinstance(tb_offset, int)
   1028 head = self.prepare_header(str(etype), self.long_header)
   1029 records = (
-> 1030     self.get_records(etb, number_of_lines_of_context, tb_offset) if etb else []
   1031 )
   1033 frames = []
   1034 skipped = 0

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\ultratb.py:1098, in VerboseTB.get_records(self, etb, number_of_lines_of_context, tb_offset)
   1096 while cf is not None:
   1097     try:
-> 1098         mod = inspect.getmodule(cf.tb_frame)
   1099         if mod is not None:
   1100             mod_name = mod.__name__

AttributeError: 'tuple' object has no attribute 'tb_frame'

lisp1 commented 1 year ago

Suspecting that this might be related to the I/O of the feather files, I switched to CSV, but a similar error about NT and proBNP still occurs:

_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
  File "D:\anaconda\envs\pytorch_gpu\lib\site-packages\openfe\utils.py", line 103, in _cal
    _data = pd.read_csv('./openfe_tmp_data.csv', usecols=base_features).set_index('openfe_index')
  File "D:\anaconda\envs\pytorch_gpu\lib\site-packages\pandas\io\parsers\readers.py", line 948, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "D:\anaconda\envs\pytorch_gpu\lib\site-packages\pandas\io\parsers\readers.py", line 611, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "D:\anaconda\envs\pytorch_gpu\lib\site-packages\pandas\io\parsers\readers.py", line 1448, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "D:\anaconda\envs\pytorch_gpu\lib\site-packages\pandas\io\parsers\readers.py", line 1723, in _make_engine
    return mapping[engine](f, **self.options)
  File "D:\anaconda\envs\pytorch_gpu\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 140, in __init__
    self._validate_usecols_names(usecols, self.orig_names)
  File "D:\anaconda\envs\pytorch_gpu\lib\site-packages\pandas\io\parsers\base_parser.py", line 969, in _validate_usecols_names
    raise ValueError(
ValueError: Usecols do not match columns, columns expected but not found: ['proBNP', 'NT']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\anaconda\envs\pytorch_gpu\lib\concurrent\futures\process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "D:\anaconda\envs\pytorch_gpu\lib\site-packages\openfe\utils.py", line 112, in _cal
    exit()
  File "D:\anaconda\envs\pytorch_gpu\lib\_sitebuiltins.py", line 26, in __call__
    raise SystemExit(code)
SystemExit: None
"""

The above exception was the direct cause of the following exception:

SystemExit Traceback (most recent call last)
[... skipping hidden 1 frame]

Cell In[15], line 1
----> 1 train_x, test_x = transform(train_x, test_x, features, n_jobs=16)

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\openfe\utils.py:149, in transform(X_train, X_test, new_features_list, n_jobs, name)
    148 for i, res in enumerate(results):
--> 149     is_cat, d1, d2, f = res.result()
    150     names.append('autoFEf%d' % i + name)

File D:\anaconda\envs\pytorch_gpu\lib\concurrent\futures\_base.py:439, in Future.result(self, timeout)
    438 elif self._state == FINISHED:
--> 439     return self.__get_result()
    441 self._condition.wait(timeout)

File D:\anaconda\envs\pytorch_gpu\lib\concurrent\futures\_base.py:391, in Future.__get_result(self)
    390 try:
--> 391     raise self._exception
    392 finally:
    393     # Break a reference cycle with the exception in self._exception

SystemExit: None

During handling of the above exception, another exception occurred:

AttributeError Traceback (most recent call last)
[... skipping hidden 1 frame]

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\interactiveshell.py:2095, in InteractiveShell.showtraceback(self, exc_tuple, filename, tb_offset, exception_only, running_compiled_code)
   2092 if exception_only:
   2093     stb = ['An exception has occurred, use %tb to see '
   2094            'the full traceback.\n']
-> 2095     stb.extend(self.InteractiveTB.get_exception_only(etype,
   2096                                                      value))
   2097 else:
   2098     try:
   2099         # Exception classes can customise their traceback - we
   2100         # use this in IPython.parallel for exceptions occurring
   2101         # in the engines. This should return a list of strings.

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\ultratb.py:696, in ListTB.get_exception_only(self, etype, value)
    688 def get_exception_only(self, etype, value):
    689     """Only print the exception type and message, without a traceback.
    690
    691     Parameters
   (...)
    694     value : exception value
    695     """
--> 696     return ListTB.structured_traceback(self, etype, value)

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\ultratb.py:559, in ListTB.structured_traceback(self, etype, evalue, etb, tb_offset, context)
    556     chained_exc_ids.add(id(exception[1]))
    557     chained_exceptions_tb_offset = 0
    558     out_list = (
--> 559         self.structured_traceback(
    560             etype,
    561             evalue,
    562             (etb, chained_exc_ids),  # type: ignore
    563             chained_exceptions_tb_offset,
    564             context,
    565         )
    566         + chained_exception_message
    567         + out_list)
    569 return out_list

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\ultratb.py:1396, in AutoFormattedTB.structured_traceback(self, etype, evalue, etb, tb_offset, number_of_lines_of_context)
   1394 else:
   1395     self.tb = etb
-> 1396 return FormattedTB.structured_traceback(
   1397     self, etype, evalue, etb, tb_offset, number_of_lines_of_context
   1398 )

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\ultratb.py:1287, in FormattedTB.structured_traceback(self, etype, value, tb, tb_offset, number_of_lines_of_context)
   1284 mode = self.mode
   1285 if mode in self.verbose_modes:
   1286     # Verbose modes need a full traceback
-> 1287     return VerboseTB.structured_traceback(
   1288         self, etype, value, tb, tb_offset, number_of_lines_of_context
   1289     )
   1290 elif mode == 'Minimal':
   1291     return ListTB.get_exception_only(self, etype, value)

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\ultratb.py:1140, in VerboseTB.structured_traceback(self, etype, evalue, etb, tb_offset, number_of_lines_of_context)
   1131 def structured_traceback(
   1132     self,
   1133     etype: type,
   (...)
   1137     number_of_lines_of_context: int = 5,
   1138 ):
   1139     """Return a nice text document describing the traceback."""
-> 1140     formatted_exception = self.format_exception_as_a_whole(etype, evalue, etb, number_of_lines_of_context,
   1141                                                            tb_offset)
   1143     colors = self.Colors  # just a shorthand + quicker name lookup
   1144     colorsnormal = colors.Normal  # used a lot

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\ultratb.py:1030, in VerboseTB.format_exception_as_a_whole(self, etype, evalue, etb, number_of_lines_of_context, tb_offset)
   1027 assert isinstance(tb_offset, int)
   1028 head = self.prepare_header(str(etype), self.long_header)
   1029 records = (
-> 1030     self.get_records(etb, number_of_lines_of_context, tb_offset) if etb else []
   1031 )
   1033 frames = []
   1034 skipped = 0

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\ultratb.py:1098, in VerboseTB.get_records(self, etb, number_of_lines_of_context, tb_offset)
   1096 while cf is not None:
   1097     try:
-> 1098         mod = inspect.getmodule(cf.tb_frame)
   1099         if mod is not None:
   1100             mod_name = mod.__name__

AttributeError: 'tuple' object has no attribute 'tb_frame'

ZhangTP1996 commented 1 year ago

I have two questions that may help identify the bug. 1. Are proBNP or NT features of the dataset? 2. Are you running multiple OpenFE instances on different datasets at the same time?

lisp1 commented 1 year ago

Hey ZhangTP1996, thanks for your prompt and helpful reply! There is one feature in my dataset named "NT-proBNP". After renaming it to proBNP, the problem is solved. It turns out that a "-" in a feature name can cause this problem. After further testing, I also noticed that in some cases "(", ")" and "/" in a feature name can cause it as well. For example, a feature whose name contains "g/L" results in the error "Field named L is not found" (feather version) or "Usecols do not match columns, columns expected but not found: ['L']" (csv version). This should provide some hints for further improvement.
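The pattern reported above (a missing field such as proBNP, NT, or L that is a piece of a longer column name) can be checked mechanically. The following is a minimal sketch, not part of OpenFE: the helper name find_source_columns and the sample column names "age" and "albumin(g/L)" are hypothetical, and it assumes that names are split at any character other than a letter, digit, or underscore, which the thread suggests but does not confirm.

```python
import re


def find_source_columns(columns, missing_field):
    """Return the columns that contain `missing_field` as a fragment
    delimited by non-word characters such as "-", "(", ")" or "/"."""
    matches = []
    for col in columns:
        # Split on every run of characters that are not letters, digits, or "_".
        fragments = [part for part in re.split(r"\W+", col) if part]
        # A column that equals the field exactly is not a fragment match.
        if col != missing_field and missing_field in fragments:
            matches.append(col)
    return matches


# Hypothetical column names, modelled on the ones mentioned in this thread.
columns = ["age", "NT-proBNP", "albumin(g/L)"]
for field in ("proBNP", "NT", "L"):
    print(field, "->", find_source_columns(columns, field))
```

If a field named in the error message maps back to a column this way, renaming that column is the workaround that resolved the problem here.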

FitHubWHY commented 8 months ago

I also encountered this issue because a column name contains "-".
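Since the workaround found in this thread is to rename the affected columns, the renaming can be done in bulk before calling OpenFE. The sketch below is one possible approach, not an OpenFE feature: make_safe_names is a hypothetical helper, the sample names are illustrative, and it assumes that underscores are safe in feature names, which nobody in this thread has verified.

```python
import re


def make_safe_names(columns):
    """Map each original column name to a name that contains only
    letters, digits, and underscores, keeping the new names unique."""
    mapping = {}
    used = set()
    for col in columns:
        # Replace every character that is not a letter, digit, or "_".
        safe = re.sub(r"\W", "_", str(col))
        # Resolve collisions, e.g. "a-b" and "a/b" would both become "a_b".
        candidate, suffix = safe, 1
        while candidate in used:
            candidate = f"{safe}_{suffix}"
            suffix += 1
        used.add(candidate)
        mapping[col] = candidate
    return mapping


# Hypothetical column names, modelled on the ones mentioned in this thread.
mapping = make_safe_names(["NT-proBNP", "albumin(g/L)", "age"])
print(mapping)
```

With pandas, the mapping can be applied to both frames through DataFrame.rename(columns=mapping) before feature generation, and the inverse dictionary can be kept to translate the generated feature descriptions back to the original names.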

xxllp commented 5 months ago

I have the same problem.