fireducks-dev / fireducks

Not working On Catboost #21

Open mehdii190 opened 2 weeks ago

mehdii190 commented 2 weeks ago

KeyError                                  Traceback (most recent call last)
in <cell line: 11>()
     10
     11 cat_model.fit(
---> 12 Pool(x_train, y_train),
     13 early_stopping_rounds=20,
     14 verbose=100

15 frames

/usr/local/lib/python3.10/dist-packages/catboost/core.py in __init__(self, data, label, cat_features, text_features, embedding_features, embedding_features_data, column_description, pairs, graph, delimiter, has_header, ignore_csv_quoting, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, timestamp, feature_names, feature_tags, thread_count, log_cout, log_cerr, data_can_be_none)
    853 )
    854
--> 855 self._init(data, label, cat_features, text_features, embedding_features, embedding_features_data, pairs, graph, weight,
    856 group_id, group_weight, subgroup_id, pairs_weight, baseline, timestamp, feature_names, feature_tags, thread_count)
    857 elif not data_can_be_none:

/usr/local/lib/python3.10/dist-packages/catboost/core.py in _init(self, data, label, cat_features, text_features, embedding_features, embedding_features_data, pairs, graph, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, timestamp, feature_names, feature_tags, thread_count)
   1489 if feature_tags is not None:
   1490 feature_tags = self._check_transform_tags(feature_tags, feature_names)
-> 1491 self._init_pool(data, label, cat_features, text_features, embedding_features, embedding_features_data, pairs, graph, weight,
   1492 group_id, group_weight, subgroup_id, pairs_weight, baseline, timestamp, feature_names, feature_tags, thread_count)
   1493

_catboost.pyx in _catboost._PoolBase._init_pool()

_catboost.pyx in _catboost._PoolBase._init_pool()

_catboost.pyx in _catboost._PoolBase._init_objects_order_layout_pool()

_catboost.pyx in _catboost._set_data()

_catboost.pyx in _catboost._set_data_from_generic_matrix()

/usr/local/lib/python3.10/dist-packages/fireducks/pandas/series.py in __getitem__(self, key)
    793
    794 reason = f"Unsupported key type: {type(key)}"
--> 795 return self._fallback_call(
    796 "__getitem__", key, __fireducks_reason=reason
    797 )

/usr/local/lib/python3.10/dist-packages/fireducks/pandas/generic.py in _fallback_call(self, _FireDucksPandasCompat__fireducks_method, *args, **kwargs)
    357 def _fallback_call(self, __fireducks_method, *args, **kwargs):
    358 reason = kwargs.pop("__fireducks_reason", None)
--> 359 return utils.fallback_call_packed(
    360 self._unwrap, __fireducks_method, args, kwargs, reason=reason
    361 )

/usr/local/lib/python3.10/dist-packages/fireducks/pandas/utils.py in fallback_call_packed(fallbacker, method, args, kwargs, reason, stacklevel)
    166 ):
    167 options = fireducks.core.get_fireducks_options()
--> 168 return ff.fallback_call(
    169 fallbacker,
    170 method,

/usr/local/lib/python3.10/dist-packages/fireducks/fallback.py in fallback_call(fallbacker, method, args, kwargs, reason, stacklevel, wrap_func, unwrap_func, log_lineno, warn_fallback)
    205 stacklevel: Level from user code to warnings.warn
    206 """
--> 207 method = fallback_attr(
    208 fallbacker,
    209 method,

/usr/local/lib/python3.10/dist-packages/fireducks/fallback.py in fallback_attr(fallbacker, name, reason, stacklevel, wrap_func, unwrap_func, log_lineno, warn_fallback)
    125
    126 with warn_builder.timing("getobj"):
--> 127 obj = fallbacker(reason=reason)
    128
    129 logger.info(

/usr/local/lib/python3.10/dist-packages/fireducks/pandas/generic.py in _unwrap(self, reason)
    341
    342 def _unwrap(self, reason=None):
--> 343 return self.to_pandas(reason=f"unwrap ({reason})")
    344
    345 def _get_fallback(self, inplace):

/usr/local/lib/python3.10/dist-packages/fireducks/pandas/generic.py in to_pandas(self, options, reason)
   1252 f"{self.__class__.__name__}._to_pandas was called"
   1253 )
-> 1254 self._fireducks_meta.set_cache(self._to_pandas(options=options))
   1255
   1256 return self._fireducks_meta.get_cache()

/usr/local/lib/python3.10/dist-packages/fireducks/pandas/series.py in _to_pandas(self, options)
    741 from fireducks.pandas.frame import _to_pandas_frame_metadata
    742
--> 743 result = _to_pandas_frame_metadata(self._value, options)
    744 assert (
    745 isinstance(result, pandas.DataFrame)

/usr/local/lib/python3.10/dist-packages/fireducks/pandas/frame.py in _to_pandas_frame_metadata(value, options)
   1893 def _to_pandas_frame_metadata(value, options=None):
   1894 v0, v1 = ir.to_pandas_frame_metadata(value)
-> 1895 df, meta = fireducks.core.evaluate([v0, v1], options=options)
   1896 logger.debug("to_pandas_frame_metadata: meta=%s", meta)
   1897 with tracing.scope(tracing.Level.VERBOSE, "to_pandas:metadata.apply"):

/usr/local/lib/python3.10/dist-packages/fireducks/core.py in evaluate(values, options, evalLogger)
    364 ):
    365 with tracing.scope(tracing.Level.DEFAULT, "fireducks.core.evaluate"):
--> 366 ret = _evaluate(values, options, evalLogger)
    367 return ret
    368

/usr/local/lib/python3.10/dist-packages/fireducks/core.py in _evaluate(values, options, evalLogger)
    343
    344 try:
--> 345 return fire.evaluate(values, wrapper, package="fireducks")
    346 except fireducks_ext.IndexingError as e:
    347 raise pandas.errors.IndexingError(e)

/usr/local/lib/python3.10/dist-packages/firefw/runtime.py in wrapper(*args, **kwargs)
     29 lock = True
     30 try:
---> 31 ret = func(*args, **kwargs)
     32 finally:
     33 lock = False

/usr/local/lib/python3.10/dist-packages/firefw/runtime.py in evaluate(values, executor, package)
     70 )
     71
---> 72 ret = executor(source, input_values, output_values)
     73
     74 if len(output_values) != len(ret):

/usr/local/lib/python3.10/dist-packages/fireducks/core.py in wrapper(ir, input_values, output_values)
    335 fi.output_types = [v.mlir_type for v in output_values]
    336
--> 337 return fireducks_ext.execute(
    338 context().ext_context,
    339 options._compile_options,

KeyError: '0'

I think it needs to be fixed. I hope you will fix the bugs.

qsourav commented 1 week ago

Hi @mehdii190 ,

Thank you very much for reporting the issue.

While investigating the root cause, we found that this is a rather special case. To pass a FireDucks instance to an external library such as catboost, we need to apply a monkey patch so that "import pandas" automatically acts as "import fireducks.pandas" inside that library.
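To illustrate the general idea of redirecting an import, here is a minimal sketch using only the standard library. It is not FireDucks' actual implementation; the module names slow_frames and fast_frames are hypothetical stand-ins for pandas and fireducks.pandas.

```python
import sys
import types

# Hypothetical stand-ins: "slow_frames" plays the role of pandas,
# "fast_frames" plays the role of fireducks.pandas.
slow = types.ModuleType("slow_frames")
slow.BACKEND = "slow"
fast = types.ModuleType("fast_frames")
fast.BACKEND = "fast"

sys.modules["slow_frames"] = slow
sys.modules["fast_frames"] = fast


def backend_after_patch():
    # The "monkey patch": make the old module name resolve to the new module.
    sys.modules["slow_frames"] = fast
    # Any later `import slow_frames` (for example inside a third-party
    # library) now receives the replacement module.
    import slow_frames
    return slow_frames.BACKEND


backend_seen_by_library = backend_after_patch()
print(backend_seen_by_library)
```

A redirect of this kind only affects code whose import is resolved after the patch is in place and through the patched mechanism, which is one reason compiled extension modules can be harder to cover.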

For the catboost library, the code flow seems to pass through a .pyx implementation that imports pandas. The current monkey patch works for normal .py files, but it does not seem easy to make it work for .pyx files. Hence, the following isinstance check fails: https://github.com/catboost/catboost/blob/master/catboost/python-package/catboost/_catboost.pyx#L4381

As a result, the code falls into the following else branch (ideally it should go through the if branch): https://github.com/catboost/catboost/blob/master/catboost/python-package/catboost/_catboost.pyx#L4409
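The effect of a failing isinstance check can be sketched as follows. This is not catboost's code: OriginalFrame, CompatFrame and set_data are hypothetical names, and the sketch assumes the compatible class is not a subclass of the class the library checks against.

```python
class OriginalFrame:
    """Hypothetical stand-in for the DataFrame class the library imported."""

    def __init__(self, rows):
        self.rows = rows


class CompatFrame:
    """Hypothetical stand-in for a compatible class that is NOT a subclass."""

    def __init__(self, rows):
        self.rows = rows


def set_data(data):
    # The library dispatches on the class it imported itself.
    if isinstance(data, OriginalFrame):
        return "dataframe path"
    # Anything else is treated as a generic matrix and indexed element-wise.
    return "generic matrix path"


print(set_data(OriginalFrame([[1, 2]])))
print(set_data(CompatFrame([[1, 2]])))
```

In the traceback above, the generic path corresponds to _set_data_from_generic_matrix, which indexes the object directly and ends in the KeyError.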

It seems difficult to support such a case quickly. In your case, what are the types of x_train and x_test? If they are DataFrame or Series instances, could you please try the following instead?

Pool(x_train.values, y_train.values),
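
If the inputs may or may not be DataFrame or Series instances, a small helper can apply the same conversion only when it is available. This is a sketch: to_plain and FakeFrame are hypothetical names, and the helper assumes that a non-callable .values attribute holds the underlying array.

```python
def to_plain(data):
    """Return data.values when it is a plain attribute, otherwise data unchanged."""
    values = getattr(data, "values", None)
    # dict.values is a method, not an array, so callables are left alone.
    if values is None or callable(values):
        return data
    return values


class FakeFrame:
    """Hypothetical stand-in for an object exposing `.values`."""

    def __init__(self, rows):
        self.values = rows


print(to_plain(FakeFrame([[1, 2], [3, 4]])))
print(to_plain([5, 6]))

# Intended use with the call from this issue (not executed here):
#   Pool(to_plain(x_train), to_plain(y_train))
```

Note that a plain array does not carry column names; if they matter, they may need to be passed separately, for example through the feature_names parameter that appears in the Pool signature in the traceback.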