JDASoftwareGroup / kartothek

A consistent table management library in python
https://kartothek.readthedocs.io/en/stable
MIT License

Nightly tests fail with NotImplementedError on MultiIndex concat #464

Closed by xhochy 3 years ago

xhochy commented 3 years ago

The nightlies have been failing for a while due to a change in pandas. A git bisect points to https://github.com/pandas-dev/pandas/pull/38671

Detailed traceback of one of the tests:

```
_____ test_update_shuffle_buckets[update_dataset_from_ddf-None-5-1-1-3-4] ______

store_factory = functools.partial(, 'hfs:///tmp/pytest-of-runner/pytest-0/test_update_shuffle_buckets_up3/store')
unique_primaries = 4, unique_secondaries = 3, num_buckets = 1, repartition = 1
npartitions = 5, bucket_by = None
func =

    @pytest.mark.parametrize("unique_primaries", [1, 4])
    @pytest.mark.parametrize("unique_secondaries", [1, 3])
    @pytest.mark.parametrize("num_buckets", [1, 5])
    @pytest.mark.parametrize("repartition", [1, 2])
    @pytest.mark.parametrize("npartitions", [5, 10])
    @pytest.mark.parametrize("bucket_by", [None, "sorted_column"])
    @pytest.mark.parametrize("func", [update_dataset_from_ddf, store_dataset_from_ddf])
    def test_update_shuffle_buckets(
        store_factory,
        unique_primaries,
        unique_secondaries,
        num_buckets,
        repartition,
        npartitions,
        bucket_by,
        func,
    ):
        """
        Assert that certain properties are always given for the output dataset
        no matter how the input data distribution looks like

        Properties to assert:
        * All partitions have a unique value for its correspondent primary key
        * number of partitions is at least one per unique partition value, at
          most ``num_buckets`` per primary partition value.
        * If we demand a column to be sorted it is per partition monotonic
        """

        primaries = np.arange(unique_primaries)
        secondary = np.arange(unique_secondaries)
        num_rows = 100
        primaries = np.repeat(primaries, np.ceil(num_rows / unique_primaries))[:num_rows]
        secondary = np.repeat(secondary, np.ceil(num_rows / unique_secondaries))[:num_rows]
        # ensure that there is an unsorted column uncorrelated
        # to the primary and secondary columns which can be sorted later on per partition
        unsorted_column = np.repeat(np.arange(100 / 10), 10)
        np.random.shuffle(unsorted_column)
        np.random.shuffle(primaries)
        np.random.shuffle(secondary)

        df = pd.DataFrame(
            {"primary": primaries, "secondary": secondary, "sorted_column": unsorted_column}
        )
        secondary_indices = ["secondary"]
        expected_num_indices = 2  # One primary

        # used for tests later on to
        if bucket_by:
            secondary_indices.append(bucket_by)
            expected_num_indices = 3

        # shuffle all rows. properties of result should be reproducible
        df = df.sample(frac=1).reset_index(drop=True)
        ddf = dd.from_pandas(df, npartitions=npartitions)

        dataset_comp = func(
            ddf,
            store_factory,
            dataset_uuid="output_dataset_uuid",
            table="core",
            secondary_indices=secondary_indices,
            shuffle=True,
            bucket_by=bucket_by,
            repartition_ratio=repartition,
            num_buckets=num_buckets,
            sort_partitions_by="sorted_column",
            partition_on=["primary"],
        )

        s = pickle.dumps(dataset_comp, pickle.HIGHEST_PROTOCOL)
        dataset_comp = pickle.loads(s)

>       dataset = dataset_comp.compute()

tests/io/dask/dataframe/test_shuffle.py:166:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/dask/base.py:284: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/dask/base.py:566: in compute
    results = schedule(dsk, keys, **kwargs)
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/dask/local.py:560: in get_sync
    return get_async(
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/dask/local.py:503: in get_async
    for key, res_info, failed in queue_get(queue).result():
/usr/share/miniconda/envs/test/lib/python3.8/concurrent/futures/_base.py:432: in result
    return self.__get_result()
/usr/share/miniconda/envs/test/lib/python3.8/concurrent/futures/_base.py:388: in __get_result
    raise self._exception
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/dask/local.py:545: in submit
    fut.set_result(fn(*args, **kwargs))
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/dask/local.py:237: in batch_execute_tasks
    return [execute_task(*a) for a in it]
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/dask/local.py:237: in
    return [execute_task(*a) for a in it]
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/dask/local.py:228: in execute_task
    result = pack_exception(e, dumps)
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/dask/local.py:223: in execute_task
    result = _execute_task(task, data)
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/dask/core.py:121: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/dask/core.py:121: in
    return func(*(_execute_task(a, cache) for a in args))
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/dask/core.py:115: in _execute_task
    return [_execute_task(a, cache) for a in arg]
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/dask/core.py:115: in
    return [_execute_task(a, cache) for a in arg]
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/dask/core.py:121: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/dask/dataframe/core.py:103: in _concat
    else methods.concat(args2, uniform=True, ignore_index=ignore_index)
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/dask/dataframe/methods.py:429: in concat
    return func(
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/dask/dataframe/methods.py:567: in concat_pandas
    out = pd.concat(dfs3, join=join, sort=False)
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/pandas/core/reshape/concat.py:290: in concat
    op = _Concatenator(
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/pandas/core/reshape/concat.py:470: in __init__
    self.new_axes = self._get_new_axes()
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/pandas/core/reshape/concat.py:540: in _get_new_axes
    return [
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/pandas/core/reshape/concat.py:541: in
    self._get_concat_axis if i == self.bm_axis else self._get_comb_axis(i)
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/pandas/core/reshape/concat.py:547: in _get_comb_axis
    return get_objs_combined_axis(
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/pandas/core/indexes/api.py:98: in get_objs_combined_axis
    return _get_combined_index(obs_idxes, intersect=intersect, sort=sort, copy=copy)
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/pandas/core/indexes/api.py:151: in _get_combined_index
    index = union_indexes(indexes, sort=sort)
/usr/share/miniconda/envs/test/lib/python3.8/site-packages/pandas/core/indexes/api.py:223: in union_indexes
    result = result.union(other)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = MultiIndex([(1, 0),
            (2, 0),
            (3, 0)],
           names=['primary', '__KTK_HASH_BUCKET'])
other = RangeIndex(start=0, stop=0, step=1), sort = None

    @final
    def union(self, other, sort=None):
        """
        Form the union of two Index objects.

        If the Index objects are incompatible, both Index objects will be
        cast to dtype('object') first.

            .. versionchanged:: 0.25.0

        Parameters
        ----------
        other : Index or array-like
        sort : bool or None, default None
            Whether to sort the resulting Index.

            * None : Sort the result, except when

              1. `self` and `other` are equal.
              2. `self` or `other` has length 0.
              3. Some values in `self` or `other` cannot be compared.
                 A RuntimeWarning is issued in this case.

            * False : do not sort the result.

            .. versionadded:: 0.24.0

            .. versionchanged:: 0.24.1

               Changed the default value from ``True`` to ``None``
               (without change in behaviour).

        Returns
        -------
        union : Index

        Examples
        --------
        Union matching dtypes

        >>> idx1 = pd.Index([1, 2, 3, 4])
        >>> idx2 = pd.Index([3, 4, 5, 6])
        >>> idx1.union(idx2)
        Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')

        Union mismatched dtypes

        >>> idx1 = pd.Index(['a', 'b', 'c', 'd'])
        >>> idx2 = pd.Index([1, 2, 3, 4])
        >>> idx1.union(idx2)
        Index(['a', 'b', 'c', 'd', 1, 2, 3, 4], dtype='object')

        MultiIndex case

        >>> idx1 = pd.MultiIndex.from_arrays(
        ...     [[1, 1, 2, 2], ["Red", "Blue", "Red", "Blue"]]
        ... )
        >>> idx1
        MultiIndex([(1, 'Red'),
            (1, 'Blue'),
            (2, 'Red'),
            (2, 'Blue')],
           )
        >>> idx2 = pd.MultiIndex.from_arrays(
        ...     [[3, 3, 2, 2], ["Red", "Green", "Red", "Green"]]
        ... )
        >>> idx2
        MultiIndex([(3, 'Red'),
            (3, 'Green'),
            (2, 'Red'),
            (2, 'Green')],
           )
        >>> idx1.union(idx2)
        MultiIndex([(1, 'Blue'),
            (1, 'Red'),
            (2, 'Blue'),
            (2, 'Green'),
            (2, 'Red'),
            (3, 'Green'),
            (3, 'Red')],
           )
        >>> idx1.union(idx2, sort=False)
        MultiIndex([(1, 'Red'),
            (1, 'Blue'),
            (2, 'Red'),
            (2, 'Blue'),
            (3, 'Red'),
            (3, 'Green'),
            (2, 'Green')],
           )
        """
        self._validate_sort_keyword(sort)
        self._assert_can_do_setop(other)
        other, result_name = self._convert_can_do_setop(other)

        if not is_dtype_equal(self.dtype, other.dtype):
            if isinstance(self, ABCMultiIndex) and not is_object_dtype(
                unpack_nested_dtype(other)
            ):
>               raise NotImplementedError(
                    "Can only union MultiIndex with MultiIndex or Index of tuples, "
                    "try mi.to_flat_index().union(other) instead."
                )
E               NotImplementedError: Can only union MultiIndex with MultiIndex or Index of tuples, try mi.to_flat_index().union(other) instead.

/usr/share/miniconda/envs/test/lib/python3.8/site-packages/pandas/core/indexes/base.py:2929: NotImplementedError
```
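
For readers who want to poke at the failing step in isolation, here is a minimal sketch of the workaround that the error message itself suggests (`mi.to_flat_index().union(other)`). The two index values are copied from the `self` and `other` lines of the traceback; the variable names `mi`, `empty`, and `flat_union` are only illustrative, and this is not the fix that was applied in pandas or dask.

```python
import pandas as pd

# Index values copied from the traceback: a three-row MultiIndex (self)
# and an empty RangeIndex (other).
mi = pd.MultiIndex.from_tuples(
    [(1, 0), (2, 0), (3, 0)], names=["primary", "__KTK_HASH_BUCKET"]
)
empty = pd.RangeIndex(start=0, stop=0, step=1)

# Workaround quoted in the error message: flatten the MultiIndex to an
# Index of tuples first, then take the union with the other index.
flat_union = mi.to_flat_index().union(empty)

# The three original tuples should survive the union with the empty index.
print(list(flat_union))
```

Note that the flattened result no longer carries the level names, so this is only useful for inspecting the union step, not as a drop-in replacement inside `pd.concat`.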
xhochy commented 3 years ago

Reported to dask: https://github.com/dask/dask/issues/7610

mlondschien commented 3 years ago

FYI this was fixed in https://github.com/pandas-dev/pandas/pull/41275.