dask / fastparquet

Python implementation of the Parquet columnar file format.
Apache License 2.0

Test failure test_frame_write_read_verify #517

Open FRidh opened 4 years ago

FRidh commented 4 years ago

Test failure with 0.4.1 (and 0.4.0) cloned from this repo, with Python 3.8.

=================================== FAILURES ===================================
_ test_frame_write_read_verify[input_symbols8-10-hive-2-partitions8-filters8] __

tempdir = '/build/tmpighy8d7p', input_symbols = ['NOW', 'SPY', 'VIX']
input_days = 10, file_scheme = 'hive', input_columns = 2
partitions = ['symbol', 'dtTrade']
filters = [('dtTrade', '==', '2005-01-02T00:00:00.000000000')]

    @pytest.mark.parametrize('input_symbols,input_days,file_scheme,input_columns,'
                             'partitions,filters',
                             [
                                 (['NOW', 'SPY', 'VIX'], 2 * 252, 'hive', 2,
                                  ['symbol', 'year'], [('symbol', '==', 'SPY')]),
                                 (['now', 'SPY', 'VIX'], 2 * 252, 'hive', 2,
                                  ['symbol', 'year'], [('symbol', '==', 'SPY')]),
                                 (['TODAY', 'SPY', 'VIX'], 2 * 252, 'hive', 2,
                                  ['symbol', 'year'], [('symbol', '==', 'SPY')]),
                                 (['VIX*', 'SPY', 'VIX'], 2 * 252, 'hive', 2,
                                  ['symbol', 'year'], [('symbol', '==', 'SPY')]),
                                 (['QQQ*', 'SPY', 'VIX'], 2 * 252, 'hive', 2,
                                  ['symbol', 'year'], [('symbol', '==', 'SPY')]),
                                 (['QQQ!', 'SPY', 'VIX'], 2 * 252, 'hive', 2,
                                  ['symbol', 'year'], [('symbol', '==', 'SPY')]),
                                 (['Q%QQ', 'SPY', 'VIX'], 2 * 252, 'hive', 2,
                                  ['symbol', 'year'], [('symbol', '==', 'SPY')]),
                                 (['NOW', 'SPY', 'VIX'], 10, 'hive', 2,
                                  ['symbol', 'dtTrade'], [('symbol', '==', 'SPY')]),
                                 (['NOW', 'SPY', 'VIX'], 10, 'hive', 2,
                                  ['symbol', 'dtTrade'],
                                  [('dtTrade', '==',
                                    '2005-01-02T00:00:00.000000000')]),
                                 (['NOW', 'SPY', 'VIX'], 10, 'hive', 2,
                                  ['symbol', 'dtTrade'],
                                  [('dtTrade', '==',
                                    Timestamp('2005-01-01 00:00:00'))]),
                             ]
                             )
    def test_frame_write_read_verify(tempdir, input_symbols, input_days,
                                     file_scheme,
                                     input_columns, partitions, filters):
        if os.name == 'nt':
            pytest.xfail("Partitioning folder names contain special characters which are not supported on Windows")

        # Generate Temp Director for parquet Files
        fdir = str(tempdir)
        fname = os.path.join(fdir, 'test')

        # Generate Test Input Frame
        input_df = frame_symbol_dtTrade_type_strike(days=input_days,
                                                    symbols=input_symbols,
                                                    numbercolumns=input_columns)
        input_df.reset_index(inplace=True)
        write(fname, input_df, partition_on=partitions, file_scheme=file_scheme,
              compression='SNAPPY')

        # Read Back Whole Parquet Structure
        output_df = ParquetFile(fname).to_pandas()
        for col in output_df.columns:
            assert col in input_df.columns.values
        assert len(input_df) == len(output_df)

        # Read with filters
        filtered_output_df = ParquetFile(fname).to_pandas(filters=filters)

        # Filter Input Frame to Match What Should Be Expected from parquet read
        # Handle either string or non-string inputs / works for timestamps
        filterStrings = []
        for name, operator, value in filters:
            if isinstance(value, str):
                value = "'{}'".format(value)
            else:
                value = value.__repr__()
            filterStrings.append("{} {} {}".format(name, operator, value))
        filters_expression = " and ".join(filterStrings)
        filtered_input_df = input_df.query(filters_expression)

        # Check to Ensure Columns Match
        for col in filtered_output_df.columns:
            assert col in filtered_input_df.columns.values
        # Check to Ensure Number of Rows Match
>       assert len(filtered_input_df) == len(filtered_output_df)
E       assert 3 == 0
E         +3
E         -0

fastparquet/test/test_partition_filters_specialstrings.py:109: AssertionError

Environment:

Build/test/run-time dependencies:

$ nix show-derivation -f . python3.pkgs.fastparquet | jq  '.[].inputDrvs | keys'
[
  "/nix/store/0fl8gz98vq7k0xpphn0ayx36illf7v8c-python-remove-tests-dir-hook.drv",
  "/nix/store/11vhhyvc6cz433snizyqdkpg7k2q5zkf-python3.8-pytest-runner-5.2.drv",
  "/nix/store/1a661nb0dli97gw6qy50msp85ll680rz-python-imports-check-hook.sh.drv",
  "/nix/store/22c15w3md8d2jdi7awb2k50392by8x6g-python3.8-thrift-0.13.0.drv",
  "/nix/store/29r49aa4sz6hypb3gv5sdw330vj2j2ii-python3.8-numpy-1.19.1.drv",
  "/nix/store/377gwr2f2il0mi2kmq0yah2knhsyhsd5-hook.drv",
  "/nix/store/3h7k0zvr8psgmz4nyh17z1isjsj7px72-pip-install-hook.drv",
  "/nix/store/3vgc68qbg9c5qhb18xc41ihaqw0bng6l-python3.8-setuptools-47.3.1.drv",
  "/nix/store/4qry96ap0kpkjwjlsyc8p3m3hh6pg5pv-bash-4.4-p23.drv",
  "/nix/store/5y6w15gqfhiiw3v79ybqsai55c48k88p-python3.8-zstd-1.4.5.1.drv",
  "/nix/store/7r9z46n4rccnzdr3l3nxz1qvnsc6gcbz-setuptools-check-hook.drv",
  "/nix/store/7ryff7q11maypkrqg0k4hpj57m7xb5sw-python3.8-pandas-1.1.1.drv",
  "/nix/store/80pzh07z7qxq1j6v4bnj1qmrv9arwjmj-python3.8-pytest-5.4.3.drv",
  "/nix/store/877v1y795mz5qa2mji8mrrm6an7ryif8-python3.8-numba-0.51.1.drv",
  "/nix/store/9jp75f9q5spp2wwyml63yf7lkciqz4cr-source.drv",
  "/nix/store/caad1plf2ddqrjrmhvmraaksdcmhcn0q-python-catch-conflicts-hook.drv",
  "/nix/store/czk62c3arggf1w17nmxcgnxjslx9qxz6-python-remove-bin-bytecode-hook.drv",
  "/nix/store/myrlr2xv6zwmwm634frd01rirjxk1a40-python3.8-python-lz4-2.1.10.drv",
  "/nix/store/n0w17xq75lr9vx6qiw28097ymrifvkl0-python-recompile-bytecode-hook.drv",
  "/nix/store/nizihiiy8gcwn61sfd538vq0bf3ll5ll-stdenv-linux.drv",
  "/nix/store/q1q5dsc3pcx10clb38gyrbrgivl47kl8-python3-3.8.5.drv",
  "/nix/store/rhl55hlw72qc7a8qz82xp28xs1kq69qm-hook.drv",
  "/nix/store/sjg4vq1gjzipd76zzijxqq04bzlz2iqp-python3.8-python-snappy-0.5.4.drv",
  "/nix/store/vfrswqlwnpz9shla74fv6irdncngj8h5-python-namespaces-hook.sh.drv",
  "/nix/store/vk9rkwnkmgn9knnwbvxwjbzrxi45s965-setuptools-setup-hook.drv"
]
risicle commented 3 years ago

Working with a similar setup myself, the critical component here seems to be using pandas > 1.0.5.

martindurant commented 3 years ago

the critical component here seems to be using pandas > 1.0.5.

Do you know why this makes a difference? Perhaps we should drop the use of query in favour of an explicit expression.
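
For reference, here is a minimal sketch of what an explicit filter could look like in the test, replacing the query() string entirely (the apply_filters helper and operator mapping below are hypothetical, not fastparquet or test code):

import operator
import pandas as pd

# Hypothetical helper: apply (column, op, value) filters as an explicit
# boolean mask instead of building a string for DataFrame.query().
_OPS = {'==': operator.eq, '!=': operator.ne,
        '<': operator.lt, '<=': operator.le,
        '>': operator.gt, '>=': operator.ge}

def apply_filters(df, filters):
    mask = pd.Series(True, index=df.index)
    for name, op, value in filters:
        mask &= _OPS[op](df[name], value)
    return df[mask]

# e.g. apply_filters(input_df, [('dtTrade', '==', pd.Timestamp('2005-01-02'))])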

risicle commented 3 years ago

Sorry, I don't. I just know that, given two setups identical apart from the pandas version, it throws this error for me with pandas >= 1.1.0.

martindurant commented 3 years ago

Hm, I have pandas 1.1.0, and it still passes for me locally :|

veprbl commented 3 years ago

Bisecting nixpkgs points at https://github.com/NixOS/nixpkgs/commit/2dafde493f153dba0eb4b34cd49763ee78eda3d9 as the first bad commit.

risicle commented 3 years ago

Indeed, if you simply revert pandas back to that prior version on an otherwise unmodified master, the error reoccurs.

@martindurant if you have Nix installed, we can guide you to a reproducible installation that demonstrates this.

martindurant commented 3 years ago

@TomAugspurger , in case you are bored and fancy tracing a pandas thing

TomAugspurger commented 3 years ago

Nothing comes to mind immediately, and I won't have time to debug this short-term.

veprbl commented 3 years ago

It seems like the difference occurs in the generation of the file paths: https://github.com/dask/fastparquet/blob/a8cb8d1a28eb2db4ada233052cbc01bf815c2551/fastparquet/writer.py#L952-L971

There is a difference in the behaviour of groupby for a multi-index; it can be seen in the following example:

import numpy as np
import pandas as pd
print(pd.DataFrame([(np.datetime64("2020-01-01"), 12345)]).groupby([0, 1]).indices)

In the previous version it used to preserve the type:

# nix-shell -p python3Packages.pandas -p python3Packages.numpy -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/69cb94ebb3193fc5077ee99ab2b50353151466ae.tar.gz --run 'python3 -c "import numpy as np; import pandas as pd; print(pd.__version__); print(pd.DataFrame([(np.datetime64(\"2020-01-01\"), 12345)]).groupby([0, 1]).indices)"'
1.0.5
{(numpy.datetime64('2020-01-01T00:00:00.000000000'), 12345): array([0])}

but it now performs a conversion:

# nix-shell -p python3Packages.pandas -p python3Packages.numpy -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/2dafde493f153dba0eb4b34cd49763ee78eda3d9.tar.gz --run 'python3 -c "import numpy as np; import pandas as pd; print(pd.__version__); print(pd.DataFrame([(np.datetime64(\"2020-01-01\"), 12345)]).groupby([0, 1]).indices)"'
1.1.0
{(Timestamp('2020-01-01 00:00:00'), 12345): array([0])}
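
Presumably this matters because the partition directory names come from stringifying the group keys, and the two key types render differently (illustration only):

import numpy as np
import pandas as pd

# ns-precision datetime64 keys render with a 'T' and nanoseconds,
# Timestamp keys render with a space and no fractional part.
print(str(np.datetime64('2020-01-01T00:00:00.000000000')))  # 2020-01-01T00:00:00.000000000
print(str(pd.Timestamp('2020-01-01 00:00:00')))             # 2020-01-01 00:00:00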

Hm, I have pandas 1.1.0, and it still passes for me locally :|

@martindurant That might be because you changed the compared value as part of a8cb8d1a28eb2db4ada233052cbc01bf815c2551. That should have broken the test on older pandas versions such as 1.0.5.

martindurant commented 3 years ago

Hm, reflexive coding. We could put a pandas version-dependent block in the test, then. This is already a longer thread than I thought this change would cause!

martindurant commented 3 years ago

OK, so changing the test matrix element to

                              [('dtTrade', '==',
                                Timestamp('2005-01-02 00:00:00'))]),

should fix it! I see this was already done for another element. The comparison with Timestamp should cast the value whether it's a string or a numpy value.
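
As a quick sanity check (not part of the test), the Timestamp comparison does coerce both forms:

import numpy as np
import pandas as pd

# Timestamp equality coerces strings and numpy datetimes before comparing.
ts = pd.Timestamp('2005-01-02 00:00:00')
print(ts == '2005-01-02T00:00:00.000000000')  # True
print(ts == np.datetime64('2005-01-02'))      # True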

veprbl commented 3 years ago

I had some random thoughts on the issue:

The names of the partitioning directories in the "hive" scheme were changed because the dates were rendered to strings with the type's default format. Would that be an issue?

Also, it seems like pandas has some aversion to storing np.datetime64 in the index, so it appears that the behaviour in 1.1.0 is not a bug.
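
The index point is easy to see; pandas boxes datetime64 values as Timestamp as soon as they end up in an Index (illustration):

import numpy as np
import pandas as pd

# A datetime64 placed in an Index comes back out as a Timestamp.
idx = pd.Index([np.datetime64('2020-01-01')])
print(type(idx[0]))  # <class 'pandas._libs.tslibs.timestamps.Timestamp'>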

martindurant commented 3 years ago

The names of the partitioning directories in the "hive" scheme were changed because the dates were rendered to strings with the type's default format

Correct, we think this is what's going on.

it appears that the behaviour in 1.1.0 is not a bug

Well, it's a change in behaviour, hence the problem for us. Perhaps wrapping the expected value in Timestamp solves this for all cases.
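
For what it's worth, a minimal sketch of that normalisation, assuming it would live in the test's filter-string loop (the helper name is hypothetical):

import pandas as pd

# Hypothetical normalisation: wrap anything datetime-like in Timestamp so the
# query expression compares like with like on either side of the pandas change.
# A real version would need a type check so plain strings/ints are left alone.
def normalise_filter_value(value):
    try:
        return pd.Timestamp(value)
    except (TypeError, ValueError):
        return value

print(repr(normalise_filter_value('2005-01-02T00:00:00.000000000')))  # Timestamp('2005-01-02 00:00:00')
print(repr(normalise_filter_value('SPY')))                            # 'SPY'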