coiled / etl-tpch


First local runthrough #5

Open mrocklin opened 7 months ago

mrocklin commented 7 months ago

I understand that this is very early and not yet ready for prime-time, but I tried running through this locally and had some issues:

  File "/Users/mrocklin/workspace/etl-tpch/pipeline/resize.py", line 28, in repartition_table
    df.to_parquet(outdir, compression="snappy", name_function=name)
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask_expr/_collection.py", line 2154, in to_parquet
    return to_parquet(self, path, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask_expr/io/parquet.py", line 383, in to_parquet
    out = out.compute(**compute_kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask_expr/_collection.py", line 366, in compute
    return DaskMethodsMixin.compute(out, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/base.py", line 377, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/base.py", line 663, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py", line 97, in __call__
    return read_parquet_part(
           ^^^^^^^^^^^^^^^^^^
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py", line 645, in read_parquet_part
    dfs = [
          ^
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py", line 646, in <listcomp>
    func(
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py", line 640, in read_partition
    arrow_table = cls._read_table(
                  ^^^^^^^^^^^^^^^^
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py", line 1773, in _read_table
    arrow_table = _read_table_from_path(
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py", line 263, in _read_table_from_path
    return pq.ParquetFile(fil, **pre_buffer).read(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 318, in __init__
    self.reader.open(
  File "pyarrow/_parquet.pyx", line 1470, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
convert_to_parquet-96daff616b2cee173abf09be2e6d04d4 ValueError('Unmatched \'\'"\' when when decoding \'string\'')     File "/Users/mrocklin/workspace/etl-tpch/pipeline/preprocess.py", line 29, in convert_to_parquet\n    df = pd.read_json(file, compression="zstd")\n         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 815, in read_json\n    return json_reader.read()\n           ^^^^^^^^^^^^^^^^^^\n  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1025, in read\n    obj = self._get_object_parser(self.data)\n          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1051, in _get_object_parser\n    obj = FrameParser(json, **kwargs).parse()\n          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1187, in parse\n    self._parse()\n  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1400, in _parse\n    ujson_loads(json, precise_float=self.precise_float), dtype=None\n    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n    tcp://127.0.0.1:52359   1
convert_to_parquet-b3c6b4c0da0395cda7932836ed79b155 ValueError("No ':' found when decoding object value")     File "/Users/mrocklin/workspace/etl-tpch/pipeline/preprocess.py", line 29, in convert_to_parquet\n    df = pd.read_json(file, compression="zstd")\n         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 815, in read_json\n    return json_reader.read()\n           ^^^^^^^^^^^^^^^^^^\n  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1025, in read\n    obj = self._get_object_parser(self.data)\n          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1051, in _get_object_parser\n    obj = FrameParser(json, **kwargs).parse()\n          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1187, in parse\n    self._parse()\n  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1400, in _parse\n    ujson_loads(json, precise_float=self.precise_float), dtype=None\n    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n    tcp://127.0.0.1:52212   1

As expected of course. If I have some time this weekend I may poke around a little.

jrbourbeau commented 7 months ago

Thanks for trying this out @mrocklin. I appreciate the feedback.

Totally agree the intervals need to be adjusted (they're still in "run quickly so I can debug fast" mode).

The parquet error definitely looks strange. I've not seen it locally (at least not yet). I'll take a look on Monday.
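For anyone who wants to narrow down which files trigger the "Parquet magic bytes not found in footer" error, here is a minimal sketch. It relies only on the fact that an unencrypted Parquet file starts and ends with the 4-byte marker `PAR1`. The helper names `has_parquet_magic` and `find_suspect_files` are hypothetical and not part of this repo, and passing this check does not prove a file is otherwise valid.

```python
import os

# 4-byte marker at the start and end of every unencrypted Parquet file.
PARQUET_MAGIC = b"PAR1"


def has_parquet_magic(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    # Header magic + 4-byte footer length + footer magic is already 12 bytes,
    # so anything shorter cannot be a complete Parquet file.
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def find_suspect_files(directory):
    """List *.parquet files in `directory` that fail the magic-bytes check."""
    return sorted(
        name
        for name in os.listdir(directory)
        if name.endswith(".parquet")
        and not has_parquet_magic(os.path.join(directory, name))
    )
```

Running `find_suspect_files` on the directory being read should list any files that are truncated or still being written, which is one plausible (but unconfirmed) way to hit this error.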

jrbourbeau commented 7 months ago

The data volume was too high for my Mac.

This was handled in https://github.com/coiled/etl-tpch/pull/7