dask / fastparquet

Python implementation of the Parquet columnar file format.
Apache License 2.0

Cannot write simple dataframe to disk in thrift 0.11.0 #280

Closed bschreck closed 6 years ago

bschreck commented 6 years ago

Is there something simple I'm missing here? I'm just trying to do the most basic thing in the example:

import numpy as np
import pandas as pd
from fastparquet import write

df = pd.DataFrame(np.zeros((1000, 1000)), columns=[str(i) for i in range(1000)])
write('outfile2.parq', df)
write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000],
      compression='GZIP', file_scheme='hive')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-5b2fbc3e1a9e> in <module>()
      1 write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000],
----> 2       compression='GZIP', file_scheme='hive')
      3

~/miniconda3/envs/py3default/lib/python3.6/site-packages/fastparquet/writer.py in write(filename, data, row_group_offsets, compression, file_scheme, open_with, mkdirs, has_nulls, write_index, partition_on, fixed_text, append, object_encoding, times)
    831                 with open_with(partname, 'wb') as f2:
    832                     rg = make_part_file(f2, data[start:end], fmd.schema,
--> 833                                         compression=compression, fmd=fmd)
    834                 for chunk in rg.columns:
    835                     chunk.file_path = part

~/miniconda3/envs/py3default/lib/python3.6/site-packages/fastparquet/writer.py in make_part_file(f, data, schema, compression, fmd)
    604     with f as f:
    605         f.write(MARKER)
--> 606         rg = make_row_group(f, data, schema, compression=compression)
    607         if fmd is None:
    608             fmd = parquet_thrift.FileMetaData(num_rows=len(data),

~/miniconda3/envs/py3default/lib/python3.6/site-packages/fastparquet/writer.py in make_row_group(f, data, schema, compression)
    592                 comp = compression
    593             chunk = write_column(f, data[column.name], column,
--> 594                                  compression=comp)
    595             rg.columns.append(chunk)
    596     rg.total_byte_size = sum([c.meta_data.total_uncompressed_size for c in

~/miniconda3/envs/py3default/lib/python3.6/site-packages/fastparquet/writer.py in write_column(f, data, selement, compression)
    532                                    data_page_header=dph, crc=None)
    533
--> 534     write_thrift(f, ph)
    535     f.write(bdata)
    536

~/miniconda3/envs/py3default/lib/python3.6/site-packages/fastparquet/thrift_structures.py in write_thrift(fobj, thrift)
     47     pout = TCompactProtocol(fobj)
     48     try:
---> 49         thrift.write(pout)
     50         fail = False
     51     except TProtocolException as e:

~/miniconda3/envs/py3default/lib/python3.6/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py in write(self, oprot)
   1027     def write(self, oprot):
   1028         if oprot._fast_encode is not None and self.thrift_spec is not None:
-> 1029             oprot.trans.write(oprot._fast_encode(self, (self.__class__, self.thrift_spec)))
   1030             return
   1031         oprot.writeStructBegin('PageHeader')

TypeError: expecting list of size 2 for struct args

Same error on my local Mac and on a remote EC2 Ubuntu 16.04 instance.

martindurant commented 6 years ago

row_group_offsets=[0, 10000, 20000] - your data is only 1000 rows long, so these offsets do not make sense. The traceback is not very helpful, I agree...
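The constraint martindurant describes — every offset must fall inside the data — can be checked before calling `write`. A small helper (hypothetical, not part of the fastparquet API) that derives in-range offsets for a frame of a given length:

```python
def row_group_offsets_for(n_rows, group_size):
    """Return row-group start offsets that all fall inside the data.

    Offsets outside [0, n_rows), like 10000 for a 1000-row frame,
    are what made the call above nonsensical.
    """
    if n_rows <= 0 or group_size <= 0:
        raise ValueError("n_rows and group_size must be positive")
    return list(range(0, n_rows, group_size))

# For the 1000-row example frame, groups of 250 rows:
print(row_group_offsets_for(1000, 250))  # [0, 250, 500, 750]
```

The resulting list could then be passed as `row_group_offsets` instead of the hard-coded `[0, 10000, 20000]`.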

bschreck commented 6 years ago

But it also doesn't work without any arguments, or with 10,000 rows. Same error.

martindurant commented 6 years ago

It does appear to work for me without arguments - can you please try with the master version of fastparquet? (e.g., pip install git+https://github.com/dask/fastparquet)

bschreck commented 6 years ago

Same error (used a fresh virtualenv too).

I'm on macOS High Sierra, using Python 3.6.4.

~/miniconda3/envs/d3m_new_new/lib/python3.6/site-packages/fastparquet/writer.py in write(filename, data, row_group_offsets, compression, file_scheme, open_with, mkdirs, has_nulls, write_index, partition_on, fixed_text, append, object_encoding, times)
    807     if file_scheme == 'simple':
    808         write_simple(filename, data, fmd, row_group_offsets,
--> 809                      compression, open_with, has_nulls, append)
    810     elif file_scheme in ['hive', 'drill']:
    811         if append:

~/miniconda3/envs/d3m_new_new/lib/python3.6/site-packages/fastparquet/writer.py in write_simple(fn, data, fmd, row_group_offsets, compression, open_with, has_nulls, append)
    704                    else None)
    705             rg = make_row_group(f, data[start:end], fmd.schema,
--> 706                                 compression=compression)
    707             if rg is not None:
    708                 fmd.row_groups.append(rg)

~/miniconda3/envs/d3m_new_new/lib/python3.6/site-packages/fastparquet/writer.py in make_row_group(f, data, schema, compression)
    601                 comp = compression
    602             chunk = write_column(f, data[column.name], column,
--> 603                                  compression=comp)
    604             rg.columns.append(chunk)
    605     rg.total_byte_size = sum([c.meta_data.total_uncompressed_size for c in

~/miniconda3/envs/d3m_new_new/lib/python3.6/site-packages/fastparquet/writer.py in write_column(f, data, selement, compression)
    541                                    data_page_header=dph, crc=None)
    542
--> 543     write_thrift(f, ph)
    544     f.write(bdata)
    545

~/miniconda3/envs/d3m_new_new/lib/python3.6/site-packages/fastparquet/thrift_structures.py in write_thrift(fobj, thrift)
     49     pout = TCompactProtocol(fobj)
     50     try:
---> 51         thrift.write(pout)
     52         fail = False
     53     except TProtocolException as e:

~/miniconda3/envs/d3m_new_new/lib/python3.6/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py in write(self, oprot)
   1084     def write(self, oprot):
   1085         if oprot._fast_encode is not None and self.thrift_spec is not None:
-> 1086             oprot.trans.write(oprot._fast_encode(self, (self.__class__, self.thrift_spec)))
   1087             return
   1088         oprot.writeStructBegin('PageHeader')

TypeError: expecting list of size 2 for struct args
martindurant commented 6 years ago

Are you using thrift version 0.10.0?

bschreck commented 6 years ago

0.11.0

That does seem to be the issue. Installing 0.10.0 fixes it. Maybe update your requirements to force 0.10.0 exactly?
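A quick way to confirm which thrift a given environment resolves is to query the installed distribution. This is a sketch using `pkg_resources` (bundled with setuptools); the printed version string will differ per environment:

```python
import pkg_resources  # ships with setuptools

def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return pkg_resources.get_distribution(dist_name).version
    except pkg_resources.DistributionNotFound:
        return None

# In the failing environment this reported '0.11.0'; '0.10.0' works.
print(installed_version("thrift"))
```

An exact pin in the install requirements (e.g. `thrift==0.10.0`) would then force the working version, as suggested above.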

martindurant commented 6 years ago

That (thrift 0.11.0) was released on PyPI on 2018-01-11, and is not available on conda yet. Would you mind trying with v0.10.0?

bschreck commented 6 years ago

Yeah, it works with 0.10.0.

martindurant commented 6 years ago

cc @mariusvniekerk

martindurant commented 6 years ago

@bschreck, thanks for noticing and reporting this issue. It seems that after a fix, we had better release a new version of fastparquet, hopefully soon after the new thrift conda package arrives.

martindurant commented 6 years ago

https://github.com/conda-forge/thrift-feedstock/pull/8

mariusvniekerk commented 6 years ago

https://github.com/conda-forge/thrift-cpp-feedstock/pull/15