apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

OSError: Invalid IPC stream: negative continuation token #28475

Open asfimport opened 3 years ago

asfimport commented 3 years ago

pyarrow 4.0.0

 


 
File "/home/ftdb/anaconda3/lib/python3.8/site-packages/pandas/io/feather_format.py", line 127, in read_feather
 84 def read_feather(
 85 path, columns=None, use_threads: bool = True, storage_options: StorageOptions = None
 86 ):
 (...)
 123 with get_handle(
 124 path, "rb", storage_options=storage_options, is_text=False
 125 ) as handles:
 126 
--> 127 return feather.read_feather(
 128 handles.handle, columns=columns, use_threads=bool(use_threads)
 ..................................................
 path = PosixPath('/home/ftdb/data/jobs1.feather.011')
 columns = None
 use_threads = True
 storage_options = None
 StorageOptions = typing.Union[typing.Dict[str, typing.Any], NoneType]
 handles = IOHandles(handle=<_io.BufferedReader name='/home/ftdb/data/j
 obs1.feather.011'>, compression={'method': None}, created_ha
 ndles=[], is_wrapped=False, is_mmap=False)
 feather.read_feather = <function 'read_feather' feather.py:195>
 handles.handle = <_io.BufferedReader name='/home/ftdb/data/jobs1.feather.011'
 >
 ..................................................
File "/home/ftdb/anaconda3/lib/python3.8/site-packages/pyarrow/feather.py", line 216, in read_feather
 195 def read_feather(source, columns=None, use_threads=True, memory_map=True):
 (...)
 212 -------
 213 df : pandas.DataFrame
 214 """
 215 _check_pandas_version()
--> 216 return (read_table(source, columns=columns, memory_map=memory_map)
 217 .to_pandas(use_threads=use_threads))
 ..................................................
 source = <_io.BufferedReader name='/home/ftdb/data/jobs1.feather.011'
 >
 columns = None
 use_threads = True
 memory_map = True
 ..................................................
File "/home/ftdb/anaconda3/lib/python3.8/site-packages/pyarrow/feather.py", line 241, in read_table
 220 def read_table(source, columns=None, memory_map=True):
 (...)
 237 reader = ext.FeatherReader()
 238 reader.open(source, use_memory_map=memory_map)
 239 
 240 if columns is None:
--> 241 return reader.read()
 242 
 ..................................................
 source = <_io.BufferedReader name='/home/ftdb/data/jobs1.feather.011'
 >
 columns = None
 memory_map = True
 reader = <pyarrow.lib.FeatherReader object at 0x7f5f0afc2180>
 ext.FeatherReader = <class 'pyarrow.lib.FeatherReader'>
 ..................................................
File "pyarrow/feather.pxi", line 76, in pyarrow.lib.FeatherReader.read
File "pyarrow/error.pxi", line 112, in pyarrow.lib.check_status
---- (full traceback above) ----
File "/home/ftdb/anaconda3/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3437, in run_code
 exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-4-a3c2240634fd>", line 1, in <module>
 df = read_feather(p)
File "/storage/code/fintechdb/Ftools/ftools/functoolz.py", line 22, in inner
 return func(*args, **kwargs)
File "/storage/code/fintechdb/Ftools/ftools/pathtools.py", line 51, in inner
 return func(**new_kwargs)
File "/storage/code/fintechdb/Ftools/ftools/io.py", line 506, in read_feather
 data = pd.read_feather(path, columns, use_threads=True)
File "/home/ftdb/anaconda3/lib/python3.8/site-packages/pandas/io/feather_format.py", line 127, in read_feather
 return feather.read_feather(
File "/home/ftdb/anaconda3/lib/python3.8/site-packages/pyarrow/feather.py", line 216, in read_feather
 return (read_table(source, columns=columns, memory_map=memory_map)
File "/home/ftdb/anaconda3/lib/python3.8/site-packages/pyarrow/feather.py", line 241, in read_table
 return reader.read()
File "pyarrow/feather.pxi", line 76, in pyarrow.lib.FeatherReader.read
File "pyarrow/error.pxi", line 112, in pyarrow.lib.check_status
OSError: Invalid IPC stream: negative continuation token

 

Reporter: Torstein Sørnes

Note: This issue was originally created as ARROW-12733. Please see the migration documentation for further details.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Hi [~tsoernes], can you explain how you encountered this error? Can you post the file you're trying to read somewhere, and/or explain how it was generated?

asfimport commented 3 years ago

Joris Van den Bossche / @jorisvandenbossche: Hi [~tsoernes], can you provide some more information, and ideally a reproducible example?
(e.g. was the feather file written with the same version of pyarrow or with a previous one? And if a previous one, which version? Did you use compression? What kind of data does the file contain? Could you provide a small script that generates a feather file that reproduces the issue?)

EDIT: whoops, sorry for the duplicate comment asking for more information. JIRA isn't very good at refreshing / indicating there are newer comments if the tab was already open ...

asfimport commented 3 years ago

Torstein Sørnes: @jorisvandenbossche @pitrou

The file is compressed with lz4. It is a pandas dataframe. The code for writing it is:


df.to_feather(path, compression='lz4')

where df is a pandas dataframe.

The file was written with the same versions of pyarrow and pandas that are now being used to read it.

I have successfully written and read hundreds of pandas dataframe arrow files using exactly the same code and library versions. I have no idea why this one in particular fails.

The file is too big to upload here. Does this link work?

https://ml-pull.s3.eu-central-1.amazonaws.com/Jobs/glassdoor/jobs1.feather.011

Cheers, and thanks for your work.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Thank you, I'll take a look.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: [~tsoernes] It seems the file is invalid indeed. Have you tried recreating it from the exact same data?
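One cheap way to spot a truncated or overwritten file before recreating it is to look at the magic bytes. This is a rough sketch, not the check pyarrow itself performs: it assumes only that a Feather V2 (Arrow IPC file format) file begins and ends with `ARROW1`, and that a legacy Feather V1 file begins and ends with `FEA1`. The function name and the returned labels are made up for illustration.

```python
import os

ARROW_MAGIC = b"ARROW1"      # Arrow IPC file format (Feather V2)
FEATHER_V1_MAGIC = b"FEA1"   # legacy Feather V1


def check_feather_magic(path):
    """Classify a file by its leading and trailing magic bytes.

    Returns "looks complete", "trailer missing" or "unrecognized header".
    A "looks complete" result does not prove the file is valid; it only
    rules out the most obvious kinds of truncation.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(6)
        f.seek(max(size - 6, 0))
        tail = f.read(6)

    if head.startswith(ARROW_MAGIC):
        return "looks complete" if tail == ARROW_MAGIC else "trailer missing"
    if head.startswith(FEATHER_V1_MAGIC):
        return "looks complete" if tail.endswith(FEATHER_V1_MAGIC) else "trailer missing"
    return "unrecognized header"
```

A "trailer missing" result would suggest the write was interrupted, in which case regenerating the file from the same data is the practical fix.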