apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++] Improve error message when reading Streaming file with File reader and vice versa #33823

Open domoritz opened 1 year ago

domoritz commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

I am trying to load an Arrow file

import pyarrow as pa

with pa.memory_map('flights-200k.arrow', 'r') as source:
    my_arrow = pa.ipc.open_file(source).read_all()

but get this error

  File "/opt/homebrew/Caskroom/miniforge/base/envs/ramsch/lib/python3.10/site-packages/pyarrow/ipc.py", line 228, in open_file
    return RecordBatchFileReader(
  File "/opt/homebrew/Caskroom/miniforge/base/envs/ramsch/lib/python3.10/site-packages/pyarrow/ipc.py", line 110, in __init__
    self._open(source, footer_offset=footer_offset,
  File "pyarrow/ipc.pxi", line 862, in pyarrow.lib._RecordBatchFileReader._open
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Not an Arrow file

The arrow file is https://github.com/uwdata/flights-arrow/blob/master/flights-200k.arrow and loads fine in the arrow js library.

Component(s)

Python

domoritz commented 1 year ago

The issue is that I need to use open_stream. The error message should be better.

vibhatha commented 1 year ago

I am not sure if this is related, but I had a similar experience when I mistakenly wrote files without closing the file writer. In your case, since the file loads properly in JS, this could be an entirely different scenario, but I thought it was worth mentioning here.

jorisvandenbossche commented 1 year ago

Small reproducer without having to download a file:

import pyarrow as pa

batch = pa.record_batch([pa.array([1, 2, 3])], ['a'])

# Create an Arrow Stream file
with pa.ipc.new_stream("test.arrows", batch.schema) as writer:
    writer.write(batch)

# Read as Arrow File
pa.ipc.open_file("test.arrows")
# -> ... ArrowInvalid: Not an Arrow file

I agree it would be nice if we could give a more informative error message and hint to the user that they are reading an Arrow Streaming format file and not an Arrow File format file.

jorisvandenbossche commented 1 year ago

Similarly, reading a File with a Streaming reader also gives a non-informative error message:

with pa.ipc.new_file("test.arrow", batch.schema) as writer:
    writer.write(batch)

pa.ipc.open_stream("test.arrow")
# ... ArrowInvalid: Expected to read 1330795073 metadata bytes, but only read 486
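
The odd byte count in that message is arguably a clue in itself. The File format begins with the ASCII magic string `ARROW1`, and decoding its first four bytes as a little-endian 32-bit length gives exactly this number. The check below uses only the standard library; the explanation is an inference from the magic string, not something confirmed from the reader's source in this thread.

```python
import struct

# First four bytes of the File format magic string "ARROW1".
prefix = b"ARROW1"[:4]

# Decode them the way a 32-bit little-endian metadata length prefix
# would be decoded (assumption: this mirrors what the stream reader does).
(length,) = struct.unpack("<i", prefix)
print(length)  # 1330795073
```

So a metadata length of 1330795073 likely means the input starts with `ARRO`, i.e. it is in the File format.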

westonpace commented 1 year ago

Files written in the file format have a magic number at both the start and the end of the data. The error message "Not an Arrow file" is thrown when that magic number is wrong. So we already detect this situation; we just need to be proactive about suggesting solutions or alternatives (e.g. "Not an Arrow file, perhaps this is in the streaming format?"). This should be very doable.
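
A minimal sketch of that idea in Python, to illustrate the check rather than the actual C++ change. The helper names `sniff_ipc_format` and `describe_mismatch` are hypothetical and not part of pyarrow. It assumes the magic string is `ARROW1` at both ends of a File format file, and that a stream starts with the `0xFFFFFFFF` continuation marker; older streams written without that marker would be reported as unknown.

```python
import io

ARROW_MAGIC = b"ARROW1"             # magic string at both ends of a File format file
CONTINUATION = b"\xff\xff\xff\xff"  # marker assumed to start each stream message


def sniff_ipc_format(source):
    """Guess whether a seekable binary file object holds 'file' or 'stream' data.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    source.seek(0, io.SEEK_END)
    size = source.tell()
    source.seek(0)
    head = source.read(len(ARROW_MAGIC))
    tail = b""
    if size >= len(ARROW_MAGIC):
        source.seek(size - len(ARROW_MAGIC))
        tail = source.read(len(ARROW_MAGIC))
    source.seek(0)  # rewind so a real reader can start from the beginning

    if head == ARROW_MAGIC and tail == ARROW_MAGIC:
        return "file"
    if head[:4] == CONTINUATION:
        return "stream"
    return "unknown"


def describe_mismatch(source):
    """Build the file reader's error text, adding a hint when one applies."""
    if sniff_ipc_format(source) == "stream":
        return "Not an Arrow file, perhaps this is in the streaming format?"
    return "Not an Arrow file"
```

In user code, the result of `sniff_ipc_format` could also decide whether to call `pa.ipc.open_file` or `pa.ipc.open_stream` on the same source.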

domoritz commented 1 year ago

I am an arrow committer and got totally thrown off by the error message and thought my file was corrupt. So yes, your suggested error message sounds great.

jalajk24 commented 1 year ago

@domoritz I would like to contribute to this project. Can you assign this issue to me?

domoritz commented 1 year ago

I assigned it to you. Please send a pull request soon.

pbaner16 commented 10 months ago

Hello @domoritz, has this issue been fixed? If not, I can contribute!