apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Better document some read_csv corner cases #23880

Closed · asfimport closed this issue 4 years ago

asfimport commented 4 years ago

Hi, I have found two problematic cases, possibly bugs, in the pyarrow read_csv module. I have written the following piece of code and run a test on the attached CSV file.

The code compares pandas read_csv with pyarrow csv to show that the latter does not behave correctly with the following sets of parameters:

1. Change the parameter skip_rows = 10:

```
Traceback (most recent call last):
  File "/home/athan/anaconda3/envs/TRIADB/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input>", line 4, in <module>
    read_options=csv.ReadOptions(skip_rows=skip_rows, autogenerate_column_names=False, use_threads=True, column_names=column_names)
  File "pyarrow/_csv.pyx", line 541, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowKeyError: Column 'catcost' in include_columns does not exist in CSV file
```


2. Change the parameters skip_rows = 12, columns = None:

In this case you don't get the error above and all columns are fetched, but compare the two dataframes: the one from pyarrow via to_pandas() and the one from the output of pandas read_csv(). You will notice that the first one has not parsed the null values ('\N') in the last column, catname, correctly. By contrast, pandas read_csv parsed all the null values correctly.

```
Out[28]: 
   1082  991   16.5    200 2014-09-10  1  bar
0  1082  997   0.55  100.0 2014-09-10  1  bar
1  1082  998   7.95  200.0 2014-03-03  0   \N
2  1083  998  12.50    NaN        NaT  0  bar
3  1083  999   1.00    NaN        NaT  0  foo
4  1084  994  57.30  100.0 2014-12-20  1   \N
5  1084  995  22.20    NaN        NaT  0  foo
6  1084  998  48.60  200.0 2014-12-20  1  foo
```
Python code to test the attached CSV file for the cases reported above:


```python
from pyarrow import csv
import pyarrow as pa
import pandas as pd

file_location = 'spc_catalog.tsv'

sep = '\t'
nulls = ['\\N']

columns = ['catcost', 'catqnt', 'catdate', 'catchk', 'catname']
column_names = None
column_types = None

skip_rows = None   # set skip_rows = 10 or 12 to reproduce the cases above
nrecords = None

csv.read_csv(file_location,
    parse_options=csv.ParseOptions(delimiter=sep),
    convert_options=csv.ConvertOptions(include_columns=columns, column_types=column_types, null_values=nulls),
    read_options=csv.ReadOptions(skip_rows=skip_rows, autogenerate_column_names=False, use_threads=True, column_names=column_names)
).to_pandas()

pd.read_csv(file_location, sep=sep, na_values='\\N', usecols=columns, nrows=nrecords, names=column_names, dtype=column_types)
```

Environment: Ubuntu bionic
Reporter: Athanassios Hatzis
Assignee: Joris Van den Bossche / @jorisvandenbossche

Original Issue Attachments: spc_catalog.tsv

Note: This issue was originally created as ARROW-7628. Please see the migration documentation for further details.

asfimport commented 4 years ago

Antoine Pitrou / @pitrou: Thanks for reporting this issue. I'm answering your concerns below.

Point (1): this is the normal semantics of skip_rows. As the documentation says for ReadOptions.autogenerate_column_names:

If false, column names will be read from the first CSV row after skip_rows. (emphasis mine)
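
For illustration, here is a minimal sketch of these semantics, using a small hypothetical in-memory CSV (not the attached file):

```python
import io
from pyarrow import csv

# Hypothetical 4-line CSV: a header row followed by three data rows.
data = b"a,b\n1,2\n3,4\n5,6\n"

# skip_rows=2 skips the header *and* the first data row; column names
# are then read from the first remaining row, i.e. "3" and "4".
table = csv.read_csv(io.BytesIO(data),
                     read_options=csv.ReadOptions(skip_rows=2))
print(table.column_names)  # ['3', '4']
```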

Point (2): by default, string columns are not checked for null values, because null values are also valid string values. But you can change this behaviour by passing strings_can_be_null=True to ConvertOptions. (again, see the documentation for ConvertOptions :-))
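
For example, a minimal sketch of this option, again with hypothetical in-memory data rather than the attached file:

```python
import io
from pyarrow import csv

# Hypothetical data where the string column contains "\N" markers.
data = b"name,qty\n\\N,1\nfoo,2\n"

table = csv.read_csv(
    io.BytesIO(data),
    convert_options=csv.ConvertOptions(null_values=['\\N'],
                                       strings_can_be_null=True))
# With strings_can_be_null=True, the "\N" in the string column "name"
# becomes a null instead of the literal string "\N".
print(table.column('name'))
```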

asfimport commented 4 years ago

Antoine Pitrou / @pitrou: So I don't think there is an Arrow bug here. However, perhaps we can try to make these things easier to find out. cc @nealrichardson  any thoughts?

asfimport commented 4 years ago

Athanassios Hatzis: Thanks @pitrou for clearing up these cases. Yes, I agree, it is a matter of semantics.

Point 2: perhaps it would be better to set strings_can_be_null=True automatically if the user specifies the null_values parameter.

Point 1: I am confused by these options. In my example above, if I specify column_names=['catcost', 'catqnt', 'catdate', 'catchk', 'catname'] for ReadOptions, then I also get the following error:

```
pyarrow.lib.ArrowInvalid: CSV parse error: Expected 5 columns, got 7
```

So I guess the corner case in the example above is: what is the right combination of parameters to read a subset of columns from a CSV file while also skipping the first N lines?

asfimport commented 4 years ago

Antoine Pitrou / @pitrou: Well, column_names has to map to the CSV file's columns, and the file has 7 columns, so you need to pass 7 names there.

include_columns will allow you to select which columns inside column_names are actually returned. Concretely, you probably want to pass something like:


```python
read_options=csv.ReadOptions(column_names=['catsid', 'catpid', 'catcost', 'catqnt', 'catdate', 'catchk', 'catname'])
convert_options=csv.ConvertOptions(include_columns=['catcost', 'catqnt', 'catdate', 'catchk', 'catname'])
```
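
For reference, the two options combined into a complete call might look like the following sketch, assuming the attached spc_catalog.tsv has a header line as its first row (hence skip_rows=1):

```python
from pyarrow import csv

all_names = ['catsid', 'catpid', 'catcost', 'catqnt',
             'catdate', 'catchk', 'catname']
wanted = ['catcost', 'catqnt', 'catdate', 'catchk', 'catname']

table = csv.read_csv(
    'spc_catalog.tsv',
    parse_options=csv.ParseOptions(delimiter='\t'),
    # column_names must name all 7 columns of the file; skip_rows=1
    # skips the file's own header line (assumed to be present).
    read_options=csv.ReadOptions(column_names=all_names, skip_rows=1),
    # include_columns then selects the subset actually returned.
    convert_options=csv.ConvertOptions(include_columns=wanted,
                                       null_values=['\\N'],
                                       strings_can_be_null=True))
```
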
asfimport commented 4 years ago

Athanassios Hatzis: Yes, that works. Thank you for your assistance.

asfimport commented 4 years ago

Joris Van den Bossche / @jorisvandenbossche:

Point (1): this is the normal semantics of skip_rows. As the documentation says for ReadOptions.autogenerate_column_names:

If false, column names will be read from the first CSV row after skip_rows.

@pitrou but that seems to contradict the documentation of skip_rows itself:

The number of rows to skip at the start of the CSV data, not including the row of column names (if any).

asfimport commented 4 years ago

Joris Van den Bossche / @jorisvandenbossche: [~athanassios] Note that pandas.read_csv has the same behaviour regarding skiprows (skipping rows from the beginning of the file, so potentially losing the header line). But in your code example, you didn't pass skip_rows (you used nrows, which has different behaviour).

So using your example, if I run:


```python
In [16]: skip_rows = 10

In [17]: pd.read_csv(file_location, sep=sep, na_values='\\N', skiprows=skip_rows)
Out[17]: 
   1082  991   16.5    200  2014-09-10  1  bar
0  1082  997   0.55  100.0  2014-09-10  1  bar
1  1082  998   7.95  200.0  2014-03-03  0  NaN
2  1083  998  12.50    NaN         NaN  0  bar
3  1083  999   1.00    NaN         NaN  0  foo
4  1084  994  57.30  100.0  2014-12-20  1  NaN
5  1084  995  22.20    NaN         NaN  0  foo
6  1084  998  48.60  200.0  2014-12-20  1  foo
```

you see that it also loses the header line. And thus, when specifying the column names to read, you also get an error:


```python
In [18]: pd.read_csv(file_location, sep=sep, na_values='\\N', skiprows=skip_rows, usecols=columns)
...
ValueError: Usecols do not match columns, columns expected but not found: ['catdate', 'catqnt', 'catname', 'catchk', 'catcost']
```

asfimport commented 4 years ago

Joris Van den Bossche / @jorisvandenbossche: I repurposed the issue to update the docs and will look into doing a PR.

asfimport commented 4 years ago

Francois Saint-Jacques / @fsaintjacques: Issue resolved by pull request 6463 https://github.com/apache/arrow/pull/6463