civisanalytics / civis-python

Civis API Python Client
BSD 3-Clause "New" or "Revised" License
34 stars 26 forks source link

`civis.io.dataframe_to_civis` doesn't auto-detect headers #263

Closed elsander closed 5 years ago

elsander commented 6 years ago

I used dataframe_to_civis on a dataframe with column names. The resulting table on Redshift added the header as an additional row in the table, and gave the table deafult column names (column_0, column_1). I think the expected behavior would be for the function to auto-detect the header and add it as the column names in the table. The real bug here is the fact that the function writes the header to file by default, but does not enforce the header being read as column names in Platform. In case it's relevant, the R client did automatically handle the header correctly for this same dataset, so there's also a discrepancy between the API clients.

keithing commented 6 years ago

Do you have a minimal working example of this failure? The client is relying on Platform to auto-detect headers, so there might be special conditions when it fails. This code below worked for me:

import pandas as pd
from civis.io import dataframe_to_civis, read_civis

table = TABLE
database = DATABASE

df = pd.DataFrame({'a': [1,2,3], 'b': ['hi', 'there', 'world']})
dataframe_to_civis(df, table=table, database=database).result()
df2 = read_civis(table=table, database=database, use_pandas=True)
assert all(df.columns == df2.columns)
elsander commented 6 years ago

Here's the code I ran where I got this result:

In [1]: import civis

In [2]: import pandas as pd

In [3]: reg = pd.read_csv('region_state.csv', header=0)

In [4]: reg
Out[4]:
       region  state
0        West     AK
1        West     HI
2        West     WA
3        West     OR
4        West     CA
5        West     NV
6        West     ID
7        West     MT
8        West     WY
9        West     UT
10       West     CO
11       West     AZ
12       West     NM
13    Midwest     ND
14    Midwest     SD
15    Midwest     NE
16    Midwest     KS
17    Midwest     MN
18    Midwest     IA
19    Midwest     MO
20    Midwest     WI
21    Midwest     IL
22    Midwest     IN
23    Midwest     MI
24    Midwest     OH
25      South     TX
26      South     OK
27      South     AR
28      South     LA
29      South     MS
30      South     AL
31      South     TN
32      South     KY
33      South     GA
34      South     FL
35      South     SC
36      South     NC
37      South     VA
38      South     MD
39      South     DE
40      South     WV
41      South     DC
42  Northeast     PA
43  Northeast     NJ
44  Northeast     CT
45  Northeast     RI
46  Northeast     NY
47  Northeast     MA
48  Northeast     VT
49  Northeast     NH
50  Northeast     ME
51      South  NC-SC
52      South  WV-OH
53      South  WV-KY
54    Midwest  ND-MN
55    Midwest  NE-IA
56    Midwest  IA-IL

In [5]: civis.io.dataframe_to_civis(reg, database='xyz', table='test.test')
Out[5]: <CivisFuture at 0x112fd5438 state=running>
elsander commented 6 years ago

FWIW, I successfully ran this before. The new things in this run vs old successful run are the multi-state rows at the bottom, and possibly a different version of the civis client? I think I did run this successfully on 1.9.0 before though.

elsander commented 6 years ago

Let me know if you have trouble reproducing, and I can send you a copy of the file.

stephen-hoover commented 6 years ago

I think that regardless of whether the import is buggy or not, we should deprecate the headers parameter of dataframe_to_civis and always pass True in the civis_file_to_table call. A user modifying that parameter would either have no effect or cause the import to break.

keithing commented 6 years ago

Let me know if you have trouble reproducing, and I can send you a copy of the file.

Thanks, that'd be helpful. The example above worked for me on v1.9.0.

I think that regardless of whether the import is buggy or not, we should deprecate the headers parameter

That seems reasonable to me.