astropy.io.ascii does not read pandas csv file correctly

astropy / astropy

Astronomy and astrophysics core library

https://www.astropy.org

BSD 3-Clause "New" or "Revised" License


astropy.io.ascii does not read pandas csv file correctly #6694

Open astrofrog opened 7 years ago

astrofrog commented 7 years ago

Pandas by default will write CSV files where the header for the first column is missing:

In [4]: from pandas import DataFrame

In [5]: df = DataFrame()

In [6]: df['a'] = [1,2,3]

In [7]: df.to_csv('test.csv')

In [8]: %more test.csv
,a
0,1
1,2
2,3

Whether this is sensible is debatable, but it means there are a lot of CSV files in the wild whose first header field is empty. Astropy does not read these correctly:

In [10]: from astropy.io.ascii import read

In [11]: read('test.csv')
Out[11]: 
<Table masked=True length=4>
 col1 col2
int64 str1
----- ----
   --    a
    0    1
    1    2
    2    3

I think we should either special-case this or handle such files more gracefully, given how common they are going to be.
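Until such files are handled directly, one possible workaround is to give the blank header field a name before handing the data to a reader. The following is only a minimal sketch using the standard library; the helper `name_blank_first_header` and the default name "index" are hypothetical choices for illustration, not part of astropy or pandas.

```python
import csv
import io


def name_blank_first_header(text, name="index"):
    """Return CSV text with an empty first header field replaced by `name`.

    Hypothetical helper: `name` is an arbitrary placeholder label.
    """
    rows = list(csv.reader(io.StringIO(text)))
    # Only touch the header when its first field is empty (the pandas default).
    if rows and rows[0] and rows[0][0] == "":
        rows[0][0] = name
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


raw = ",a\n0,1\n1,2\n2,3\n"
fixed = name_blank_first_header(raw)
print(fixed)
```

The returned string could then be passed to astropy.io.ascii.read with format='csv', which accepts table data as a newline-separated string; that last step is not exercised in this sketch.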

MSeifert04 commented 7 years ago

Interesting question. The first column is the index, so it doesn't make sense to read it in as a "normal" column. On the other hand, that raises the question of whether pandas users should write the index column to the CSV at all if they want to use the file in other programs...

drdavella commented 7 years ago

Doesn't the leading comma in the header row indicate that the first "column" is really the index? This means there should be a reliable way to detect this case. If instead you wrote the same file with df.to_csv('test.csv', index=False), then you should just see:

In [8]: %more test.csv
a
1
2
3
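The detection idea can be sketched as a check on the header row alone. This is a heuristic under stated assumptions, not what astropy does: the function name `first_header_is_blank` is hypothetical, and a blank first field only suggests, rather than proves, that the column is a pandas index.

```python
import csv
import io


def first_header_is_blank(text):
    """Return True if the header has several fields and the first is empty."""
    # Parse only the first row; an empty input yields an empty header.
    header = next(csv.reader(io.StringIO(text)), [])
    return len(header) > 1 and header[0].strip() == ""


with_index = ",a\n0,1\n1,2\n2,3\n"   # written with the default index=True
without_index = "a\n1\n2\n3\n"        # written with index=False

print(first_header_is_blank(with_index), first_header_is_blank(without_index))
```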

pllim commented 7 years ago

At the same time, should pandas fix this on their side too?

drdavella commented 7 years ago

@pllim, I don't think it's a bug. I think it's how pandas indicates that the first column in a CSV file is an index, not a real data column. It's possible to give the index column a name (via the index_label argument), but I believe it's None by default. See the docs here:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
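As a small illustration of naming the index column on the pandas side, the sketch below writes to an in-memory buffer with index_label set. The label "index" is an arbitrary choice for this example, and the exact line terminator may depend on the platform and pandas version.

```python
import io

from pandas import DataFrame

df = DataFrame({"a": [1, 2, 3]})

# Passing index_label gives the index column an explicit header name,
# so the first header field is no longer empty.
buf = io.StringIO()
df.to_csv(buf, index_label="index")
named = buf.getvalue()
print(named)
```

Files written this way have a complete header row, so readers that expect one name per column should not need any special handling.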

astrofrog commented 7 years ago

I've already seen examples in the wild where people try to read these files with Astropy and fail. I know it's frustrating, but I do think we should support this 'format'.

drdavella commented 7 years ago

I could volunteer to look into this since I think I'd probably learn something new/useful. But if someone else already has a handle on a fix, that's okay too.