Select header row number when reading CSV files

JonGretar commented 11 months ago

It would be helpful to add a :header_row option for reading CSV files, separate from the :skip_rows option. It is not uncommon, especially when working with scientific equipment, for the header not to be in the first row and for non-data rows to follow it.

As an example, consider this excerpt from an eddy covariance data file:

"TOA5","6843","CR3000","6843","CR3000.Std.22","CPU:CA_Flux__GOOD.CR3","24006","ts_Above"
"TIMESTAMP","RECORD","Ux","Uy","Uz","co2","h2o","Ts","press","diag_csat"
"TS","RN","m/s","m/s","m/s","mg/m^3","g/m^3","C","kPa","m/s"
"","","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp"
"2012-06-07 13:00:00.05",111868400,0.468,-0.9077501,0.1785,659.7584,9.530561,28.52527,100.1938,0
"2012-06-07 13:00:00.1",111868401,0.60275,-1.0795,0.283,660.0234,9.492132,28.51141,100.1938,0
....

Here the first row is metadata about the equipment.
The second row holds the column names.
The third row holds the units.
The fourth row is other metadata.
The data starts at the fifth row.

Of course, reading this is not complex: use skip_rows: 1 and then delete the first two rows of the dataframe. But this is such a common pattern in scientific data that it might be worth supporting directly in the read_csv/2 function.
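
To make the requested behaviour concrete, here is a minimal, language-neutral sketch using only the Python standard library. It is not Explorer's or Polars' API. The function name read_with_header_row and its parameters header_row and skip_rows_after_header are hypothetical, chosen only to mirror the option names mentioned in this thread. The sample is the excerpt shown above without the elided rows.

```python
import csv
import io

# The excerpt from this issue, minus the elided "...." rows.
SAMPLE = '''\
"TOA5","6843","CR3000","6843","CR3000.Std.22","CPU:CA_Flux__GOOD.CR3","24006","ts_Above"
"TIMESTAMP","RECORD","Ux","Uy","Uz","co2","h2o","Ts","press","diag_csat"
"TS","RN","m/s","m/s","m/s","mg/m^3","g/m^3","C","kPa","m/s"
"","","Smp","Smp","Smp","Smp","Smp","Smp","Smp","Smp"
"2012-06-07 13:00:00.05",111868400,0.468,-0.9077501,0.1785,659.7584,9.530561,28.52527,100.1938,0
"2012-06-07 13:00:00.1",111868401,0.60275,-1.0795,0.283,660.0234,9.492132,28.51141,100.1938,0
'''


def read_with_header_row(text, header_row=0, skip_rows_after_header=0):
    """Hypothetical helper: pick the header by zero-based row index and
    drop a fixed number of non-data rows that follow it."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[header_row]
    # Data begins after the header plus the skipped metadata rows.
    data = rows[header_row + 1 + skip_rows_after_header:]
    return header, data


# Header is the second row (index 1); the units and "Smp" rows are skipped.
header, data = read_with_header_row(SAMPLE, header_row=1, skip_rows_after_header=2)
print(header)
print(len(data), "data rows")
```

Note that csv.reader returns every field as a string, so a real implementation would still need type inference on the remaining data rows.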

Of course I would also love to be able to save the units row as a series attribute. But that is a discussion for another issue. 😉

josevalim commented 11 months ago

If this is supported in polars, then :+1: for a PR that adds this.

JonGretar commented 11 months ago

Hmmm

Polars has 'skip_rows_after_header'. I'll take a look at adding that.

elixir-explorer / explorer

Select header row number when reading CSV files #781