Changing headers after `df-read/csv`

Metaxal commented 3 years ago

Hi Alex,

Is there a way to change the headers of the resulting data frame after `df-read/csv`? My CSV file contains only the raw data, without headers, and I would like to add them manually.

Thanks for the good work and the great documentation,
Laurent

alex-hhh commented 3 years ago

There is currently no way to rename the headers, since I have not encountered this case before. Do you have any suggestions on how this functionality should be presented to the user? (What would the API look like?)

alex-hhh commented 3 years ago

Here is how it could be implemented outside of the data-frame package:

#lang racket
(require data-frame)

;; Generate some sample data, a CSV file with no headers

(define sample-data #<<EOS
1,2,3
4,5,6
7,8,9
EOS
  )

(call-with-output-file "./data.csv" #:exists 'replace
  (lambda (out)
    (write-string sample-data out)))

;; Read the data in
(define df (df-read/csv "./data.csv" #:headers? #f))

;; Series are named "col1", "col2" and "col3" -- default for data with no
;; headers
(df-describe df)

;; A function to read a CSV file but supply the headers.  NOTE: no
;; verification is done in this function, in particular, the header count must
;; match the column count in the data file.
(define (df-read/csv/with-headers input headers)
  (define header-row (string-append (string-join headers ",") "\n"))
  (define header-input (open-input-string header-row))
  (define data-input (if (input-port? input) input (open-input-file input)))
  (define ninput (input-port-append #t header-input data-input))
  (df-read/csv ninput #:headers? #t))

;; Read the data by supplying our own headers
(define dfh (df-read/csv/with-headers "./data.csv" '("First" "Second" "Third")))

;; The new data frame has the user supplied headers.
(df-describe dfh)

Metaxal commented 3 years ago

Thanks, that's one way to do it indeed. sawzall also has a rename utility, which I suspect is based on duplicating data-frame columns, which is not ideal.

Another use case I have is renaming an existing column from a csv file, because the corresponding column name in the (autogenerated) csv file is terrible.

So probably the simplest API would just be a df-set-series-names! or something like this. One challenge, IIUC, is that in data-frame the columns are unordered by default, so the CSV order may not be the actual order in the frame? (It could make sense to impose an order though, possibly optionally.)

Otherwise, a keyword argument #:column-names to df-read/csv could work, either to add names if none exist in the file, or to replace names if they already exist. In the latter case, it may be annoying for a user who merely wants to change one or a few column names, since every name would have to be supplied.

Another option is to support something like sawzall's rename directly, without column duplication.

alex-hhh commented 3 years ago

You can also replace headers in a CSV file using the same technique I showed here https://github.com/alex-hhh/data-frame/issues/10#issuecomment-928464848, by first reading one line from data-input to discard the old headers. In general, this technique is more flexible, since there are all sorts of weird CSV files out there, and it is costly to support all these cases using parameters to df-read/csv. For example, one type of CSV file I had to read had a few "key-value" pairs in the first few lines, before the proper CSV table started...
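A minimal sketch of that idea, adapted from the `df-read/csv/with-headers` helper earlier in this thread. The name `df-read/csv/replace-headers` and its `#:skip` keyword are hypothetical and not part of the data-frame package; the helper simply reads and discards `skip` leading lines (the old header row, or a preamble) before prepending the new header row. As with the earlier helper, no check is made that the header count matches the column count.

```racket
#lang racket
(require data-frame)

;; Hypothetical helper, not part of data-frame: read a CSV file, throw away
;; its first SKIP lines (old headers, preamble lines, ...) and use HEADERS as
;; the series names instead.
(define (df-read/csv/replace-headers input headers #:skip [skip 1])
  (define header-row (string-append (string-join headers ",") "\n"))
  (define header-input (open-input-string header-row))
  (define data-input (if (input-port? input) input (open-input-file input)))
  ;; Discard the unwanted leading lines before joining the two ports
  (for ([i (in-range skip)])
    (read-line data-input 'any))
  (define ninput (input-port-append #t header-input data-input))
  (df-read/csv ninput #:headers? #t))

;; Sample file whose existing headers we want to replace
(call-with-output-file "./data-with-headers.csv" #:exists 'replace
  (lambda (out)
    (write-string "a,b,c\n1,2,3\n4,5,6\n" out)))

(define dfr
  (df-read/csv/replace-headers "./data-with-headers.csv"
                               '("First" "Second" "Third")))

;; The series should now carry the supplied names rather than "a", "b", "c"
(df-describe dfr)
```

For a file with several preamble lines before the table, the same helper could be called with a larger `#:skip` value, counting the old header row if there is one.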

A df-rename-series! function could also be implemented, but it would have to materialize all derived series and be careful to update the secondary indexes as well...

As for the series not being ordered inside a data-frame, this is intentional, and will stay this way. The columns are read from CSV files correctly, mapping headers to series names, or, if there are no headers, creating columns named "col1", "col2", etc. where "col1" would be the first column in the CSV file. When you write CSV files, you can also specify a list of columns and the data will be written in the order you specify. However, inside the data frame, columns are not ordered.
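A short sketch of choosing the column order at write time. It assumes that `df-write/csv` accepts an output port followed by the series names to write, in the desired order; please check the data-frame documentation for the exact signature before relying on it.

```racket
#lang racket
(require data-frame)

;; Read a small table from a string port; the first row supplies the headers.
(define df (df-read/csv (open-input-string "a,b,c\n1,2,3\n4,5,6\n")))

;; Write the series back out in an explicit order ("c" first, then "a", "b").
;; Assumption: df-write/csv takes the data frame, an output port (or path)
;; and then the series names in the order they should appear.
(define out (open-output-string))
(df-write/csv df out "c" "a" "b")
(define out-text (get-output-string out))

(display out-text)
```

The order is a property of this one `df-write/csv` call, not of the data frame itself, which stays unordered.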

Metaxal commented 3 years ago

> You can also replace headers in a CSV file using the same technique I showed here #10 (comment), by reading the first line from data-input to discard the old headers. In general, this technique is more flexible, since there are all sorts of weird CSV files out there, and it is costly to support these cases using parameters to df-read/csv -- one type of CSV file I had to read had a few "key-value" pairs in the first few lines, before the proper CSV table started...

Skipping lines should be fine indeed. I'm more worried about adding a header; this feels a little too much like a hack.

> A df-rename-series! function could also be implemented, but it would have to materialize all derived series and be careful to update the secondary indexes as well...
>
> As for the series not being ordered inside a data-frame, this is intentional, and will stay this way. The columns are read from CSV files correctly, mapping headers to series names, or, if there are no headers, creating columns named "col1", "col2", etc. where "col1" would be the first column in the CSV file. When you write CSV files, you can also specify a list of columns and the data will be written in the order you specify. However, inside the data frame, columns are not ordered.

Thanks for the info. I'm closing the issue as there's a workaround for now.

alex-hhh commented 3 years ago

I added a df-rename-series! operation on data frames, as it seems to be a useful feature.
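A small usage sketch for the headerless CSV case from the start of this thread. It assumes the signature is `(df-rename-series! df old-name new-name)` and that the operation mutates the data frame in place; check the package documentation to confirm both.

```racket
#lang racket
(require data-frame)

;; Headerless data: series get the default names "col1", "col2", "col3"
(define df
  (df-read/csv (open-input-string "1,2,3\n4,5,6\n7,8,9\n") #:headers? #f))

;; Assumed signature: (df-rename-series! df old-name new-name)
(df-rename-series! df "col1" "First")
(df-rename-series! df "col2" "Second")
(df-rename-series! df "col3" "Third")

;; The data frame should now report the new series names
(df-describe df)
```

This avoids both the header-prepending workaround and any copying of columns under a new name.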

Metaxal commented 3 years ago

Awesome, thanks Alex! (cc @ralsei )
