alex-hhh / data-frame

A data frame implementation for Racket
https://alex-hhh.github.io/2018/08/racket-data-frame-package.html

df-read/x for x separated data file #11

Closed md-arif-shaikh closed 2 years ago

md-arif-shaikh commented 3 years ago

Hi,

Is there an existing function to read data from a file where the data is separated by some string x other than ,? For example, x=" ".

alex-hhh commented 3 years ago

Currently, there is no such function. The field separator is hardcoded to , in df-read/csv and df-write/csv, which happens here: https://github.com/alex-hhh/data-frame/blob/475dc4c8b2d70ae704f6c634dcc77b09eae65d46/private/csv.rkt#L157

Unfortunately, changing the separator to space, as you suggested, would create problems, since the CSV parser will trim white space from around cells. Perhaps writing a separate function is the right way to go about it...

In any case, writing df-read/... functions is not special, and you can write your own by using functions provided by this package.

md-arif-shaikh commented 3 years ago

Hi @alex-hhh, I have been fiddling with this for some time, and I see that if I replace the hardcoded , in the csv.rkt file with an argument to the function, #:sep (sep #\,), then I can write to and read from comma-, tab-, and space-separated files correctly. I am thinking of a new function, as you suggested. Would you accept a PR if I find that this works after a bit more experimentation? Also, what would be a good name for such a function? What kind of tests would you suggest for it?
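
The change described above, turning a hardcoded separator into a parameter, can be illustrated with a minimal sketch. It is written in Python rather than Racket and relies on the standard csv module; read_delimited is a hypothetical name and is not part of the data-frame package.

```python
import csv
import io


def read_delimited(text, sep=","):
    """Parse delimited text into a list of rows, splitting on one `sep` character.

    Hypothetical helper for illustration; not part of the data-frame package.
    """
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    return [row for row in reader]


# The same data, written with three different single-character separators.
comma_rows = read_delimited("a,b\n1,2\n")
tab_rows = read_delimited("a\tb\n1\t2\n", sep="\t")
space_rows = read_delimited("a b\n1 2\n", sep=" ")
```

All three calls produce the same two rows, which matches the observation that a single-character separator works as long as the file uses exactly one separator between cells.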

alex-hhh commented 3 years ago

What is the data format you are trying to read? Is this specified somewhere?

The problem I see is that, when you replace the , with a space as the separator, then the separator is one single space, and this means that a file like the following would be read as a table of 4 columns and 2 rows, since there are several spaces between "1" and "2":

1   2
11  12

that is, the above would be equivalent to the following CSV file, even though the user probably intended to have a table of 2 columns.

1,,,2
11,,12,
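
The effect described above can be reproduced with a short sketch. It is written in Python rather than Racket and only mimics a single-space separator; it does not call the package's parser. The input has three spaces between the two values, as in the first row of the example.

```python
def split_single_space(line):
    """Split on every single space, so consecutive spaces yield empty cells."""
    return line.split(" ")


row = split_single_space("1   2")  # three spaces between "1" and "2"
as_csv = ",".join(row)             # the same row written with commas
```

Here row has four cells, two of them empty, and as_csv is the 1,,,2 line shown above.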

md-arif-shaikh commented 3 years ago

Sorry, I should have been more specific: by space I meant a single space, which is equivalent to #\space. I can try to see if I can make it work with multiple spaces. But I think extending sep from only , to (#\, #\space #\tab) might be very useful in itself.

alex-hhh commented 3 years ago

Extending the code to accept multiple spaces would not work well either, since now empty cells could no longer be represented. The fact is that using space as a separator is problematic. Tab is the same, since many editors expand tabs or combine multiple spaces into tabs or use a mixture of both, which results in tabular data which looks OK in a text editor, but it is problematic to read.
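
A small sketch of why collapsing runs of spaces loses empty cells. It is written in Python rather than Racket, and split_on_space_runs is a hypothetical helper, not something the package provides.

```python
import re


def split_on_space_runs(line):
    """Treat any run of one or more spaces as a single separator."""
    return re.split(r" +", line)


# Intended row: three cells with the middle one empty ("1,,3" in CSV).
# With a single-space separator that row is written as "1", space, space, "3".
cells = split_on_space_runs("1  3")
```

The result has two cells instead of three: once runs of spaces are merged, an empty middle cell cannot be told apart from extra padding.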

I understand that changing the separator to a single space worked for your data set, which happened to have a single space between numbers, but it would not work in the general case...

This is why I asked if you have some specification for the data format you are trying to read.

For example, Excel handles these types of files by allowing the user to specify the tab width and the column number where each cell starts -- this approach would require a completely different parser than df-read/csv with a space separator...
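
A fixed-width reader of the kind described above could look roughly like the following sketch. It is written in Python rather than Racket; split_fixed_width and the column start positions are assumptions made for illustration and say nothing about how Excel or df-read/csv are implemented.

```python
def split_fixed_width(line, starts):
    """Cut `line` into cells that begin at the given column offsets.

    `starts` is a sorted list of 0-based character positions; each cell runs
    up to the next start (or the end of the line) and is stripped of padding.
    """
    bounds = list(starts) + [len(line)]
    return [line[a:b].strip() for a, b in zip(bounds, bounds[1:])]


# Columns start at offsets 0, 4 and 8; the middle cell of the second row is empty.
lines = ["1   2   3", "11      13"]
table = [split_fixed_width(line, [0, 4, 8]) for line in lines]
```

Because cell boundaries come from positions rather than from a separator character, the empty middle cell in the second row survives.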

md-arif-shaikh commented 3 years ago

I think I get it. I usually create my own datasets, say from numerical simulations, so I have full control over the data format, and therefore it works for me. But, yes, as you said, it would cause problems for general-purpose use.