Rdatatable / data.table


FR: Allow `dec == sep` in `fread` to be applied to quoted data #6604

Open iago-pssjd opened 22 hours ago

iago-pssjd commented 22 hours ago

Found a csv file with data formatted as follows:

13800,10864,"27,03","3,2","9,8"

If I pass neither dec nor sep, the columns are split correctly, but the quoted numeric data is read as character. If I specify only dec = ",", the columns are wrong (it splits on \t or blanks instead of commas). If I specify both dec = "," and sep = ",", I get:

 sep == dec (',') is not allowed

I'm aware that this combination is ambiguous, since 13800,10864 could be read as a single decimal number. To resolve that, I would assume that dec = "," applies only to data inside quotes.

ecoRoland2 commented 18 hours ago

Please show the output you get. I see this:

> fread(text = '13800,10864,"27,03","3,2","9,8"')
      V1    V2     V3     V4     V5
   <int> <int> <char> <char> <char>
1: 13800 10864  27,03    3,2    9,8

This is as expected, so I'm unsure where you see a bug. Quoted fields should be imported as character strings.

iago-pssjd commented 18 hours ago

Indeed, I meant quoted numeric data in V3 to V5, so I would like to get

> fread(text = '13800,10864,"27,03","3,2","9,8"')
      V1    V2     V3     V4     V5
   <int> <int>  <num>  <num>  <num>
1: 13800 10864  27.03    3.2    9.8

ecoRoland2 commented 17 hours ago

Well, then I think you are asking too much here. You have quoted strings but don't want them parsed as strings, which goes against common convention. Leaving that aside, you would have commas as both column separator and decimal separator, which simply isn't a valid file format. I suggest you post-process after importing, e.g., as.numeric(sub(",", ".", "27,03", fixed = TRUE)).
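
A minimal sketch of that post-processing applied to whole columns rather than a single value. It assumes every character column in the result holds comma-decimal numbers; selecting columns with is.character is an assumption made for this example, not something fread does for you. Any character column that holds real text would come out as NA with a warning, so restrict chr_cols by name if the file mixes the two.

```r
library(data.table)

# Read the sample row; the quoted fields arrive as character columns.
dt <- fread(text = '13800,10864,"27,03","3,2","9,8"')

# Assumption: all character columns are comma-decimal numbers.
chr_cols <- names(dt)[vapply(dt, is.character, logical(1))]

# Swap the decimal comma for a dot, then convert, updating by reference.
dt[, (chr_cols) := lapply(.SD, function(x) {
  as.numeric(sub(",", ".", x, fixed = TRUE))
}), .SDcols = chr_cols]

str(dt)
```

sub() replaces only the first comma in each value, which is enough for a plain decimal comma; values that also use a thousands separator would need extra handling.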