edwindj / cbsodataR

Statistics Netherlands (CBS) OpenData API Client for R
https://edwindj.github.io/cbsodataR

Strip whitespace #4

Closed J535D165 closed 7 years ago

J535D165 commented 7 years ago

Stripping whitespace seems like a good idea. See the output below:

> t <- get_data('37556')
Writing TableInfos.csv...
Writing DataProperties.csv...
Writing CategoryGroups.csv...
Writing Perioden.csv...
Retrieving data from table '37556'
Done!
> t$Mannen_2
  [1] "       ." "    2521" "    2550" "    2584" "    2622" "    2660" "    2699" "    2737"
  [9] "    2777" "    2817" "    2855" "    2899" "    2945" "    2987" "    3037" "    3088"
 [17] "    3141" "    3188" "    3236" "    3282" "    3311" "    3352" "    3410" "    3465"
 [25] "    3516" "    3574" "    3629" "    3683" "    3735" "    3785" "    3838" "    3886"
 [33] "    3943" "    4006" "    4068" "    4124" "    4177" "    4221" "    4264" "    4307"
 [41] "    4353" "    4408" "    4454" "    4497" "    4530" "    4558" "    4603" "    4634"
 [49] "    4748" "    4838" "    4926" "    4998" "    5084" "    5146" "    5198" "    5256"
 [57] "    5321" "    5391" "    5460" "    5529" "    5619" "    5686" "    5754" "    5838"
 [65] "    5924" "    6001" "    6091" "    6178" "    6262" "    6317" "    6383" "    6465"
 [73] "    6550" "    6624" "    6676" "    6722" "    6772" "    6837" "    6872" "    6907"
 [81] "    6945" "    6994" "    7048" "    7082" "    7103" "    7124" "    7150" "    7185"
 [89] "    7224" "    7274" "    7317" "    7358" "    7420" "    7480" "    7535" "    7586"
 [97] "    7627" "    7662" "    7697" "    7740" "    7793" "    7846" "    7910" "    7972"
[105] "    8015" "    8046" "    8066" "    8077" "    8089" "    8112" "    8156" "    8203"
[113] "    8243" "    8283" "    8307" "    8334" "    8373" "    8417"

Maybe add strip.white=TRUE to line https://github.com/edwindj/cbsodataR/blob/master/R/get-data.R#L31? Not tested.
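As a rough illustration of what that argument does, here is a minimal sketch using base R's read.csv on an in-memory string. The two-row sample and the `Perioden` values are made up for the example (only the padded `Mannen_2` values are taken from the output above), and it assumes the fields are unquoted, since strip.white only affects unquoted character fields.

```r
# Hypothetical two-row sample mimicking the padded values shown above.
csv_text <- "Perioden,Mannen_2\n1899JJ00,       .\n1900JJ00,    2521\n"

# Default behaviour: character fields keep their leading whitespace.
raw <- read.csv(text = csv_text, colClasses = "character")

# With strip.white = TRUE the padding is removed while reading.
stripped <- read.csv(text = csv_text, colClasses = "character",
                     strip.white = TRUE)

print(raw$Mannen_2)
print(stripped$Mannen_2)
```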

edwindj commented 7 years ago

Another option would be to transform data directly into numeric/integer. Any preferences?
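A minimal sketch of what such a conversion could look like on a single column, using only base R. It assumes that "." marks a missing value, which the output above suggests but this thread does not confirm; the vector `x` is a hypothetical excerpt of `t$Mannen_2`.

```r
# Hypothetical excerpt of the padded character column.
x <- c("       .", "    2521", "    2550")

# Remove the padding, then map the assumed missing-value marker to NA.
x <- trimws(x)
x[x == "."] <- NA

# Convert the cleaned strings to integers.
values <- as.integer(x)
print(values)  # NA 2521 2550
```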

J535D165 commented 7 years ago

Thanks for the reply!

I like the idea of storing the data in the way it was received. No changes to the datatypes.

Maybe we can import all columns with read.csv without setting colClasses="character", except for the columns that need denormalization. Something like:

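# `meta` is assumed to have one named entry per column that needs denormalization
meta.names <- names(meta)
meta.types <- rep("character", times = length(meta.names))
names(meta.types) <- meta.names
# Only the columns named in meta.types are forced to character; the rest are type-detected
read.csv('path_to_data.csv', strip.white = TRUE, colClasses = meta.types)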

The warnings that read.csv generates for columns named in colClasses but missing from the file can be suppressed.
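One way to do that, sketched with base R's suppressWarnings. The sample data, the `Geslacht` entry (deliberately absent from the file), and the use of na.strings = "." are assumptions made for illustration; without na.strings the "." entry would likely keep `Mannen_2` from being detected as numeric.

```r
# Hypothetical sample; `Geslacht` is named in meta.types but not in the file.
csv_text <- "Perioden,Mannen_2\n1899JJ00,       .\n1900JJ00,    2521\n"
meta.types <- c(Perioden = "character", Geslacht = "character")

# read.csv warns that not all columns named in colClasses exist;
# suppressWarnings silences that (and any other warning from the call).
dat <- suppressWarnings(
  read.csv(text = csv_text, strip.white = TRUE,
           colClasses = meta.types, na.strings = ".")
)

str(dat)
```

Note that suppressWarnings hides every warning raised by the call, so a narrower handler may be preferable if other warnings matter.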

I don't know what to do with the 'denormalization' columns themselves. They remain character typed.