brry / rdwd

download climate data from DWD (German Weather Service)
https://bookdown.org/brry/rdwd
70 stars 12 forks source link

subdaily data not being read correctly #11

Closed PhilippBuehler closed 4 years ago

PhilippBuehler commented 4 years ago

Sub-daily datasets are not extracted correctly, as seen in my example.

library(rdwd)
link <- selectDWD(id=10381, res="subdaily", var="standard_format", per="historical")
file <- dataDWD(link, read=FALSE)
clim <- readDWD(file, varnames=FALSE)
brry commented 4 years ago

Hi, I'm on vacation and will try to have a look at it soon. Just to confirm we have the same errors, can you verify these?

For the first file (subdaily_standard_format_kl_10381_00_akt.txt) I get:

Error in names(x) <- value :  'names' attribute [51] must be the same length as the vector [9]

For the second file (subdaily_standard_format_kl_10381_bis_1999.txt.gz) I get:

Error: $ operator is invalid for atomic vectors
In addition: Warning messages:
1: In doTryCatch(return(expr), name, parentenv, handler) :
  NAs introduced by coercion
2: In readheader(4, asnum = TRUE) : NAs introduced by coercion
3: In doTryCatch(return(expr), name, parentenv, handler) :
  NAs introduced by coercion
4: Error in readBin(confile, what = raw(), n = n, endian = "little") : 
  invalid 'n' argument
brry commented 4 years ago

I had a quick look and saw that the data format is entirely different from all other datasets. Can you help to develop a reading function? I won't have that amount of time this week nor early next week.

I understand the section "Considerations for applications" in the description file (1) on a quick glance to say that most (and much more) data should be available at ftp://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl
Does it make sense for you to have a look at that data? It is relatively well tested within rdwd.

(1) ftp://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/subdaily/standard_format/DESCRIPTION_obsgermany_climate_subdaily_standard_format_en.pdf

brry commented 4 years ago

I couldn't resist trying some stuff. Here's the intermediate results. Let me know if this is in the direction you need. You'll need the very recent dev version of rdwd for the recent/historical URL selection.

# towards readDWD.stand

library(rdwd)
link <- selectDWD(id=10381, res="subdaily", var="standard_format", per="r")
file <- dataDWD(link, dir=localtestdir(), read=FALSE)
# sf <- readDWD(file, varnames=FALSE)

# read format description:
format_url <- paste0(dwdbase,"/subdaily/standard_format/formate_kl.html")
format_html <- readLines(format_url, encoding="UTF-8")
format_html <- gsub("&nbsp;", "", format_html)
format_html <- gsub("&#176;", "deg", format_html)
format_html <- format_html[!grepl("Formatbeschreibung", format_html)]
rdwd:::checkSuggestedPackage("XML", "readDWD.stand")
format_info <- XML::readHTMLTable(doc=format_html, header=TRUE, stringsAsFactors=FALSE)[[1]]
# get column widths:
width <- diff(as.numeric(format_info$Pos))
width <- c(width, 200)
# ToDo: consider moving to a function, caching/saving output (updated along with indexes?)

# read fixed width dataset
sf <- read.fwf(file, widths=width, na.strings="-999")#, n=10) # this takes >20secs to read
# ToDo: definitely look at data.table/vroom/... for speedup here!
colnames(sf) <- format_info$Label
sf$Date <- as.Date(paste(sf$JA,sf$MO,sf$TA,sep="-"), "%F")
for(i in c("P1","P2","P3","PM","TXK","TNK","TRK","TGK","T1","T2","T3","TMK",
           "TF1","TF2","TF3","VP1","VP2","VP3","VPM", "FMK","NM","SDK",
           "R1","R2","R3","RSK", "FXK","ASH","WAAS","WASH"))
  sf[,i] <- sf[,i]/10
plot(sf$Date, sf$SHK, type="l") # ToDo: NA corrections with  format_info$Fehlk

# Plot all columns:
{
pdf("sf_columns.pdf", width=9)
par(mfrow=c(2,1),mar=c(2,3,2,0.1), mgp=c(3,0.7,0), las=1)
for(i in 2:ncol(sf)-1) plot(sf$Date, sf[,i], type="l", main=colnames(sf)[i], ylab="")
dev.off()
}
openFile("sf_columns.pdf")
PhilippBuehler commented 4 years ago

Pretty nice work and coding, good job!

It works well but keep in mind a few things:

Also, feel free to use my data.table approach (with help from here: https://stackoverflow.com/q/24715894/6358363):

library(stringi)
library(data.table)
cols = list(beg = as.numeric(format_info$Pos),  end = as.numeric(format_info$Pos)[-1L] - 1L)
sf2 <-  fread(file[2], sep = "\n", header = FALSE)
sf2 <- sf2[ , lapply(seq_len(length(cols$beg)), function(ii) stri_sub(V1, cols$beg[ii], cols$end[ii]))]
colnames(sf) <- format_info$Label
brry commented 4 years ago

So I did find some time already. With the development version, things seem to read fine both for current and historical data. https://github.com/brry/rdwd#latest-version

See the examples at https://github.com/brry/rdwd/blob/master/man/readDWD.stand.Rd#L41-L60

I didn't see the data.table idea yet and kept speed as a ToDo for the moment. Hope to find time for that soon. It optimally handles .gz files directly as read.fwf does

brry commented 4 years ago

As of Version 1.1.25, reading speed is now fine too. I use readr, which is about as fast as data.table, but already returns numerical columns where appropriate. Can you help out by testing the whole thing on some more files?

PhilippBuehler commented 4 years ago

Works properly and is also fast! Thanks