cloudyr / bigQueryR

R Interface with Google BigQuery
http://code.markedmondson.me/bigQueryR
Other
41 stars 21 forks source link

Unable to specify schema for bqr_upload_data() for non-gs:// data sources #48

Closed RhysJackson closed 6 years ago

RhysJackson commented 6 years ago

https://github.com/cloudyr/bigQueryR/blob/af38d0e8de01e7c927e06a904f5a5dfd5a82ddb2/R/uploadData.R#L155

Example:

# Create data.frame with field of class Date
df <- data.frame(
  dates = seq.Date(Sys.Date()-9, Sys.Date(), by = 1),
  x = 1:10
)

# Attempt to append to an existing BigQuery table with the following schema:
# [{name: dates, type: DATE}, {name: x, type: INTEGER}]
bqr_upload_data(bqProject, bqDataset, bqTable, df)

# Returns:
Error: API returned: Provided Schema does not match Table myProject:myDataset.myTable. Field dates has changed type from DATE to TIMESTAMP

Explanation:

I'm unable to append data from a non-gs:// source to an existing BigQuery table due to a mismatch between the existing schema, and the schema produced by schema_fields().

Specifically, Date values are being interpreted as Timestamps in the schema being passed to BigQuery.

Is there any way of passing custom schema to bqr_upload_data() for non cloud storage data sources?

MarkEdmondson1234 commented 6 years ago

Not as its implemented now, but I suppose it makes sense to override it - but I think this is probably more convenient if all you needed to do was use R's Datetypes to make it BigQuery DATE, and leave R's POSIXct as TIMESTAMP

Would mean changing this https://github.com/cloudyr/bigQueryR/blob/af38d0e8de01e7c927e06a904f5a5dfd5a82ddb2/R/uploadData.R#L313-L326 to the below:

## From bigrquery
data_type <- function(x) {
  switch(class(x)[1],
         character = "STRING",
         logical = "BOOLEAN",
         numeric = "FLOAT",
         integer = "INTEGER",
         factor = "STRING",
         Date = "DATE", # <- changed this to DATE
         POSIXct = "TIMESTAMP",
         hms = "INTEGER",
         difftime = "INTEGER",
         stop("Unknown class ", paste0(class(x), collapse = "/"))
  )
}
MarkEdmondson1234 commented 6 years ago

Gave it a go, try it with the GitHub version. This may break existing uploads though for people.

MarkEdmondson1234 commented 6 years ago

...so to support that you can also now supply your own schema. schema_fields(upload_data) is exported so you can modify that result if required.

RhysJackson commented 6 years ago

Wow, that was quick! Thanks Mark.

With the first change I'm getting a different, but related error: Error while reading data, error message: Could not parse '1515628800' as date for field Date

Are the date fields getting coerced into POSIXct somewhere along the line?

I'll try the modified schema_fields() approach in just a moment.

MarkEdmondson1234 commented 6 years ago

Ah ok, I just saw this code:

standard_csv <- function(values) {
...

  is_date <- vapply(values, function(x) inherits(x, "Date"), logical(1))
  values[is_date] <- lapply(values[is_date], function(x) as.numeric(as.POSIXct(x)))
...
}

so looks like dates are changed to numerics in that line. I took that out, try it now.

RhysJackson commented 6 years ago

Amazing, that's done the trick.

Thank you so much. Strangely assert_that() couldn't be found, despite being a dependency. But explicitly loading it fixed the problem.