ajdamico opened this issue 1 year ago
Thanks for the bug report. This indeed seems to be an issue with how our internal engine reads strings from within the R string pool. Hopefully we can get a fix in soon!
Thanks. Confirmed on macOS. I'm not sure it's an error though. You need valid UTF-8 strings, and DuckDB is more picky about it than R.
library(duckdb)
#> Loading required package: DBI
con <- dbConnect(duckdb::duckdb(), dbdir = "my-db.duckdb")
my_df <- structure(list(no_municipio_esc = "Est\xe2ncia", no_municipio_prova = "Est\xe2ncia"), row.names = 16L, class = "data.frame")
dbWriteTable(con, "my_table", my_df)
#> Error: rapi_execute: Failed to run query
#> Error: Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update
dbGetQuery(con, "SELECT 'Est\xe2ncia' AS x")
#> Error: Invalid unicode (byte sequence mismatch) detected in value construction
dbGetQuery(con, iconv("SELECT 'Est\xe2ncia' AS x", from = "latin1"))
#> x
#> 1 Estância
Created on 2023-12-02 with reprex v2.0.2
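A minimal sketch (not part of the reprex above) of fixing the data frame itself before dbWriteTable(), assuming the raw bytes really are latin1: declare the encoding, then translate to UTF-8 with base R.
my_df_utf8 <- my_df
chr_cols <- vapply(my_df_utf8, is.character, logical(1))
my_df_utf8[chr_cols] <- lapply(my_df_utf8[chr_cols], function(x) {
  # declare the bytes as latin1, then translate to UTF-8 so duckdb accepts them
  Encoding(x) <- "latin1"
  enc2utf8(x)
})
dbWriteTable(con, "my_table", my_df_utf8, overwrite = TRUE)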
hi @krlmlr thanks for taking the time to look at this! i feel like data like this is pretty common, so i wonder if you'd be willing to weigh in on how R users might approach it? for example, is the strategy below a good starting point for R users trying to import non-UTF-8 strings into duckdb?
# download and import official brazilian microdata #
library(httr)
library(archive)
library(readr)
library(duckdb)
tf <- tempfile()
this_url <- "https://download.inep.gov.br/microdados/microdados_enem_2022.zip"
GET( this_url , write_disk( tf ) , progress() )
archive_extract( tf , dir = tempdir() )
enem_fns <- list.files( tempdir() , recursive = TRUE , full.names = TRUE )
enem_fn <- grep( "MICRODADOS_ENEM_([0-9][0-9][0-9][0-9])\\.csv$" , enem_fns , value = TRUE )
enem_tbl <- read_csv2( enem_fn )
enem_df <- data.frame( enem_tbl )
names( enem_df ) <- tolower( names( enem_df ) )
# fails to import as-is, but succeeds after conversion #
converted_enem_df <- enem_df
converted_enem_df[ ] <-
    lapply(
        enem_df ,
        function( x ){
            # re-encode character columns from latin1 to UTF-8 so duckdb accepts them
            if( is.character( x ) ){
                iconv( x , from = "latin1" , to = "UTF-8" )
            } else x
        }
    )
con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )
# fails with the invalid unicode error
dbWriteTable( con , 'enem' , enem_df )
# works
dbWriteTable( con , 'enem' , converted_enem_df , overwrite = TRUE )
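as a quick sanity check (a sketch, not part of the original script), base R's validUTF8() can confirm the converted character columns really are valid UTF-8 before writing:
# TRUE means every non-NA string in every character column is valid UTF-8
chr_cols <- sapply( converted_enem_df , is.character )
all( sapply( converted_enem_df[ chr_cols ] , function( x ) all( validUTF8( x[ !is.na( x ) ] ) ) ) )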
thanks a lot!!
Thanks. Cleaning up the encoding should really happen when reading the CSV file. With readr, see https://readr.tidyverse.org/articles/locales.html .
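A minimal sketch of that suggestion, assuming the ENEM CSV really is latin1-encoded: pass the encoding to readr's locale() so the strings arrive as UTF-8 and no later iconv() step is needed.
library(readr)
# readr decodes from the declared input encoding and stores the result as UTF-8
enem_tbl <- read_csv2(enem_fn, locale = locale(encoding = "latin1"))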
thank you!!
I've found that data can end up 'corrupted' downstream, meaning this error message can occur even if the data were originally imported as UTF-8. In that case @ajdamico's approach worked; I used dat <- dat |> mutate(across(where(is.character), stringi::stri_enc_tonative)).
I'm unable to share the data, but I hope this helps someone in the future.
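A hedged variation on that one-liner, swapping in stri_enc_toutf8() for stri_enc_tonative() (since duckdb expects UTF-8 rather than the session's native encoding) and using stri_enc_isutf8() to spot the offending columns first:
library(dplyr)
library(stringi)
# flag character columns containing byte sequences that are not valid UTF-8
bad_cols <- sapply(dat, function(x) is.character(x) && !all(stri_enc_isutf8(x), na.rm = TRUE))
names(dat)[bad_cols]
# force every character column to UTF-8 before handing the data to duckdb
dat <- dat |> mutate(across(where(is.character), stri_enc_toutf8))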
hi, i'm still hitting this error on both the CRAN version and also the duckdb_0.8.1-9000 dev version. console output: