duckdb / duckdb-r

The duckdb R package
https://r.duckdb.org/
Other
129 stars 23 forks source link

Error: Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update #12

Open ajdamico opened 1 year ago

ajdamico commented 1 year ago

hi, i'm still hitting this error on both the CRAN version and also the duckdb_0.8.1-9000 dev version..

# using github
remotes::install_github("duckdb/duckdb-r")

# minimal reproducible example
library(duckdb)
con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )
my_df <- structure(list(no_municipio_esc = "Est\xe2ncia", no_municipio_prova = "Est\xe2ncia"), row.names = 16L, class = "data.frame")
dbWriteTable( con , 'my_table' , my_df )

console output:

> library(duckdb)
Loading required package: DBI
> con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )
> my_df <- structure(list(no_municipio_esc = "Est\xe2ncia", no_municipio_prova = "Est\xe2ncia"), row.names = 16L, class = "data.frame")
> dbWriteTable( con , 'my_table' , my_df )
Error: rapi_execute: Failed to run query
Error: Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update
In addition: Warning message:
Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this. 
> 
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] duckdb_0.8.1-9000 DBI_1.1.3        

loaded via a namespace (and not attached):
[1] compiler_4.3.1 tools_4.3.1   
> 
Tmonster commented 1 year ago

Thanks for the bug report. This indeed seems to be an issue with how our internal engine reads strings from within the R string pool. Hopefully we can get a fix in soon!

krlmlr commented 11 months ago

Thanks. Confirmed on macOS. I'm not sure it's an error though. You need valid UTF-8 strings, and DuckDB is more picky about it than R.

library(duckdb)
#> Loading required package: DBI
con <- dbConnect(duckdb::duckdb(), dbdir = "my-db.duckdb")
my_df <- structure(list(no_municipio_esc = "Est\xe2ncia", no_municipio_prova = "Est\xe2ncia"), row.names = 16L, class = "data.frame")
dbWriteTable(con, "my_table", my_df)
#> Error: rapi_execute: Failed to run query
#> Error: Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update

dbGetQuery(con, "SELECT 'Est\xe2ncia' AS x")
#> Error: Invalid unicode (byte sequence mismatch) detected in value construction
dbGetQuery(con, iconv("SELECT 'Est\xe2ncia' AS x", from = "latin1"))
#>          x
#> 1 Estância

Created on 2023-12-02 with reprex v2.0.2

ajdamico commented 11 months ago

hi @krlmlr thanks for taking the time to look at this! i feel like data like this is pretty common, i wonder if you'd be willing to weigh in on how R users might approach this issue? for example, is the strategy below a good starting point for R users trying to import non-UTF 8 strings into duckdb?

# download and import official brazilian microdata #

library(httr)
library(archive)
library(readr)
library(duckdb)

tf <- tempfile()

this_url <- "https://download.inep.gov.br/microdados/microdados_enem_2022.zip"

GET( this_url , write_disk( tf ) , progress() )

archive_extract( tf , dir = tempdir() )

enem_fns <- list.files( tempdir() , recursive = TRUE , full.names = TRUE )

enem_fn <- grep( "MICRODADOS_ENEM_([0-9][0-9][0-9][0-9])\\.csv$" , enem_fns , value = TRUE )

enem_tbl <- read_csv2( enem_fn )

enem_df <- data.frame( enem_tbl )

names( enem_df ) <- tolower( names( enem_df ) )

# fails to import as-is, but succeeds after conversion #

converted_enem_df <- enem_df

converted_enem_df[ ] <-
    lapply( 
        enem_df , 
        function( x ){
            if( class( x ) == 'character' ){
                iconv( x , from = "latin1" )
            } else x
        }
    )

con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )

# fails
dbWriteTable( con , 'enade' , enade_df )

# works
dbWriteTable( con , 'enade' , converted_enem_df )

thanks a lot!!

krlmlr commented 11 months ago

Thanks. Cleaning up the encoding should really happen when reading the CSV file. With readr, see https://readr.tidyverse.org/articles/locales.html .

ajdamico commented 11 months ago

thank you!!

eli-daniels commented 4 weeks ago

I've found that data can have been 'corrupted' downstream, meaning even if it was imported as UTF-8 this error message can occur. In which case @ajdamico's approach worked, I used dat <- dat |> mutate(across(where(is.character), stringi::stri_enc_tonative)). I'm unable to share the data, but hope this helps someone in future