duckdb / duckdb-r

The duckdb R package
https://r.duckdb.org/
Other
106 stars 21 forks source link

Request for duckdb_read_csv2 function and clarification on data types in duckdb_read_csv #118

Open VincentGuyader opened 3 months ago

VincentGuyader commented 3 months ago

Hello everyone,

I have encountered an issue while trying to import a CSV file using DuckDB in R. I am using the duckdb_read_csv function to read the CSV file into a DuckDB database. However, I noticed that the function does not accurately detect the data types of the columns.

library(DBI)
library(duckdb)
library(readr)

con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:", read_only = FALSE)
readr::write_csv2(iris, "iris.csv")
duckdb::duckdb_read_csv(con,
                        name = "iris_data1",
                        files = 'iris.csv',
                        sep = ";",
                        dec = "."
)
iris_db1 <- dbReadTable(con, "iris_data1")
is.numeric(iris_db1$Sepal.Length)

The issue arises when I try to check whether a column is numeric using is.numeric. Even though the column 'Sepal.Length' should be numeric, the result of is.numeric(iris_db1$Sepal.Length) is FALSE.

I believe one potential solution to this issue would be to have a new function duckdb_read_csv2 that allows users to specify column data types explicitly or to improve the data type detection in duckdb_read_csv.

Therefore, I would like to request the addition of a duckdb_read_csv2 function with enhanced data type detection capabilities or the improvement of data type detection in the existing duckdb_read_csv function. it seems that the parameters delim, dec, and sep are not being recognized by the duckdb_read_csv function

Regards

krlmlr commented 3 months ago

Thanks. This is what I see:

library(DBI)
library(duckdb)
library(readr)

con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:", read_only = FALSE)
readr::write_csv2(iris, "iris.csv")
duckdb::duckdb_read_csv(con, name = "iris_data1", files = "iris.csv")
#> Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
iris_db1 <- dbReadTable(con, "iris_data1")
#> Error: rapi_prepare: Failed to prepare query SELECT * FROM iris_data1
#> Error: Catalog Error: Table with name iris_data1 does not exist!
#> Did you mean "temp.information_schema.schemata"?
#> LINE 1: SELECT * FROM iris_data1
#>                       ^
iris_db1
#> Error in eval(expr, envir, enclos): object 'iris_db1' not found

Created on 2024-03-23 with reprex v2.1.0

Can you please fix the example?

VincentGuyader commented 3 months ago

thanks. I have edited the Reprex

krlmlr commented 3 months ago

Thanks, I see it now.

The decimal separator seems to be fixed to "." in DuckDB's COPY TO, see https://duckdb.org/docs/sql/statements/copy.html. We'll need to support that in the DuckDB core before being able to do anything here.

Your best bet may be to use more mundane methods to convert the CSV data to Parquet, and to use that from DuckDB.

krlmlr commented 2 months ago

I now see in https://duckdb.org/docs/data/csv/overview that decimal_separator is a supported setting. Would you like to contribute a PR?