ak-coram / cl-duckdb

Common Lisp CFFI wrapper around the DuckDB C API
https://github.com/ak-coram/cl-duckdb/blob/main/README.org
MIT License
36 stars 1 forks source link

String vectors as columns in a query? #37

Closed snunez1 closed 1 year ago

snunez1 commented 1 year ago

I am trying to create a database using the example in the README. The example is:

(ddb:initialize-default-connection) ; => #<DUCKDB::CONNECTION {10074E8BE3}>

;; Use vectors as columns in a query:
(let ((indexes (make-array '(10) :element-type '(unsigned-byte 8)
                                 :initial-contents '(1 2 3 4 5 6 7 8 9 10)))
      (primes (make-array '(10) :element-type '(unsigned-byte 8)
                                :initial-contents '(2 3 5 7 11 13 17 19 23 29))))
  (ddb:with-static-table ("primes" `(("i" . ,indexes)
                                     ("p" . ,primes)))
    (ddb:format-query "SELECT * FROM primes" nil)))

The code I'm trying to create should 'round trip' a data frame:

  1. Create a database and populate it from an alist
  2. Query with select * and get the original values back

In lisp-stat, the code is:

(ddb:with-static-table ("mtcars" (as-alist mtcars))
  (ddb:format-query "SELECT * from mtcars" nil))

mtcars is a common data set from R and loaded by default in lisp-stat. The first column is the model of the car, a string. However when the alist contains a vector of strings, I get:

#("Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive"
  "Hornet Sportabout" "Valiant" "Duster 360" "Merc 240D"
  "Merc 230" "Merc 280" "Merc 280C" "Merc 450SE" "Merc 450SL"
  "Merc 450SLC" "Cadillac Fleetwood" "Lincoln Continental"
  "Chrysler Imperial" "Fiat 128" "Honda Civic" "Toyota Corolla"
  "Toyota Corona" "Dodge Challenger" "AMC Javelin" "Camaro Z28"
  "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2" "Lotus Europa"
  "Ford Pantera L" "Ferrari Dino" "Maserati Bora"
  "Volvo 142E") fell through ETYPECASE expression.
Wanted one of ((SIMPLE-ARRAY BIT)
               (SIMPLE-ARRAY (UNSIGNED-BYTE 8))
               (SIMPLE-ARRAY (SIGNED-BYTE 8))
               (SIMPLE-ARRAY (UNSIGNED-BYTE 16))
               (SIMPLE-ARRAY (SIGNED-BYTE 16))
               (SIMPLE-ARRAY (UNSIGNED-BYTE 32))
               (SIMPLE-ARRAY (SIGNED-BYTE 32))
               (SIMPLE-ARRAY (UNSIGNED-BYTE 64))
               (SIMPLE-ARRAY (SIGNED-BYTE 64))
               (SIMPLE-ARRAY SINGLE-FLOAT)
               (SIMPLE-ARRAY DOUBLE-FLOAT)).
   [Condition of type SB-KERNEL:CASE-FAILURE]

with the source of the error being:

Backtrace:
  0: (DUCKDB-API::MAKE-STATIC-COLUMN MODEL #("Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" "Hornet Sportabout" "Valiant" ...) NIL :LENGTH NIL)
      Locals:
        #:G0 = NIL
        #:G46 = NIL
        NAME = MODEL
        VALUES = #("Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" "Hornet Sportabout" "Valiant" ...)
  1: (DUCKDB-API:MAKE-STATIC-COLUMNS ((MODEL . #("Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" "Hornet Sportabout" "Valiant" ...)) (MPG . #(21 21 22.8d0 21.4d0 18.7d0 18.1d0 ...)) (CYL . #(6 6 ..
  2: ((LAMBDA ()))
  3: (SB-INT:SIMPLE-EVAL-IN-LEXENV (DUCKDB:WITH-STATIC-TABLE ("mtcars" (AS-ALIST MTCARS)) (DUCKDB:FORMAT-QUERY "SELECT * from mtcars" NIL)) #<NULL-LEXENV>)
  4: (EVAL (DUCKDB:WITH-STATIC-TABLE ("mtcars" (AS-ALIST MTCARS)) (DUCKDB:FORMAT-QUERY "SELECT * from mtcars" NIL)))

Any ideas?

ak-coram commented 1 year ago

Sorry, I didn't get around to implementing string vectors for static-tables yet, but a list of strings should work in the meantime.

You can find the currently supported types for both vectors and lists here: https://github.com/ak-coram/cl-duckdb#querying-lisp-vectors-and-lists-as-table-columns

ak-coram commented 1 year ago

Here's an example that uses a list of strings:

(ddb:with-static-table ("numbers" `(("i" . (,(loop :for i :below 10
                                                   :collect (format nil "~R" i))
                                            :duckdb-varchar))))
  (ddb:query "SELECT * FROM numbers" nil))
ak-coram commented 1 year ago

It shouldn't be too hard to process vectors that aren't any of the supported specialized types in a similar fashion to a list (but as with lists, you'll have to specify the column type for DuckDB).

snunez1 commented 1 year ago

Lists are terribly inefficient for large columns. So much so that Lisp-Stat doesn't even have an export option for columns-as-lists. What would it take to get string vectors into static tables? Encoding factors (categorical variables) as strings is rather common.

ak-coram commented 1 year ago

It wouldn't be hard at all, but you'd have to specify the DuckDB column type along with the vector (:duckdb-varchar in this case).

ak-coram commented 1 year ago

There are some other types where I'm not sure it can be done at all:

snunez1 commented 1 year ago

Ugh. Dates & times would be nice, the other types not really necessary.

Of course it can be done, looping across columns, gathering types, converting to vectors, etc., but it's rather frustrating to be so close to a seamless integration via a-list conversion, but not quite there.

ak-coram commented 1 year ago

PR #38 should allow you to use a vector of strings instead of the list (I'll merge once the CI is finished):

(ddb:with-static-table ("numbers" `(("i" . (,(make-array '(3) :initial-contents '("One" "Two" "Three"))
                                            :duckdb-varchar))))
  (ddb:query "SELECT * FROM numbers" nil))
snunez1 commented 1 year ago

Thank you! Is there a way to auto-detect the string type, so as to avoid the need for :duckdb-varchar parameter?

ak-coram commented 1 year ago

We could maybe start looking at the values in the vector, but I don't think that's a good idea for the driver to do (I don't think there's a nice, general way to make this work for every use case). It's not needed for the specialized vector types because for those we can infer the column type.

snunez1 commented 1 year ago

Perhaps I am not knowledgeable enough on duckdb vectors to understand, but it does appear to have a string vector type?

ak-coram commented 1 year ago

Yes, this is the target format we copy our values into (via some helper functions in the C API). The issue is on the CL side of things: when the vector itself is not specialized (i.e. via the :element-type keyword argument to make-array) we don't know that we need to deal with strings without looking at individual values in the vector (many of which could be nil). Also there are edge cases where you have a vector filled completely with nil: DuckDB still would need a column type for that.

If you want to implement a form of column type detection based on what values you have in your columns then you can do that already with the existing interface. As I mentioned I don't think we can do this in a way that works reasonably for every use case, so I don't think we should attempt it.

snunez1 commented 1 year ago

I see. In that case I'd lean toward having string as the default, if none of the other types were detected. Would that work as a practical, if imperfect, solution?

ak-coram commented 1 year ago

I think this might cause more issues than it would solve: for example the column type suddenly changing because there are no values in a vector might easily trip people up and cause errors in queries that otherwise worked fine. I'm not against implementing auto-detection, but I think it should be left out of the driver itself. If you have a narrower use case (e.g. integrating with a specific data-frame library), you should be able to do a better job at this.

ak-coram commented 1 year ago

Ugh. Dates & times would be nice, the other types not really necessary.

@snunez1, I thought I'd mention this here too: these should now also work (see #43)