Closed snunez1 closed 1 year ago
Sorry, I didn't get around to implementing string vectors for static-tables yet, but a list of strings should work in the meantime.
You can find the currently supported types for both vectors and lists here: https://github.com/ak-coram/cl-duckdb#querying-lisp-vectors-and-lists-as-table-columns
Here's an example that uses a list of strings:
(ddb:with-static-table ("numbers" `(("i" . (,(loop :for i :below 10
                                                   :collect (format nil "~R" i))
                                            :duckdb-varchar))))
  (ddb:query "SELECT * FROM numbers" nil))
It shouldn't be too hard to process vectors that aren't any of the supported specialized types in a similar fashion to a list (but as with lists, you'll have to specify the column type for DuckDB).
Lists are terribly inefficient for large columns. So much so that Lisp-Stat doesn't even have an export option for columns-as-lists. What would it take to get string vectors into static tables? Encoding factors (categorical variables) as strings is rather common.
It wouldn't be hard at all, but you'd have to specify the DuckDB column type along with the vector (:duckdb-varchar in this case).
There are some other types where I'm not sure it can be done at all:
Ugh. Dates & times would be nice, the other types not really necessary.
Of course it can be done, looping across columns, gathering types, converting to vectors, etc., but it's rather frustrating to be so close to a seamless integration via a-list conversion, but not quite there.
PR #38 should allow you to use a vector of strings instead of the list (I'll merge once the CI is finished):
(ddb:with-static-table ("numbers" `(("i" . (,(make-array '(3) :initial-contents '("One" "Two" "Three"))
                                            :duckdb-varchar))))
  (ddb:query "SELECT * FROM numbers" nil))
Thank you! Is there a way to auto-detect the string type, so as to avoid the need for the :duckdb-varchar parameter?
We could maybe start looking at the values in the vector, but I don't think that's a good idea for the driver to do (I don't think there's a nice, general way to make this work for every use case). It's not needed for the specialized vector types because for those we can infer the column type.
Perhaps I am not knowledgeable enough on duckdb vectors to understand, but it does appear to have a string vector type?
Yes, this is the target format we copy our values into (via some helper functions in the C API). The issue is on the CL side of things: when the vector itself is not specialized (i.e. via the :element-type keyword argument to make-array), we don't know that we need to deal with strings without looking at individual values in the vector (many of which could be nil). Also, there are edge cases where you have a vector filled completely with nil: DuckDB would still need a column type for that.
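To illustrate the point about unspecialized vectors, here is a small sketch in plain Common Lisp (no cl-duckdb calls). The variable names are made up for the example, and double-float is only used as an example of a specialized element type; check the README list linked above for which specialized types the driver actually accepts.

```lisp
;; A vector of strings created without :element-type is a general vector.
(defparameter *strings*
  (make-array 3 :initial-contents '("One" "Two" "Three")))

;; A vector filled completely with NIL looks exactly the same from the outside.
(defparameter *all-nil*
  (make-array 3 :initial-element nil))

;; A specialized vector carries its element type with it.
(defparameter *doubles*
  (make-array 3 :element-type 'double-float
                :initial-contents '(1d0 2d0 3d0)))

(array-element-type *strings*)  ; => T
(array-element-type *all-nil*)  ; => T
;; Typically DOUBLE-FLOAT, although the upgraded element type is
;; implementation-dependent.
(array-element-type *doubles*)
```

Since both general vectors report T, the only way to tell them apart is to look at the individual elements, and for the all-NIL vector even that gives no answer.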
If you want to implement a form of column type detection based on what values you have in your columns then you can do that already with the existing interface. As I mentioned I don't think we can do this in a way that works reasonably for every use case, so I don't think we should attempt it.
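As a rough sketch of what such user-side detection might look like: the helpers below (string-column-p, annotate-column and annotate-columns are hypothetical names, not part of cl-duckdb) walk an alist of columns and attach :duckdb-varchar to general vectors whose non-NIL values are all strings. The only thing assumed about the driver is the (vector type) column form used in the examples earlier in this thread.

```lisp
(defun string-column-p (vector)
  "True if VECTOR holds at least one string and nothing but strings and NILs."
  (and (some #'stringp vector)
       (every (lambda (x) (or (null x) (stringp x))) vector)))

(defun annotate-column (column)
  "COLUMN is a (name . values) pair. If VALUES is a general vector of strings,
return (name values :duckdb-varchar); otherwise return COLUMN unchanged."
  (destructuring-bind (name . values) column
    (if (and (vectorp values)
             (eq (array-element-type values) t) ; skip specialized vectors
             (string-column-p values))
        (cons name (list values :duckdb-varchar))
        column)))

(defun annotate-columns (alist)
  "Apply ANNOTATE-COLUMN to every column in ALIST."
  (mapcar #'annotate-column alist))

;; Intended use (untested against the driver):
;; (ddb:with-static-table ("cars" (annotate-columns my-alist))
;;   (ddb:query "SELECT * FROM cars" nil))
```

Note that this sketch deliberately leaves all-NIL and mixed-content vectors untouched, so the caller still has to decide what column type those should get.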
I see. In that case I'd lean toward having string as the default, if none of the other types were detected. Would that work as a practical, if imperfect, solution?
I think this might cause more issues than it would solve: for example the column type suddenly changing because there are no values in a vector might easily trip people up and cause errors in queries that otherwise worked fine. I'm not against implementing auto-detection, but I think it should be left out of the driver itself. If you have a narrower use case (e.g. integrating with a specific data-frame library), you should be able to do a better job at this.
Ugh. Dates & times would be nice, the other types not really necessary.
@snunez1, I thought I'd mention this here too: these should now also work (see #43)
I am trying to create a database using the example in the README. The example is:
The code I'm trying to create should 'round trip' a data frame:
In lisp-stat, the code is:
mtcars is a common data set from R and loaded by default in lisp-stat. The first column is the model of the car, a string. However, when the alist contains a vector of strings, I get an error, with the source of the error being:
Any ideas?