Lisp-Stat / data-frame

Data frames for Common Lisp
https://lisp-stat.github.io/data-frame
Microsoft Public License
26 stars 4 forks source link

Dataframes from PostgresQL queries #28

Open nathanvy opened 1 week ago

nathanvy commented 1 week ago

Hi,

In my application I would like to query a Postgres database using postmodern and then perform some analysis with lisp-stat. postmodern returns data as a list of rows, each of those a list. For example for a table FOO with columns id and bar, a select statement might return:

select * from FOO;

   id   |  bar
--------+--------
   1    |  blah
   2    |  adsf
   3    |  jkjlkj
   ...

And in lisp that looks like:

'((1 "blah")
  (2 "asdf")
  (3 "jklkj")
  ... 
 )

I would like to get the list that postmodern returns into a lisp-stat data frame. Is there a convenient way to do this? Could I just walk the list from postgres using loop and then build up a dataframe row-by-row?

snunez1 commented 1 week ago

The way to do that is to use vector-row-reader. It was created specifically for this purpose, but doesn't appear to have made it into the cl-postgres documentation. If it does what you like, a pull request in the docs will be greatly appreciated (either in cl-postgres and/or here in data-frames. If in data frames, send me the paragraphs and I'll insert so you don't need the whole doc toolchain.)

See marijnh/postmodern#272 for the discussion that led to vector-row-reader. I see there is an alist-row-reader that might be worth checking out.

nathanvy commented 1 week ago

Thanks, I'll check that out ASAP and get back to you.

snunez1 commented 6 days ago

That's what, I think, the pivot table is for (refer to the issue in Postmodern). Pivot rows->columns, then use vector-row-reader to get a vector of column vectors. From there you can use the data-frame constructors.

nathanvy commented 6 days ago

(Sorry, deleted my previous reply, probably shouldn't have)

I don't think pivoting does what I need; it's not just rows->columns.

You can see from the crosstab tutorial that it's creating columns based on the number of distinct values in a particular column of the original data set. That's valid for some use cases but my data is already in postgres exactly how I want it represented in the data-frame so a pivot operation will actually distort things.

For clarity: My data is financial OHLCV data, so what I really want is a column vector for each of Timestamp/Open/High/Low/Close/Volume. If I pivot on any of those it'll just make things a mess. I'm starting to think I will have to massage it by hand in CL; not a huge deal, it's just a lot of allocations I was hoping to avoid.

snunez1 commented 6 days ago

This just 'feels' like it should be simpler than we're making it. Before doing it by hand in CL, perhaps re-open the issue in the Post-modern repo. Sabre and I talked about this before and he seems to think it's easy. He's a helpful guy.

nathanvy commented 6 days ago

This just 'feels' like it should be simpler than we're making it.

It definitely does, but SQL is fundamentally row-based so I'm starting to think it might not be possible unless there's a Postgres feature I'm not aware of.

In case anyone else lands here via google, here's what I'm using as a workaround in the interim:

;; wherein postmodern, via the :vectors results style, is returning
;; a vector-of-vectors where each vector/row is of the form:
;; #(id date-time-group open high low close volume)
(defun query-to-df (ohlcv-data)
  (let ((n (length ohlcv-data)))
    (let ((open (make-array n))
          (high (make-array n))
          (low (make-array n))
          (close (make-array n))
          (dtg (make-array n))
          (keys #(:dtg :open :high :low :close)))
      (loop for row across ohlcv-data
            for i from 0
            do (progn
                 (setf (elt open i) (elt row 2))
                 (setf (elt high i) (elt row 3))
                 (setf (elt low i) (elt row 4))
                 (setf (elt close i) (elt row 5))
                 (setf (elt dtg i) (elt row 1))))
      (df:make-df keys (list dtg open high low close)))))
sabracrolleton commented 5 days ago

Curious whether this works for you.

Assume ohlcv-data is returned from a query that looks like (with relevant where clause):

(pomo:query "select dtg, open, high, low, close from ohlcv")

Is the following equivalent to your function above?

(defun query-to-df (ohlcv-data)
   (let ((keys #(:dtg :open :high :low :close)))
       (df:make-df keys (loop for x in (apply #'mapcar #'list ohlcv-data)
                                                   collect (coerce x 'vector)))))
snunez1 commented 5 days ago

@sabracrolleton, is there a way to get the results of a query as a row-major array? If so then aops:each-index could be used to transpose the matrix:

(defparameter A #2A((1 2)
                    (3 4)))
;; Transpose
(aops:each-index (i j) (aref A j i)) ; => #2A((1 3)
                                     ;        (2 4))
sabracrolleton commented 5 days ago

I have to admit I am getting a little confused between when you are talking sets (sql results) and arrays. Postgresql arrays can only have a single data type and it is not clear to me that the dtg column is the same data type as the open, high, low, close columns.

(query "select dtg, open, high, low, close from ohlcv" :vectors)

is going to return (just using placeholders here)

#(#(dtg1 open1 high1 low1 close1) 
  #(dtg2 open2 high2 low2 close2)
  #(dtg3 open3 high3 low3 close3) 
  #(dtg4 open4 high4 low4 close4))

but the work was done in the Postmodern cl-postgres::vector-row-reader function, not in Postgresql and I do not think it is in the order @nathanvy wants. My query-to-df function suggestion above obviously does not use the vector-row-reader function, it just transposes the rows first, then coerces them into vectors. I think it does the same thing as @nathanvy 's function, but I did not test the efficiency of either.

In terms of thinking about what Postgresql's built in functions could do, if dtg was the same data type as the others, then Postgresql's array_agg function looking something like:

(query "select array_agg(array[dtg, open,high,low,close]) from ohlcv")

would return something like:

    ((#2A((dtg1 open1 high1 low1 close1) 
          (dtg2 open2 high2 low2 close2)
          (dtg3 open3 high3 low3 close3) 
          (dtg4 open4 high4 low4 close4))))

but I do not think that is in the order you want.

If dtg is not the same data type as open, high, low, close, then it cannot be in the Postgresql array. So something like:

(query "select dtg, array_agg(array[open,high,low,close]) from ohlcv group by dtg")

would result in something that is even less what you want:

    ((dtg1
      #2A((open1 high1 low1 close1)))
     (dtg2
      #2A((open2 high2 low2 close2)))
     (dtg3
      #2A((open3 high3 low3 close3)))
     (dtg4
      #2A((open4 high4 low4 close4))))

You might think about putting open, high, low and close into a Postgresql array (changing the database table structure), but that would have its own pluses and minuses.

I am not sure whether any of this is helpful or if I have just confused things more.

snunez1 commented 5 days ago

Ah, I seem to remember the homogeneous type requirement being an issue in previous discussions. Perhaps this is the right track then for a generic table->data-frame converter. Loop over all the column names and get them one by one as vectors. It's rare that a data-frame, in an analytics scenario, will have all the same data types.

nathanvy commented 5 days ago

Okay, weighing in on mobile so please excuse formatting.

  1. I appreciate both of you giving this time and attention, thank you.

  2. I'm an ex military guy and it's just a habit that's hard to break: DTG stands for Date/Time Group and in my application it's a time stamp from the local-time library. Open/high/low/close are double precision floats (I know I know, don't use floats to represent money, but I am in the prototype phase so a little imprecision doesn't bother me), and volume is whatever postgres supplies for a bigint type. The example I gave might not include volume because on index data there's no volume since it's just a number.

  3. For clarity: My understanding is that make-dfing a data frame requires each column to be one single lisp vector, which is at odds with how most (all?) SQL servers return results.

  4. It is also my understanding that there is no way to get Postgres to return the data in that format, and/or there's no pre-existing function somewhere hiding in lisp-stat or postmodern that will allow a data frame to directly ingest the results returned from postgres.

Do I have that about right?

If my understanding is correct then I'm happy to play with the profiler and find a computationally cheap way to massage the data from one format to the other. No problem.

nathanvy commented 5 days ago

Curious whether this works for you.

Assume ohlcv-data is returned from a query that looks like (with relevant where clause):

(pomo:query "select dtg, open, high, low, close from ohlcv")

Is the following equivalent to your function above?

(defun query-to-df (ohlcv-data)
   (let ((keys #(:dtg :open :high :low :close)))
       (df:make-df keys (loop for x in (apply #'mapcar #'list ohlcv-data)
                                                   collect (coerce x 'vector)))))

This doesn't eval properly for me. Something about inappropriate usage of values-list. I'm using the :vectors results style so that might be the source of that. I haven't really investigated further.

sabracrolleton commented 5 days ago

@nathanvy your understanding is correct.

With respect to my alternative query-to-df function. It does NOT use the :vectors result style.

(let ((results (pomo:query "select dtg, open, high, low, close from ohlcv")))
     (query-to-df results))
nathanvy commented 5 days ago

Roger that. I suppose we can close this issue now. Thanks both of you.

snunez1 commented 4 days ago

One more way this could work using the vector-of-vectors returned by pomo:

  1. Combine the vectors into an array
  2. Transpose that array
  3. Make a data frame

To create a matrix from a vector of vectors:

(aops:combine #(#(0 1) #(2 3))) ; => #2A((0 1)
                                ;        (2 3))

transpose can be done with for-each:

(defparameter A #2A((1 2)
                    (3 4)))
;; Transpose
(aops:each-index (i j) (aref A j i)) ; => #2A((1 3)
                                     ;        (2 4))

and then make-df:

(matrix-df #(:a :b) #2A((1 2)
                    (3 4)))
;#<DATA-FRAME (2 observations of 2 variables)>

I don't know how efficient that will be, but it's convenient.

@nathanvy, I'd appreciate a report on how well it works when you decide on a method and know it's efficiency. I want to update the data-frame documentation with retrieving data from Postmodern. I suspect this is going to be generally useful.

nathanvy commented 4 days ago

Will that work with heterogenous data types? It doesn't seem to work for me.

Here's a csv of the type of data I'm working with. The columns are, in order:

  1. A local-time timestamp, which to handle properly requires (local-time:set-local-time-cl-postgres-readers)
  2. Open, a double-precision float
  3. Close, a double-precision float
  4. High, a double-precision float
  5. Low, a double-precision float
  6. Volume, a bigint (in this case null because I just happen to be using the Nasdaq index for prototyping)
snunez1 commented 4 days ago

It should work with heterogeneous data types. What error are you encountering?

nathanvy commented 4 days ago
(defun query-to-df-2 (ohlcv-data)
  (let ((arr (aops:combine ohlcv-data))
        (keys #(:dtg :open :high :low :close)))
    (df:matrix-df keys (aops:each-index (i j) (aref arr j i)))))

(let* ((the-date (local-time:encode-timestamp 0 0 0 0 1 6 2024))
                (data (pomo:with-connection (list (getf *config* :db-name)
                                                 (getf *config* :db-username)
                                                 (getf *config* :db-pw)
                                                 (getf *config* :db-hostname))
                       (pomo:query (format nil "select dtg, open, high, low, close from ndx where dtg > '~a' and dtg < '~a' order by dtg asc" the-date (local-time:now)) :vectors))))
           (query-to-df-2 data))
The value
  0
is not of type
  ARRAY
   [Condition of type TYPE-ERROR]

Restarts:
 0: [RETRY] Retry SLY mREPL evaluation request.
 1: [*ABORT] Return to SLY's top level.
 2: [ABORT] abort thread (#<THREAD "sly-channel-1-mrepl-remote-1" RUNNING {70066C0003}>)

Backtrace:
 0: (ARRAY-DIMENSIONS 0)
 1: (ARRAY-OPERATIONS/UTILITIES:SAME-DIMENSIONS-P #(0 0 0 0 0) 0)
 2: (ARRAY-OPERATIONS/DISPLACING:COPY-INTO #(0 0 0 0 0) 0)
 3: (ARRAY-OPERATIONS/DISPLACING:COMBINE #(#(3926395800 18666.72d0 18667.2d0 18652.6d0 18656.83d0) #(3926395860 18655.92d0 18661.23d0 18649.92d0 18655.89d0) #(3926395920 18655.39d0 18679.04d0 18655.0d0 18..
[...]
snunez1 commented 4 days ago

That basically looks correct. As a next step I'd try capturing the output of pomo:query in a variable and then seeing what happens when combine is called on it. Are all the vectors of the same length in whats returned from the query?

Edit: I've just tested a round trip from a data-frame to a vector-of-vectors and back:

LS-USER> mtcars
#<DATA-FRAME (32 observations of 12 variables)
Motor Trend Car Road Tests>

LS-USER> (as-array mtcars)
#2A(("Mazda RX4" 21 6 160 110 3.9D0 2.62D0 16.46D0 0 1 4 4) ("Mazda RX4 Wag" 21 6 160 110 3.9D0 2.875D0 17.02D0 0 1 4 4) ("Datsun 710" 22.8D0 4 108 93 3.85D0 2.32D0 18.61D0 1 1 4 1) ("Hornet 4 Drive" 21.4D0 6 258 110 3.08D0 3.215D0 19.44D0 1 0 3 1) ("Hornet Sportabout" 18.7D0 8 360 175 3.15D0 3.44D0 17.02D0 0 0 3 2) ("Valiant" 18.1D0 6 225 105 2.76D0 3.46D0 20.22D0 1 0 3 1) ("Duster 360" 14.3D0 8 360 245 3.21D0 3.57D0 15.84D0 0 0 3 4) ("Merc 240D" 24.4D0 4 146.7D0 62 3.69D0 3.19D0 20 1 0 4 2) ("Merc 230" 22.8D0 4 140.8D0 95 3.92D0 3.15D0 22.9D0 1 0 4 2) ("Merc 280" 19.2D0 6 167.6D0 123 3.92D0 3.44D0 18.3D0 1 0 4 4) ("Merc 280C" 17.8D0 6 167.6D0 123 3.92D0 3.44D0 18.9D0 1 0 4 4) ("Merc 450SE" 16.4D0 8 275.8D0 180 3.07D0 4.07D0 17.4D0 0 0 3 3) ("Merc 450SL" 17.3D0 8 275.8D0 180 3.07D0 3.73D0 17.6D0 0 0 3 3) ("Merc 450SLC" 15.2D0 8 275.8D0 180 3.07D0 3.78D0 18 0 0 3 3) ("Cadillac Fleetwood" 10.4D0 8 472 205 2.93D0 5.25D0 17.98D0 0 0 3 4) ("Lincoln Continental" 10.4D0 8 460 215 3 5.424D0 17.82D0 0 0 3 4) ("Chrysler Imperial" 14.7D0 8 440 230 3.23D0 5.345D0 17.42D0 0 0 3 4) ("Fiat 128" 32.4D0 4 78.7D0 66 4.08D0 2.2D0 19.47D0 1 1 4 1) ("Honda Civic" 30.4D0 4 75.7D0 52 4.93D0 1.615D0 18.52D0 1 1 4 2) ("Toyota Corolla" 33.9D0 4 71.1D0 65 4.22D0 1.835D0 19.9D0 1 1 4 1) ("Toyota Corona" 21.5D0 4 120.1D0 97 3.7D0 2.465D0 20.01D0 1 0 3 1) ("Dodge Challenger" 15.5D0 8 318 150 2.76D0 3.52D0 16.87D0 0 0 3 2) ("AMC Javelin" 15.2D0 8 304 150 3.15D0 3.435D0 17.3D0 0 0 3 2) ("Camaro Z28" 13.3D0 8 350 245 3.73D0 3.84D0 15.41D0 0 0 3 4) ("Pontiac Firebird" 19.2D0 8 400 175 3.08D0 3.845D0 17.05D0 0 0 3 2) ("Fiat X1-9" 27.3D0 4 79 66 4.08D0 1.935D0 18.9D0 1 1 4 1) ("Porsche 914-2" 26 4 120.3D0 91 4.43D0 2.14D0 16.7D0 0 1 5 2) ("Lotus Europa" 30.4D0 4 95.1D0 113 3.77D0 1.513D0 16.9D0 1 1 5 2) ("Ford Pantera L" 15.8D0 8 351 264 4.22D0 3.17D0 14.5D0 0 1 5 4) ("Ferrari Dino" 19.7D0 6 145 175 3.62D0 2.77D0 15.5D0 0 1 5 6) ("Maserati Bora" 15 8 301 335 3.54D0 3.57D0 14.6D0 0 1 5 8) ("Volvo 142E" 21.4D0 4 121 109 4.11D0 2.78D0 18.6D0 1 1 4 2))

LS-USER> (split * 1)
#(#("Mazda RX4" 21 6 160 110 3.9D0 2.62D0 16.46D0 0 1 4 4) #("Mazda RX4 Wag" 21 6 160 110 3.9D0 2.875D0 17.02D0 0 1 4 4) #("Datsun 710" 22.8D0 4 108 93 3.85D0 2.32D0 18.61D0 1 1 4 1) #("Hornet 4 Drive" 21.4D0 6 258 110 3.08D0 3.215D0 19.44D0 1 0 3 1) #("Hornet Sportabout" 18.7D0 8 360 175 3.15D0 3.44D0 17.02D0 0 0 3 2) #("Valiant" 18.1D0 6 225 105 2.76D0 3.46D0 20.22D0 1 0 3 1) #("Duster 360" 14.3D0 8 360 245 3.21D0 3.57D0 15.84D0 0 0 3 4) #("Merc 240D" 24.4D0 4 146.7D0 62 3.69D0 3.19D0 20 1 0 4 2) #("Merc 230" 22.8D0 4 140.8D0 95 3.92D0 3.15D0 22.9D0 1 0 4 2) #("Merc 280" 19.2D0 6 167.6D0 123 3.92D0 3.44D0 18.3D0 1 0 4 4) #("Merc 280C" 17.8D0 6 167.6D0 123 3.92D0 3.44D0 18.9D0 1 0 4 4) #("Merc 450SE" 16.4D0 8 275.8D0 180 3.07D0 4.07D0 17.4D0 0 0 3 3) #("Merc 450SL" 17.3D0 8 275.8D0 180 3.07D0 3.73D0 17.6D0 0 0 3 3) #("Merc 450SLC" 15.2D0 8 275.8D0 180 3.07D0 3.78D0 18 0 0 3 3) #("Cadillac Fleetwood" 10.4D0 8 472 205 2.93D0 5.25D0 17.98D0 0 0 3 4) #("Lincoln Continental" 10.4D0 8 460 215 3 5.424D0 17.82D0 0 0 3 4) #("Chrysler Imperial" 14.7D0 8 440 230 3.23D0 5.345D0 17.42D0 0 0 3 4) #("Fiat 128" 32.4D0 4 78.7D0 66 4.08D0 2.2D0 19.47D0 1 1 4 1) #("Honda Civic" 30.4D0 4 75.7D0 52 4.93D0 1.615D0 18.52D0 1 1 4 2) #("Toyota Corolla" 33.9D0 4 71.1D0 65 4.22D0 1.835D0 19.9D0 1 1 4 1) #("Toyota Corona" 21.5D0 4 120.1D0 97 3.7D0 2.465D0 20.01D0 1 0 3 1) #("Dodge Challenger" 15.5D0 8 318 150 2.76D0 3.52D0 16.87D0 0 0 3 2) #("AMC Javelin" 15.2D0 8 304 150 3.15D0 3.435D0 17.3D0 0 0 3 2) #("Camaro Z28" 13.3D0 8 350 245 3.73D0 3.84D0 15.41D0 0 0 3 4) #("Pontiac Firebird" 19.2D0 8 400 175 3.08D0 3.845D0 17.05D0 0 0 3 2) #("Fiat X1-9" 27.3D0 4 79 66 4.08D0 1.935D0 18.9D0 1 1 4 1) #("Porsche 914-2" 26 4 120.3D0 91 4.43D0 2.14D0 16.7D0 0 1 5 2) #("Lotus Europa" 30.4D0 4 95.1D0 113 3.77D0 1.513D0 16.9D0 1 1 5 2) #("Ford Pantera L" 15.8D0 8 351 264 4.22D0 3.17D0 14.5D0 0 1 5 4) #("Ferrari Dino" 19.7D0 6 145 175 3.62D0 2.77D0 15.5D0 0 1 5 6) #("Maserati Bora" 15 8 301 335 3.54D0 3.57D0 14.6D0 0 1 5 8) #("Volvo 142E" 21.4D0 4 121 109 4.11D0 2.78D0 18.6D0 1 1 4 2))

LS-USER> (combine *)
#2A(("Mazda RX4" 21 6 160 110 3.9D0 2.62D0 16.46D0 0 1 4 4) ("Mazda RX4 Wag" 21 6 160 110 3.9D0 2.875D0 17.02D0 0 1 4 4) ("Datsun 710" 22.8D0 4 108 93 3.85D0 2.32D0 18.61D0 1 1 4 1) ("Hornet 4 Drive" 21.4D0 6 258 110 3.08D0 3.215D0 19.44D0 1 0 3 1) ("Hornet Sportabout" 18.7D0 8 360 175 3.15D0 3.44D0 17.02D0 0 0 3 2) ("Valiant" 18.1D0 6 225 105 2.76D0 3.46D0 20.22D0 1 0 3 1) ("Duster 360" 14.3D0 8 360 245 3.21D0 3.57D0 15.84D0 0 0 3 4) ("Merc 240D" 24.4D0 4 146.7D0 62 3.69D0 3.19D0 20 1 0 4 2) ("Merc 230" 22.8D0 4 140.8D0 95 3.92D0 3.15D0 22.9D0 1 0 4 2) ("Merc 280" 19.2D0 6 167.6D0 123 3.92D0 3.44D0 18.3D0 1 0 4 4) ("Merc 280C" 17.8D0 6 167.6D0 123 3.92D0 3.44D0 18.9D0 1 0 4 4) ("Merc 450SE" 16.4D0 8 275.8D0 180 3.07D0 4.07D0 17.4D0 0 0 3 3) ("Merc 450SL" 17.3D0 8 275.8D0 180 3.07D0 3.73D0 17.6D0 0 0 3 3) ("Merc 450SLC" 15.2D0 8 275.8D0 180 3.07D0 3.78D0 18 0 0 3 3) ("Cadillac Fleetwood" 10.4D0 8 472 205 2.93D0 5.25D0 17.98D0 0 0 3 4) ("Lincoln Continental" 10.4D0 8 460 215 3 5.424D0 17.82D0 0 0 3 4) ("Chrysler Imperial" 14.7D0 8 440 230 3.23D0 5.345D0 17.42D0 0 0 3 4) ("Fiat 128" 32.4D0 4 78.7D0 66 4.08D0 2.2D0 19.47D0 1 1 4 1) ("Honda Civic" 30.4D0 4 75.7D0 52 4.93D0 1.615D0 18.52D0 1 1 4 2) ("Toyota Corolla" 33.9D0 4 71.1D0 65 4.22D0 1.835D0 19.9D0 1 1 4 1) ("Toyota Corona" 21.5D0 4 120.1D0 97 3.7D0 2.465D0 20.01D0 1 0 3 1) ("Dodge Challenger" 15.5D0 8 318 150 2.76D0 3.52D0 16.87D0 0 0 3 2) ("AMC Javelin" 15.2D0 8 304 150 3.15D0 3.435D0 17.3D0 0 0 3 2) ("Camaro Z28" 13.3D0 8 350 245 3.73D0 3.84D0 15.41D0 0 0 3 4) ("Pontiac Firebird" 19.2D0 8 400 175 3.08D0 3.845D0 17.05D0 0 0 3 2) ("Fiat X1-9" 27.3D0 4 79 66 4.08D0 1.935D0 18.9D0 1 1 4 1) ("Porsche 914-2" 26 4 120.3D0 91 4.43D0 2.14D0 16.7D0 0 1 5 2) ("Lotus Europa" 30.4D0 4 95.1D0 113 3.77D0 1.513D0 16.9D0 1 1 5 2) ("Ford Pantera L" 15.8D0 8 351 264 4.22D0 3.17D0 14.5D0 0 1 5 4) ("Ferrari Dino" 19.7D0 6 145 175 3.62D0 2.77D0 15.5D0 0 1 5 6) ("Maserati Bora" 15 8 301 335 3.54D0 3.57D0 14.6D0 0 1 5 8) ("Volvo 142E" 21.4D0 4 121 109 4.11D0 2.78D0 18.6D0 1 1 4 2))

LS-USER> (matrix-df (keys mtcars) *)
#<DATA-FRAME (32 observations of 12 variables)>
nathanvy commented 3 days ago

Are all the vectors of the same length in whats returned from the query?

They are, but it looks like there are dozens of zero-elements towards the end of the outer vector. With the :vectors results style, postmodern is probably calling vector-push-extend internally which tries to avoid mallocs under the hood by doing them in chunks and filling with zeroes. I bet that's what's tripping the combine operation up.