JordanMarr / SqlHydra

SqlHydra is a suite of NuGet packages for working with databases in F# including code generation tools and query expressions.
MIT License
212 stars 20 forks source link

How to use a correlated subquery? #48

Closed ntwilson closed 1 year ago

ntwilson commented 1 year ago

I'm trying to write a query with a subquery that references rows from the parent query. For example, say I want SQL for getting the latest weather value from a table of weather data that looks roughly like

SELECT * FROM weather w
WHERE 
  w.time >= '2023-02-01' AND w.time <= '2023-02-10' 
  AND w.timestamp = (
    SELECT MAX(timestamp) FROM weather inner 
    WHERE inner.location = w.location AND inner.instrument = w.instrument AND inner.time = 
w.time
  )

First I tried

let lBound,uBound = (DateTime(2023,02,01), DateTime(2023,02,10))
select { 
  for w in table<weather> do
  where (
    w.time >= lBound && w.time <= uBound && w.timestamp = subqueryOne (
      select { 
        for inner in table<weather> do
        where (inner.location = w.location && inner.instrument = w.instrument && inner.time = w.time)
        select (maxBy inner.timestamp)
      }
    )
  )
}

but that doesn't let me use a select to start a subquery inside another one. So then I tried

let lBound,uBound = (DateTime(2023,02,01), DateTime(2023,02,10))
let selectInner = select
select { 
  for w in table<weather> do
  where (
    w.time >= lBound && w.time <= uBound && w.timestamp = subqueryOne (
      selectInner { 
        for inner in table<weather> do
        where (inner.location = w.location && inner.instrument = w.instrument && inner.time = w.time)
        select (maxBy inner.timestamp)
      }
    )
  )
}

which compiles, but gave me

System.NotImplementedException : The method or operation is not implemented. Stack Trace: at SqlHydra.Query.LinqExpressionVisitors.visit@212(FSharpFunc2 qualifyColumn, Expression exp, Query query) at SqlHydra.Query.LinqExpressionVisitors.visit@212(FSharpFunc2 qualifyColumn, Expression exp, Query query) at SqlHydra.Query.SelectBuilders.SelectBuilder2.Where[T](QuerySource2 state, Expression`1 whereExpression)

I thought I'd try

let lBound,uBound = (DateTime(2023,02,01), DateTime(2023,02,10))
let sub = select {
  for inner in table<weather> do
  groupBy (inner.location, inner.instrument, inner.time)
  select (inner.location, inner.instrument, inner.time, maxBy inner.timestamp)
}
select { 
  for w in table<weather> do
  where (
    w.time >= lBound && w.time <= uBound && 
    (w.location, w.instrument, w.time, w.timestamp) = subqueryOne sub
  )
}

which also gives me a NotImplementedException (not surprising, and I think there's another open issue about using tuples in a where clause) and

let lBound,uBound = (DateTime(2023,02,01), DateTime(2023,02,10))
let sub = select {
  for inner in table<weather> do
  groupBy (inner.location, inner.instrument, inner.time)
  select (inner.location, inner.instrument, inner.time, maxBy inner.timestamp)
}
select { 
  for w in table<weather> do
  for (location, instrument, time, timestamp) in subqueryMany sub do
  where (
    w.time >= lBound && w.time <= uBound && 
    w.location = location && w.instrument = instrument && w.time = time && w.timestamp = timestamp
  )
}

but that fails to compile.

Any suggestions how to get this sort of query to work?

JordanMarr commented 1 year ago

Interesting... So you are trying to do a correlated subquery, but the compiler doesn't allow embedding the sub query. The last option seems like a possible workaround (duplicating the filtering to avoid the requirement of passing the correlated parent field).

I'll take a look at it.

JordanMarr commented 1 year ago

In the meantime, since you already know the query you want, you can do this:

let getForecast (lower: DateTime, upper: DateTime) = task {
  let sql = """
    SELECT * FROM myschema.weather w
    WHERE     w.time >= @lower AND w.time <= @upper
      AND w.timestamp = (
        SELECT MAX(timestamp) FROM weather inner 
        WHERE inner.location = w.location AND inner.instrument = w.instrument AND inner.time =  w.time
    )
  """

  use conn = new NpgsqlConnection(DB.connectionString)
  do! conn.OpenAsync()
  use cmd = new NpgsqlCommand(sql, conn)
  cmd.Parameters.Add(NpgsqlParameter("lower", lower)) |> ignore
  cmd.Parameters.Add(NpgsqlParameter("upper", upper)) |> ignore
  let reader = cmd.ExecuteReader()
  let hydra = HydraReader(reader :?> NpgsqlDataReader)

  return [ while reader.Read() do hydra.``myschema.weather``.Read() ]
}
ntwilson commented 1 year ago

Thanks for the help on this! Having the workaround for a hand-written query is definitely helpful.

JordanMarr commented 1 year ago

At least you can still benefit from the strong typing of the HydraReader, even if the query is manually typed. Sometimes it's just easier that way for the more complex queries.

ntwilson commented 1 year ago

Do you have any thoughts on how you would like to handle this situation? I might be able to spend some time looking into a solution for this if you have a plan for what you want the solution to look like.

JordanMarr commented 1 year ago

Here are a few options for getting correlated subqueries working:

Option 1

The subquery is still in its own function, and the parent table is passed into the subquery. (The naming of the passed in table, od, would matter since that is used to define the table alias, so it would need to match what is in the parent query.)

let maxOrderQty (od: Sales.SalesOrderDetail) = 
    select {
        for d in orderDetailTable do
        where (d.ProductID = od.ProductID)
        select (maxBy d.OrderQty)
    }

let! results = 
    select {
        for od in orderDetailTable do
        where (od.OrderQty = subqueryOne (maxOrderQty od))
        orderBy od.ProductID
        select (od.SalesOrderID, od.ProductID, od.OrderQty)
    }
    |> ctx.ReadAsync HydraReader.Read

The problem I hit with I ran into in my experiment branch is that the LinqExpressionVisitor would need to be able to actually evaluate that function to get the resulting SqlKata.Query, and I'm not sure that is possible. So that might be a dead-end... I'm not sure.

Option 2

One way to bypass this could be to create a new function, similar to table<>, that could be used in a separate subquery function to declaratively designate a parent table source without actually passing one in. Something like this:

let maxOrderQty = 
    let od = correlatedTable<Sales.SalesOrderDetail> // or maybe `parentTable`
    select {
        for d in orderDetailTable do
        where (d.ProductID = od.ProductID)
        select (maxBy d.OrderQty)
    }

let! results = 
    select {
        for od in orderDetailTable do
        where (od.OrderQty = subqueryOne maxOrderQty)
        orderBy od.ProductID
        select (od.SalesOrderID, od.ProductID, od.OrderQty)
    }
    |> ctx.ReadAsync HydraReader.Read

Note that the correlatedTable function would be similar to the table function, but its definition would need to return an instance of the table 'T itself instead of a QuerySource<'T>.

Option 3

The third option would be to allow nesting the query within the parent query, at which point, it should be able to access the parent od table. I have seen this done before on the Pulumi.FSharp.Extensions project, and I think I asked him how he did it in the issues forum, but I don't remember:

bucket {
    name "bucket-example"
    acl  "private"

    bucketWebsite { 
        indexDocument "index.html"
    }
}

In this case, the subQueryOne and subQueryMany could be turned into nested CE builders of their own:

let! results = 
    select {
        for od in orderDetailTable do
        where (od.OrderQty = 
            subQueryOne {
                for d in orderDetailTable do
                where (d.ProductID = od.ProductID)
                select (maxBy d.OrderQty)
            }
        )
        orderBy od.ProductID
        select (od.SalesOrderID, od.ProductID, od.OrderQty)
    }
    |> ctx.ReadAsync HydraReader.Read

However, as I look at the buckets example from the Pulumi library, I don't think it would work for us since we need the subquery to be within the where clause.

TBH, it seems to me that Option 2 is our best bet. It's reasonably easy to understand, and should be easy to implement. What do you think?

ntwilson commented 1 year ago

Well, speaking as someone who is much less familiar with that code than you are 😉:

I'd be happy to look into option 2! I'm not sure exactly how to make the compiler reject a subquery that uses a parentTable or correlatedTable that isn't actually a part of the parent query, but I'd be happy to do some digging.

I think option 3 would be really cool from a usability standpoint (it really closely matches the SQL that someone would write), but I'm worried that might be really tough (at least for me) to figure out. Also, I think that's potentially introducing a breaking change unless we made different versions of subqueryOne and subqueryMany that have different names, and 4 different ways to do subqueries might be overkill.

Option 1 seems like it would similarly be tough to implement, and kinda seems like it doesn't add a lot of benefit over option 2.

JordanMarr commented 1 year ago

Tried a quick experiment with correlatedTable in this branch, but realizing that it is likely more complicated than my first thought.

I agree that option 3 would be really cool. Maybe it is possible to create an overload of the where custom operation on the select builder to allow embedding another select.