davidagold / StructuredQueries.jl

Query representations for Julia
Other
53 stars 5 forks source link

WIP: Include higher-order lifting #25

Closed davidagold closed 7 years ago

davidagold commented 7 years ago

I'm leaving this as WIP until I coordinate it across AbstractTables as well. This implements the higher-order lifting strategy described in https://github.com/davidagold/AbstractTables.jl/issues/2. The only noticeable user-facing change is that | and & now exhibit three-valued logic semantics, so long as they are visible to the querying macro:

julia> using TablesDemo

julia> tbl = TablesDemo.Table(
           A = NullableArray{Bool}(3),
           B = NullableArray([true, true, true])
       )

TablesDemo.Table
│ Row │ A     │ B    │
├─────┼───────┼──────┤
│ 1   │ #NULL │ true │
│ 2   │ #NULL │ true │
│ 3   │ #NULL │ true │

julia> @collect filter(tbl, A | B)
TablesDemo.Table
│ Row │ A     │ B    │
├─────┼───────┼──────┤
│ 1   │ #NULL │ true │
│ 2   │ #NULL │ true │
│ 3   │ #NULL │ true │

cc @johnmyleswhite

TODO:

codecov-io commented 7 years ago

Current coverage is 62.08% (diff: 86.66%)

Merging #25 into master will increase coverage by 5.04%

@@             master        #25   diff @@
==========================================
  Files            27         28     +1   
  Lines           305        335    +30   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits            174        208    +34   
+ Misses          131        127     -4   
  Partials          0          0          

Powered by Codecov. Last update 8d16960...17bd207

davidagold commented 7 years ago

@nalimilan That's a good idea re: null_safe_op. So, we'd have something like the following? It looks like the value of null_safe_op for a particular combination of f and x gets inlined during lowering and then the branch gets taken care of by llvm (maybe?)...

julia> function lift(f, x)
           if Base.null_safe_op(f, eltype(x))
               return Nullable(f(unsafe_get(x)), x.hasvalue)
           else
               U = Core.Inference.return_type(f, Tuple{eltype(typeof(x))})
               if isnull(x)
                   return Nullable{U}()
               else
                   return Nullable(f(unsafe_get(x)))
               end
           end
       end
lift (generic function with 1 method)

julia> f(x::Int) = x
f (generic function with 1 method)

julia> Base.null_safe_op(::typeof(f), ::Type{Int}) = true

julia> x = Nullable(1)
Nullable{Int64}(1)

julia> @code_warntype lift(f, x)
Variables:
  #self#::#lift
  f::#f
  x::Nullable{Int64}
  U::Type{Int64}

Body:
  begin 
      NewvarNode(:(U::Type{Int64}))
      unless $(QuoteNode(true)) goto 5 # line 3:
      return $(Expr(:new, Nullable{Int64}, :((Core.getfield)(x,:hasvalue)::Bool), :((Core.getfield)(x,:value)::Int64)))
      5:  # line 5:
      U::Type{Int64} = $(QuoteNode(Int64)) # line 6:
      unless (Base.box)(Base.Bool,(Base.not_int)((Core.getfield)(x::Nullable{Int64},:hasvalue)::Bool)) goto 12 # line 7:
      return $(Expr(:new, Nullable{Int64}, false))
      12:  # line 9:
      return $(Expr(:new, Nullable{Int64}, true, :((Core.getfield)(x,:value)::Int64)))
  end::Nullable{Int64}

julia> @code_llvm lift(f, x)

define void @julia_lift_63473(%Nullable.4* noalias sret, %Nullable.4*) #0 {
top:
  %.sroa.2 = alloca [7 x i8], align 1
  %2 = getelementptr inbounds %Nullable.4, %Nullable.4* %1, i64 0, i32 0
  %3 = load i8, i8* %2, align 1
  %4 = and i8 %3, 1
  %5 = getelementptr inbounds %Nullable.4, %Nullable.4* %1, i64 0, i32 1
  %6 = load i64, i64* %5, align 1
  %7 = getelementptr inbounds %Nullable.4, %Nullable.4* %0, i64 0, i32 0
  store i8 %4, i8* %7, align 8
  %8 = getelementptr inbounds i8, i8* %7, i64 1
  %9 = getelementptr inbounds [7 x i8], [7 x i8]* %.sroa.2, i64 0, i64 0
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %8, i8* %9, i64 7, i32 1, i1 false)
  %10 = getelementptr inbounds %Nullable.4, %Nullable.4* %0, i64 0, i32 1
  store i64 %6, i64* %10, align 8
  ret void
}

You probably already knew that, but I'd never looked at how the null_safe_op business lowers. That's pretty cool.

nalimilan commented 7 years ago

Yes, I'm not sure where exactly the branch elimination is done, but what's certain is that it enables SIMD instructions in simple cases (cf. the recent BaseBenchmarks addition).

Though you won't be able to use the two-argument form of the Nullable() constructor due to the reversal of isnull in 0.6: you'll have to write the ifelse manually like I did in my recent PR against NullableArrays.

davidagold commented 7 years ago

I think we need some design decisions re: semantics here as far as isequal and isless goes -- mostly to cover how to treat mixed signatures. For instance, if we decide that, as far as sorting goes, we always want null values to be sorted last, then lift(isless, x::T, y::Nullable{S}), where T and S are non-Nullable types, shouldn't default to isless(x, y) since that will MethodError.

edit: Similarly, for isequal, I think it's reasonable to expect that lift(isequal, 1, Nullable(1)) should return true, so that will need its own implementation. We should probably include == in the blacklist, then, and it should have similar behavior to isequal except that, since the former is a "stricter" and the latter is a "looser" sense of equality, we should have ==(Nullable{Int}(), Nullable{Int}()) == false and isequal(Nullable{Int}(), Nullable{Int}()) == true.

nalimilan commented 7 years ago

I'd say there are two different questions:

davidanthoff commented 7 years ago

We may want to support mixing scalars and nullables. I support this option, but I think that if we do it we should apply it for all functions, not just isequal and isless.

Same here, very much in favor of having support for mixing scalars and nullables. What I don't understand is whether the promotion system could be used for that or not.

johnmyleswhite commented 7 years ago

I won't argue it more than this one post, but I'm not in favor of mixing scalars and nullables. It seems to defeat all of the safety arguments for having nullables. Indeed, I'd argue that 1 + Nullable(1) is essentially an anti-promotion: you've gone from a safe scalar type to an unsafe scalar type that doesn't have any more precision, it just admits the dangerous NULL value.

nalimilan commented 7 years ago

What I don't understand is whether the promotion system could be used for that or not.

I don't think we actually need it, since the lifting semantics imply calling get on nullables when they are not null: lifting functions/macros would just pass scalars as-is (which unsafe_get does now BTW). Anyway, the promotion rule for scalar and nullables is simple: it wraps the scalar inside a nullable. Promotion is mainly useful to choose the element type.

nalimilan commented 7 years ago

@johnmyleswhite The problem is to support things like select(df, y = 2x) inside macros. Writing y = Nullable(2) * x would be really weird, since 2 is actually the only value in the expression which is not originally a nullable.

johnmyleswhite commented 7 years ago

Ok, but we already resolved the base case of select(df, y = 2x) with automatic lifting. If you're opposed to automatic lifting, I can see why you want to be able to deal with that, but I think that just begs the question why you need to keep adding more custom functionality to work around non-automatic lifting.

nalimilan commented 7 years ago

To be clear, I was speaking about what to do in automatic lifting contexts. What to do outside of them is another subject for endless discussions. :-)

johnmyleswhite commented 7 years ago

To be clear, I was speaking about what to do in automatic lifting contexts.

So I'm guessing things have changed a lot since I wrote my notes about this in June/July? In the draft I wrote, you don't ever need mixed operations -- everything computation happens in the non-nullable space except for null-ness checks.

davidagold commented 7 years ago

I see two uses for mixed-signature lifting:

  1. operating on tabular data types that store both Nullable and non-Nullable eltype columns. I know you're against this, John, but they may happen.
  2. interpolation of non-Nullable values into query argument expressions.
davidagold commented 7 years ago

So I'm guessing things have changed a lot since I wrote my notes about this in June/July? In the draft I wrote, you don't ever need mixed operations -- everything computation happens in the non-nullable space except for null-ness checks.

That's basically what this does?

davidagold commented 7 years ago

Okay, let me clarify some things for myself:

  1. In general, higher-order lift + generic isnull/unsafe_get supports mixed-signature lifting.
  2. The default for all mixed-signature lifting is still standard lifting semantics
  3. As Milan notes, the "semantics decisions" I mentioned above are not about whether or not to support mixed argument signatures for isless and isequal -- they are rather to ensure that certain invariants propagate, e.g. null is always greater than a non-null, and isequal(null, null) == true (or false, I don't really care so long as it's clearly documented). Put another way, isless and isequal receive non-standard lifting semantics, and so require special lift methods, similar to how | and & require special methods to implement 3VL semantics.
  4. My intuition is that this similar to the machinery implemented in John's notes, but moved to the lift function. John, if I'm mistaken, would you please clarify?
davidagold commented 7 years ago

Here's my interpretation of how this differs from John's notes. Suppose one does select(src, B = 2A). In John's notes, we generate a kernel row -> 2 * row[1]. Then we iterate over zip(src[:A]), checking at each iteration if the current row (of zip(src[:A])) has nulls. If it does, we insert a null into the result column B. If not, we map unwrap (equivalent to Base.unsafe_get) over the row, apply the kernel, and insert the result into B. Thus, the null checking happens in the iteration

In the paradigm implemented in this PR (and adapted to by https://github.com/davidagold/AbstractTables.jl/pull/5) we instead generate the kernel row -> lift(*, 2, row[1]) and essentially just map this over zip(src[:A]). That is, we iterate over zip(src[:A]), but rather than checking nulls before applying the kernel, we just apply the kernel and let the lift method handle all null checks.

It seems that the biggest difference entailed by the design implemented in this PR is that lift doesn't distinguish between 2 and row[1] as arguments -- that is, it doesn't see row[1] as an element of some column and 2 as a constant scalar. They're both just arguments and receive equal treatment -- i.e., null checks and unsafe gets. On the other hand, doing null checks during iteration as in John's notes singles out row[1] as an element of a Nullable eltype column and restricts the null checks to it.

However, since isnull is invariant over values of non-Nullable types, it should not contribute any unnecessary code load (ideally). unsafe_get should be similarly optimized away at compile time. So in practice --both in terms of semantics and in terms of code generation -- I don't see a huge difference between John's notes and the present implementation. Insofar as both strategies support select(src, B = 2*A), where the eltype of B is Nullable, they both support "mixed-signature" lifting. And insofar as the user doesn't need to do anything other than wrap select(src, B = 2*A) in a macro in order for lifting to work, such lifting is "automatic". So both approaches implement "automatic, mixed signature lifting". Neither approach ever defines a method *(::Int, Nullable{T}); mixed signature lifting is NOT achieved by method extension. It is purely a consequence of the present implementation of lift combined with generic isnull and unsafe_get functions.

Please, somebody tell me if I'm just crazy and/or missing something.

davidagold commented 7 years ago

I can also see how my comments regarding isless and my use of the term "mixed-signature lifting" may be confusing. My original point was that, in Base, isless(x::Nullable, y::Nullable) behaves in such a way so that null values are never isless than non-null values. Thus, isless does not follow standard lifting semantics; it propagates an invariant in a way similar to | and & in three-valued logic. So, what should lift(isless, x, y) do? We can't just have lift(isless, x, y) fall back on isless(x, y) because this will MethodError if one attempts to do, say filter(src, 2 < A) when the eltype of A is Nullable. My solution is not to define isless{T, S}(x::T, y::Nullable{S}), i.e. lift by method extension. Rather, my solution, which I will push shortly, is to define

@inline function lift(::typeof(isless), x, y)::Bool
    return ifelse( isnull(x),
        ifelse( isnull(y),
            false,                                  # x, y null
            false                                   # x null, y not null
        ),
        ifelse( isnull(y),
            true,                                   # x not null, y null
            isless(unsafe_get(x), unsafe_get(y))    # x, y not null
        )
    )
end

So this is "mixed-signature" lifting in the sense that it gives proper lifted semantics to filter(src, 2 < A) but it is not "mixed-signature" in the sense that it relies on any method defined on a mixed signature. This is pretty much the same as the above example of how filter(src, A | true) would work using the 3VL truth-tables implemented in this PR. One might say that the mixed-signature part of such lifting is handled by the "automatic" facility, which is just the syntactic transformation of f(xs...) to lift(f, xs...) within the call graph of a UDF expression.

Sorry to spam this thread with my yabbering. Does this help to clarify?

EDIT: Some change will have to be made to the above implementation of lift(::typeof(isless), x, y) to handle cases when the argument types are not "null safe", but the spirit of the solution remains the same.

nalimilan commented 7 years ago

Sounds like a good approach to me. Implementing the logic in a lift function allows separating it from the query macros, which sounds cleaner and makes it reusable.

davidagold commented 7 years ago

Sounds like a good approach to me. Implementing the logic in a lift function allows separating it from the query macros, which sounds cleaner and makes it reusable.

And this strategy allows for three-valued logic semantics.

davidagold commented 7 years ago

Hmm, I did a bunch of rebasing, so the diff has come out rather strange. I think the lesson there is, rename/reorganize files in a different branch, not while you're doing work on them. =/

EDIT: Also, TIL null_safe_op is not defined for 0.5.

davidagold commented 7 years ago

https://github.com/davidagold/AbstractTables.jl/pull/5 Looks good locally, and now includes a handful of tests of non-standard lifting semantics. (It also uncovered what I think is a substantial oversight concerning our syntax decisions. TODO: document this elsewhere and x-ref.) I'm going to look this PR over in the morning and then likely merge so I can make the other core changes to the internals and then lay out some semi-stable documentation.