JuliaApproximation / DomainSets.jl

A Julia package for describing domains as continuous sets of elements
MIT License
72 stars 12 forks source link

Making VectorProductDomain more generic #66

Closed oschulz closed 3 years ago

oschulz commented 4 years ago

Would it be possible for VectorProductDomain to have a type parameter for the array type, to allow allow for using a CuArray, StructArray, etc.? If so, adding support for Adapt.jl might make sense in the same context.

daanhb commented 4 years ago

It should be relatively easy to add a new product domain for a specific type, since most of the logic is implemented for the abstract supertype ProductDomain.

In contrast, parameterizing the container type itself is not a simple thing to do, I'm afraid. What do you want to do exactly? Right now, it should be possible to use any AbstractVector to check for membership of a VectorProductDomain without conversion, unless the element types don't match. (Do let me know if this doesn't work, it should.) But this assumes you're happy with storing the domains themselves as a Vector. If that is not the case, something new would be needed.

oschulz commented 4 years ago

I meant using something like

struct VectorProductDomain{T,D,V<:AbstractVector{D}} <: ProductDomain{V}
    domains::D
end

Adapt.adapt_structure(to, d::VectorProductDomain) = VectorProductDomain(adapt(to, d.domain))

So that one can do things like adapt(CuArray, my_domain), as part of lifting a larger computation (with nested data structures, containing ProductDomains somewhere) onto a GPU.

daanhb commented 4 years ago

Note that there are four types involved here:

The connection is that eltype(DD) == D and eltype(D)==T. In turn, eltype(V)==T. In other words, V is some vector of elements of type T, and DD is some vectors of elements of type Domain{T}. The simplification in VectorProductDomain is that V==Vector{T} and DD==Vector{D}.

So, the most general vector product domain would be:

struct GeneralVectorProductDomain{DD,T,V<:AbstractVector{T}} <: ProductDomain{V}
    domains::DD
end

It would be assumed that domains is an indexable list of domains of type D. A constructor would have to verify or ensure that eltype(eltype(domains))==eltype(V)==T.

In your intended usage with adapt, which of the types would end up changing? This is not clear to me. Is it enough for V to change to a CuArray, or would DD have to be a CuArray of Domain{T}? If that is even possible, I don't know :-)

oschulz commented 4 years ago

Well, in many use cases, e.g. whendomains if a vector of a bitstype, e.g. intervals, representing domains as a CuArray should be no problem, right?

oschulz commented 4 years ago

Ah, sorry - I think I misunderstood something, I mixed up Vector{D} and Vector{T}.

oschulz commented 4 years ago

Now I see - the difficulty here stems from the fact that domains don't have to be a subtype of Domain.

Checking the element types in the ctor, like you proposed, should be fine though, right?

In your intended usage with adapt, which of the types would end up changing? This is not clear to me. Is it enough for V to change to a CuArray, or would DD have to be a CuArray of Domain{T}

Hm ... not sure, I think I haven't fully understood the role of V yet. But DD would definitely be a CuArray. Let's take a this use case - a very high dimensional space, but the domains are simple intervals. So I'd like to do something like this

d = ProductDomain([0..i for i in 1:10^5]
d_cu = adapt(CuArray, d)

with the result that d_cu.domains is a CuArray. So that something like

in.(lots_of_points_cu, Ref(d_cu))

could be made very fast.

daanhb commented 4 years ago

Okay, the example helps, but I think we're not in the same place yet :-)

If pd is a product domain, then in(x, pd) makes sense when x is a vector. Here, the output is a single boolean: true if all the elements of x belong to the corresponding domains in pd. The in function maps a vector to a single boolean.

The statement in.(x, Ref(pd)) makes sense when x is a vector of vectors. In that case, the output is a vector of booleans: the broadcast notation yields something like [in(el, pd) for el in x], so each el has to be a vector.

Somehow I don't think that is what you have in mind. Is it? If you want to check a very large vector for membership of a very high-dimensional product domain and expect a single boolean output, then we have to look into how to do the mapreduce operation efficiently on GPU's. If you want to obtain a large vector of booleans, with each element telling whether x[i] belongs to pd.domains[i], then you don't actually need a product domain. You could do in.(x_cu, domain_cu) where domain_cu is a CuArray of intervals (or other domains).

daanhb commented 4 years ago

For reference, here is the mapreduce operation of a product domain. I don't know whether this would be efficient when x and domains are CuArray's, or even if I could represent a vector of domains on a GPU - I'm not familiar enough with it I'm afraid.

oschulz commented 4 years ago

The statement in.(x, Ref(pd)) makes sense when x is a vector of vectors. In that case, the output is a vector of booleans

Yes, that's what I had in mind, as a use case example. For example assume you've generated a lot of samples via MCMC in a high-dimensional space, and want to know which are in a certain sub-volume. Or just which fraction of them, in which case you'd then take the mean of the resulting boolean vector.

oschulz commented 4 years ago

For reference, here is the mapreduce operation of a product domain

In my described use case, many mapreduce's would run in parallel - not sure how efficient it would be, but I think a fitting implementation could be found. It might not even be about the pure speed - you may have to do volume membership tests as part of a larger computation, and want to avoid moving data back to the CPU for that.

daanhb commented 4 years ago

Ok cool, sorry - it looked like a suspiciously advanced use case to me (I guess it is!) especially with the product domain itself already having dimension 10^5 in your example. It may take more work to be efficient, but I agree that the first goal should be to be able to express the problem. So I'll look into that general product domain first.

oschulz commented 4 years ago

Well, it wouldn't necessarily be 10^5 dimensions, but in Bayesian inference (e.g. via HMC), dimensionality can get high. And we often need a way to express parameter bounds.

It may take more work to be efficient, but I agree that the first goal should be to be able to express the problem. So I'll look into that general product domain first.

Thanks! I agree, making things really efficient may not be so easy, but being able to store bounds in a flexible would already be great. It would also make it possible to use things like StructArray, which would (e.g. for intervals) result in an column-oriented memory layout.

daanhb commented 4 years ago

Meanwhile I implemented a more general product domain on master. I'm hesitant to include packages that are not otherwise related to domains as dependencies, but it should now be easy to link to Adapt.jl elsewhere.

It is called GenericVectorProductDomain, but one can also use the existing ProductDomain function (with an AbstractVector of domains as argument) or the ProductDomain{V} constructor where V is an AbstractVector. The concrete subtype that is returned depends on the types of the arguments and on V.

julia> using DomainSets, OffsetArrays

julia> pd = GenericVectorProductDomain([0..i for i in 1:500])
0..1 x 0..2 x 0..3 x 0..4 x 0..5 x ... x 0..496 x 0..497 x 0..498 x 0..499 x 0..500

julia> pd_ofs = ProductDomain(OffsetArray([0..i for i in 1:10], 1:10))
0..1 x 0..2 x 0..3 x 0..4 x 0..5 x 0..6 x 0..7 x 0..8 x 0..9 x 0..10

julia> typeof(ans)
GenericVectorProductDomain{Array{Int64,1},OffsetArray{Interval{:closed,:closed,Int64},1,Array{Interval{:closed,:closed,Int64},1}}}

julia> pd = GenericVectorProductDomain{Vector{Float64}}([0..i for i in 1:500])
0.0..1.0 x 0.0..2.0 x 0.0..3.0 x 0.0..4.0 x 0.0..5.0 x ... x 0.0..496.0 x 0.0..497.0 x 0.0..498.0 x 0.0..499.0 x 0.0..500.0

In the line above, since V is a Vector{Float64}, the element type of the domains is converted to Float64. This is to ensure that, as mentioned above, eltype(eltype(domains))==eltype(V).

Would this work?

oschulz commented 4 years ago

Oh, yes, very nice:

using DomainSets, IntervalSets, StructArrays
d = ProductDomain(StructArray([0..i for i in 1:10]))
d.domains isa StructArray
oschulz commented 4 years ago

Thank you!

daanhb commented 4 years ago

I'm happy to hear that it works! I forgot to test the in.(x, Ref(pd)) example, but I just did and it seems to work too. Once the basic functionality seems right, I'll tag the minor release update.

For reference, as I mentioned the membership of product domains revolves around a mapreduce function call, though there are a few wrappers around it. All memory allocations do occur in that function call, I checked. In case you want to change this for a combination of types (for particular DD and V say), it is best to specialize indomain.

The GenericVectorProductDomain is still a vector domain (there could be an even more general product domain I suppose), because vector domains are a special case: the in method accepts any AbstractVector and passes it on to indomain function without conversion. That is because conversions of arrays are typically not very efficient.

oschulz commented 4 years ago

What's the difference between ProductDomain and GenericVectorProductDomain now`?

Btw - seems to run on a GPU, too, now (just not super fast, but that was to be expected). Will look more deeply into this at some point.

oschulz commented 4 years ago

Regarding Ref(pd) in bcasts - how about defining

@inline Base.Broadcast.broadcastable(d::Domain) = Ref(d)

I guess domains should in general behave like scalars under broadcasting, right?

daanhb commented 4 years ago

ProductDomain is the abstract supertype of product domains. I've made different concrete subtypes, because they require different type parameters and I did not want to make a single super-parameteric type.

One advantage of the abstract supertype is that it enables the ProductDomain{T}(...) syntax, i.e., you can specify what the element type of the product domain should be. You can't usually append a type parameter to a function, only to types. Also, the constructor returns a concrete type that matches the product domain the user has in mind (based on the arguments), without the user having to worry about what the concrete type is called. This means the implementation of the concrete type could also change in between releases, as long as their behaviour remains the same. It also means I can make special cases that are more efficient (for example require fewer type parameters) for the compiler.

So, you'll always have ProductDomain(whatever) isa ProductDomain, but in reality it will be some concrete subtype.

daanhb commented 4 years ago

Regarding Ref(pd) in bcasts - how about defining

@inline Base.Broadcast.broadcastable(d::Domain) = Ref(d)

I guess domains should in general behave like scalars under broadcasting, right?

I'm afraid we already used broadcast to make maps, e.g.: 2 * d .+ 4 maps the domain by 2x+4. It is based on seeing the domain as a container of points. It is just a continuous container, rather than a discrete one.

But perhaps something can be done when broadcast styles are mixed, reverting to Ref(d) in that case.

oschulz commented 4 years ago

Ah, silly me, sorry. I see, ProductDomain returns either a VectorProductDomain or a GenericVectorProductDomain, depending on the array type. Just curious, is there a reason to have both?

oschulz commented 4 years ago

It is based on seeing the domain as a container of points.

Oh, sure - Those semantics do make sense of course!

daanhb commented 4 years ago

Ah, silly me, sorry. I see, ProductDomain returns either a VectorProductDomain or a GenericVectorProductDomain, depending on the array type. Just curious, is there a reason to have both?

Of course any VectorProductDomain can be represented as a GenericVectorProductDomain, but that doesn't mean that it has to. The implementation is shared because it is implemented at a higher level. In some other case where I've used this pattern, it tremendously simplifies the output of statements, in particular when they contain the type... That is actually my main motivation :-)

A hidden cost of generalizing types is also in compiler time, though I've never actually tried to measure whether having fewer type parameters helps.

oschulz commented 4 years ago

You can't usually append a type parameter to a function, only to types

Well, it's possible as long as the functions are constructors, of course. But yes, one has to be careful with the order of type parameters, it can't be changed, though types at the end can be left out. But for a type Foo{A, B, C} one can define a custom ctor Foo{A}(...). I agree, can be fiddly sometimes.

oschulz commented 4 years ago

But I'm all in favour of having abstract types, keeps things extendable. I was just wondering if VectorProductDomain and GenericVectorProductDomain should be separate types, but I won't try to meddle with the design of your package (which I like a lot). :-)

daanhb commented 4 years ago

But I'm all in favour of having abstract types, keeps things extendable. I was just wondering if VectorProductDomain and GenericVectorProductDomain should be separate types, but I won't try to meddle with the design of your package (which I like a lot). :-)

So VectorProductDomain{T} has a single type parameter, that's all. This pattern is fairly recent, it wasn't like that before. But for general types with many parameters the constructors for all the different use cases quickly become difficult to manage. In this case, the final straw came when I realized that certain use cases needed an extra parameter, but the others did not.

For completeness, and because there is no documentation, it is about the product of a 1D with a 2D domain. You typically want this to behave as a 3D domain. But then in may be called with an SVector{3}, and the job of the product domain is to call its elements with a 1d and 2d vector respectively. This information is kept in the DIM parameter in the VcatDomain. This gave me some headaches for a while. In contrast, a VectorDomain assumes that all dimensions are the same, because storing the information necessary for the inverse of vcat would require storing another vector. Worse is that all types of the domains have to be the same in all dimensions too, otherwise it becomes less efficient.

daanhb commented 4 years ago

Hmm, it is VectorProductDomain{T,D} with two type parameters.

oschulz commented 4 years ago

But for general types with many parameters the constructors for all the different use cases quickly become difficult to manage.

I know what you mean ... had merry fun with that in ArraysOfArrays. :-)

oschulz commented 4 years ago

it is about the product of a 1D with a 2D domain. You typically want this to behave as a 3D domain

Oh, right ... you did some heavy lifting, here!

daanhb commented 4 years ago

I'm reconsidering and I currently made the VectorProductDomain more generic after all (not committed yet). It makes the code shorter. So, GenericVectorProductDomain will disappear again soon.

oschulz commented 4 years ago

Nice! I have a very clunky domain/bounds system in the internals of BAT.jl, and I've love to replace it with something clean like DomainSets ...

daanhb commented 3 years ago

I think the code we were talking about has been merged, so I'm closing this issue - feel free to reopen or open a new one if things can be improved.

oschulz commented 3 years ago

Sure, thanks again!