Open ParadaCarleton opened 3 years ago
I'm unsure if this is the most intuitive behavior. The problem here is that, in general, the rows of a table represent independent observations, while each column is a different attribute.
Not in Julia though :confused: Julia is column-major, so extracting each observation with the access pattern `x[:, i]` is faster than `x[i, :]`. Therefore one should always prefer the representation where each column is an observation.
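For illustration, a minimal sketch of that access-pattern difference (using BenchmarkTools; the array size is arbitrary):

```julia
using BenchmarkTools

x = rand(1_000, 1_000);

@btime $x[:, 500];   # contiguous column slice
@btime $x[500, :];   # strided row slice: touches one element per column
```

Both slices copy 1 000 elements, but the column slice reads contiguous memory while the row slice jumps a full column-stride between elements.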
My suggestion would be to use the following set of rules for `.~` broadcasting (a rough sketch of these rules in code follows the list):

1. If the object being broadcasted over is a table from Tables.jl and is not an `Array`, Turing should recognize that each row is an independent observation for the purposes of `pointwise_loglikelihood` and any similar functions. (Although internally, it should be treated as a broadcast over columns -- this should improve performance, since columns usually have only one type.)
2. If the object is an array, not from Tables.jl, and not square, broadcasting over columns is safe -- `.~` will throw a dimension mismatch if the user meant to broadcast over rows. We can assume the user meant to broadcast over columns because Julia is column-major without risking too much.
3. If the object being broadcasted over is square and not a Table, or if it is both a Table and an `Array`, an error should be thrown.
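A rough sketch of what these rules could look like as a single helper (the name `observation_slices` is hypothetical, purely to illustrate the dispatch logic):

```julia
using Tables

# Hypothetical helper illustrating the proposed rules; not an existing Turing.jl API.
function observation_slices(x)
    istable  = Tables.istable(x)
    ismatrix = x isa AbstractMatrix
    if istable && !ismatrix
        return Tables.rows(x)        # rule 1: each table row is one observation
    elseif ismatrix && !istable && size(x, 1) != size(x, 2)
        return eachcol(x)            # rule 2: each column is one observation
    else
        # rule 3: square matrices, or things that are both a table and an array, are ambiguous
        error("ambiguous layout: cannot tell rows from columns")
    end
end
```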
I do agree that things are a bit confusing and ambiguous, and I appreciate your input on this:) But IMO this is a broader issue and not something we should begin addressing in `.~`; eventually the resolution of the issue should reach `.~` though.
It's essentially just a question of how to represent "batches" of inputs, i.e. a collection of "independent" inputs in the sense that we want the computation to be performed on each element of the collection independently, but we provide it as a `Batch` to allow performance improvements.
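Something like the following, where the `Batch` name and the `observations` helper are just placeholders for the idea, not an existing API:

```julia
# A thin wrapper that records how the independent observations are laid out,
# so downstream code can iterate them without guessing.
struct Batch{T}
    data::T
end

# e.g. a matrix batch where each column is one observation
observations(b::Batch{<:AbstractMatrix}) = eachcol(b.data)

# or a batch that is already a collection of observations
observations(b::Batch{<:AbstractVector}) = b.data

x = rand(3, 100)                  # 100 observations of dimension 3
for obs in observations(Batch(x))
    # perform the computation on each observation independently
end
```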
This is something we've given a lot of thought already (https://github.com/TuringLang/Bijectors.jl/discussions/178) and something I've started playing around with in a draft PR over in Bijectors.jl. We essentially want to extend the `ColVecs` and `RowVecs` from KernelFunctions.jl and provide them as a separate lightweight package. Once that's done we could start being way more explicit about all these things, e.g. we could easily add the `iid` method you suggested, which really just wraps the `Distribution` in a way that tells it what sort of `Batch` to expect.
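For reference, this is roughly how `ColVecs`/`RowVecs` behave in KernelFunctions.jl today, plus a sketch of what an `iid` wrapper could look like (the `iid`/`IID` names are hypothetical here):

```julia
using KernelFunctions: ColVecs, RowVecs

X = rand(3, 100)             # 3-dimensional data, 100 observations

cv = ColVecs(X)              # acts like a vector of the 100 columns
rv = RowVecs(rand(100, 3))   # acts like a vector of the 100 rows

length(cv)        # 100
cv[1] == X[:, 1]  # true

# Hypothetical `iid` wrapper: just records which batch layout a distribution expects.
struct IID{D,L}
    dist::D
    layout::L
end
iid(dist, layout) = IID(dist, layout)
```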
And I don't think we want to in any way specialize on Tables.jl in Turing.jl (this is just my personal opinion though!). This too should just be wrapped in something similar to `ColVecs` or `RowVecs` to specify what is what.
And I don't think we want to in any way specialize on Tables.jl in Turing.jl (this is just my personal opinion though!). This too should just be wrapped in something similar to `ColVecs` or `RowVecs` to specify what is what.
Is there a reason why? I think this ability to specialize on a standardized API and have all my packages "just work" together the way I expect them to is probably my favorite thing about Julia.
Not in Julia though :confused: Julia is column-major, so extracting each observation with the access pattern `x[:, i]` is faster than `x[i, :]`. Therefore one should always prefer the representation where each column is an observation.
I'm not sure about this, either from the perspective of intuitiveness/compatibility with Julia Base or speed:
julia> x = rand(3, 3);   # x is assumed to be a numeric Matrix here

julia> sort!.(x)
ERROR: MethodError: no method matching sort!(::Float64)
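The Base-consistent way to broadcast over slices is to ask for them explicitly, e.g. (continuing the snippet above):

julia> sort!.(eachcol(x));   # broadcast sort! over the column slices

julia> sort!.(eachrow(x));   # or over the row slices -- you have to say which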
Is there a reason why? I think this ability to specialize on a standardized API and have all my packages "just work" together the way I expect them to is probably my favorite thing about Julia.
Yeah, I'm 100% with this! What I'm trying to say is that we should strive to make things work with more general inputs, e.g. iterators rather than demanding an array, etc., rather than explicitly implementing support for Tables.jl:) But if a special impl would still be required, we could add some glue code in Turing.jl (the rest of the changes need to happen in DynamicPPL.jl) :+1:
In Base, broadcasts over arrays broadcast over each element, rather than slices of that array
I agree this particular aspect is counter-intuitive (I was surprised to find that we handled this case, tbh, when I first saw it), but it's the sort of feature that no one will use accidentally, since it's only applicable if you do `x .~ ::MultivariateDistribution`, which doesn't make sense otherwise. Either you're aware of it and use it, or you're not aware of it and you will never know of its existence. Though I agree there should be a better approach, hence the `Batch` idea I was talking about above. This would make the handling completely unambiguous and so no such surprises would occur :+1:
Btw, the reason why we've done the above is to replicate the way Distributions.jl handles collections of samples from a `MultivariateDistribution`.
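Concretely, the Distributions.jl convention being mirrored looks like this (a small sketch):

```julia
using Distributions

d = MvNormal(zeros(2), [1.0 0.0; 0.0 1.0])

X = rand(d, 10)     # 2×10 Matrix: each *column* is one sample
logpdf(d, X)        # 10-element Vector: one logpdf per column
logpdf(d, X[:, 1])  # logpdf of a single observation
```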
With regards to speed, the problem is that if users have to transpose their dataframe to properly treat each column as a single IID observation, a lot of them will do so, which will often create columns of mixed type that can't be specialized on. (Each dimension should have one type, but observations are often of mixed type). I'd expect this to matter more than locality of reference, no?
The thing is, though, we don't support observations of mixed types since Distributions.jl doesn't :confused: And so such an example would have to be handled properly anyways.
The fact that we don't support Tables.jl, for example, is not an issue with `.~` but a more general and separate issue that I 100% agree needs to be addressed:

1. Supporting more general things on the LHS of `~`, e.g. a `NamedTuple`.
2. Supporting more than a `Distribution` or `Array{<:Distribution}` on the RHS of `~`, e.g. an iterator of `Distribution`, or something from MeasureTheory.jl, or just any `struct` that implements a `rand` and a `logpdf`.

Once we've allowed that, extensions to Tables.jl and others will be muuuuch easier:) You could even just use `Tables.namedtupleiterator` or whatever it's called to do this once we had support for (1).
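E.g. something along these lines (a sketch, assuming a DataFrame as the table):

```julia
using DataFrames, Tables

df = DataFrame(a = rand(5), b = rand(5))

for row in Tables.namedtupleiterator(df)
    # each `row` is a NamedTuple, e.g. (a = 0.42, b = 0.13): one observation
end
```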
I started working on these two points a bit yesterday btw (https://github.com/TuringLang/DynamicPPL.jl/pull/292).
@ParadaCarleton Something like this: https://github.com/TuringLang/AbstractPPL.jl/pull/26?
I have no idea what that's doing but if you claim it helps I'll believe you. :sweat_smile:
I have no idea what that's doing but if you claim it helps I'll believe you. :sweat_smile:
Haha, I was referring to the example of sampling a `DataFrame` rather than the entirety of the PR :sweat_smile:
From the previous issue on calculations for `pointwise_loglikelihood`:

I'm unsure if this is the most intuitive behavior. The problem here is that, in general, the rows of a table represent independent observations, while each column is a different attribute.

My suggestion would be to use the following set of rules for `.~` broadcasting:

1. If the object being broadcasted over is a table from Tables.jl and is not an `Array`, Turing should recognize that each row is an independent observation for the purposes of `pointwise_loglikelihood` and any similar functions. (Although internally, it should be treated as a broadcast over columns -- this should improve performance, since columns usually have only one type.)
2. If the object is an array, not from Tables.jl, and not square, broadcasting over columns is safe -- `.~` will throw a dimension mismatch if the user meant to broadcast over rows. We can assume the user meant to broadcast over columns because Julia is column-major without risking too much.
3. If the object being broadcasted over is square and not a Table, or if it is both a Table and an `Array`, an error should be thrown.

This is annoying given Julia is column-major, but I think it's the best solution to avoid errors in data analysis.