Batch and inplace logpdf/pdf

lindahua commented 11 years ago

I think the following is useful.

r = logpdf(d, x) # x is a set of samples
logpdf!(r, d, x)  # r is the output array

Most of the important distributions (the Uniform distribution is an exception) belong to the exponential family. This means that the core of computing logpdf is a dot product between the parameters and the sufficient statistics. When evaluating logpdf for a set of samples, BLAS functions can be used to speed up the computation (often drastically).
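To make the dot-product point concrete, here is a minimal sketch for the Normal distribution, written in current Julia syntax rather than the syntax used elsewhere in this thread. The function name `normal_logpdf!` is hypothetical and is not part of Distributions.jl. It stacks the sufficient statistics (x, x^2) into a matrix so the whole batch reduces to one matrix-vector product.

```julia
using LinearAlgebra  # for mul!, which dispatches to BLAS gemv for Float64 arrays

# Hypothetical helper, not part of Distributions.jl: batch log-density of
# Normal(mu, sigma) in exponential-family form, written into the output r.
function normal_logpdf!(r::Vector{Float64}, mu::Float64, sigma::Float64, x::Vector{Float64})
    n = length(x)
    length(r) == n || throw(DimensionMismatch("r and x must have the same length"))

    # Sufficient statistics T(x) = (x, x^2), one row per sample.
    T = hcat(x, x .^ 2)

    # Natural parameters and log-partition function of the Normal distribution.
    theta = [mu / sigma^2, -1 / (2 * sigma^2)]
    A = mu^2 / (2 * sigma^2) + log(sigma)

    mul!(r, T, theta)              # r = T * theta, a single matrix-vector product
    r .-= A + 0.5 * log(2 * pi)    # subtract the constant terms in place
    return r
end

x = [-1.0, 0.0, 2.5]
r = similar(x)
normal_logpdf!(r, 1.0, 2.0, x)

# Direct evaluation of the textbook formula, for comparison.
direct = [-(xi - 1.0)^2 / (2 * 2.0^2) - log(2.0) - 0.5 * log(2 * pi) for xi in x]
```

The sketch still allocates the statistics matrix `T` on every call; when the samples stay fixed across iterations, `T` could be computed once and reused.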

Currently, batch evaluation is implemented for many univariate distributions, but it is still lacking for some multivariate distributions.

Inplace evaluation is also important. In a lot of inference/estimation algorithms (e.g. EM), one has to repeatedly evaluate logpdf at each iteration (on the same set of samples). It would be much more efficient to write the results into a pre-allocated array rather than creating a new array every time.

Generally, I think we can do it this way: implement a specialized logpdf! method for each distribution type, and write logpdf on the abstract distribution types as follows

function logpdf{T<:Real}(d::UnivariateContinuousDistribution, x::Array{T})
    r = Array(T, size(x))
    logpdf!(r, d, x)
    r
end

function logpdf{T<:Real}(d::MultivariateContinuousDistribution, x::Matrix{T})
    r = Array(T, size(x, 2))
    logpdf!(r, d, x)
    r
end

Similar things can be done for discrete distributions, and we should do the same for pdf.
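As an illustration of how the same pattern might carry over to a discrete distribution and to pdf, here is a sketch in current Julia syntax. `ToyGeometric` is a hypothetical stand-in type, not the Geometric type from Distributions.jl, and `pdf!` is assumed to mirror `logpdf!`.

```julia
# Hypothetical toy type for illustration only; not the Distributions.jl Geometric.
struct ToyGeometric
    p::Float64   # success probability, assumed to satisfy 0 < p < 1
end

# In-place log-pmf: P(X = k) = (1 - p)^k * p for k = 0, 1, 2, ...
# (the sketch does not check that the entries of x are non-negative)
function logpdf!(r::AbstractArray, d::ToyGeometric, x::AbstractArray{<:Integer})
    size(r) == size(x) || throw(DimensionMismatch("r and x must have the same size"))
    r .= x .* log1p(-d.p) .+ log(d.p)
    return r
end

# In-place pmf, obtained by exponentiating the log-pmf in the same buffer.
function pdf!(r::AbstractArray, d::ToyGeometric, x::AbstractArray{<:Integer})
    logpdf!(r, d, x)
    r .= exp.(r)
    return r
end

# Allocating wrappers that delegate to the in-place methods.
logpdf(d::ToyGeometric, x::AbstractArray{<:Integer}) =
    logpdf!(Array{Float64}(undef, size(x)), d, x)
pdf(d::ToyGeometric, x::AbstractArray{<:Integer}) =
    pdf!(Array{Float64}(undef, size(x)), d, x)

# Reusing one buffer across iterations, as an EM-style loop would.
x = [0, 1, 2, 3]
buf = Vector{Float64}(undef, length(x))
for p in (0.2, 0.5, 0.8)
    logpdf!(buf, ToyGeometric(p), x)   # overwrites buf instead of allocating a new output
end
probs = pdf(ToyGeometric(0.5), x)
```

Only the wrappers allocate, so a caller that evaluates the same samples many times can hold on to `buf` and call the in-place methods directly.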

johnmyleswhite commented 11 years ago

Yes, these would be very helpful.

lindahua commented 11 years ago

I have finally gotten around to working on this.

johnmyleswhite commented 11 years ago

I'd like to go through and ensure this method exists for every distribution. Are you opposed to changing the order of arguments to logpdf!(d, x, r) instead of the current logpdf!(r, d, x)?

lindahua commented 11 years ago

The reason for using logpdf!(r, d, x) is to keep it consistent with other kinds of probabilistic models that involve multiple variables (e.g. conditional distributions):

Consider a simple model as below

y ~ N(a' x, sigma)

This is a probabilistic formulation of a linear regression model. In such a model I wish to be able to write

logpdf!(r, d, x, y)

In some more generic algorithms (e.g. estimation of finite mixture models), it is nice to be able to write

logpdf!(r, d, x...)

I am actually using such syntax in a probabilistic inference package, which I am still working on.
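A minimal sketch of why the output-first order combines well with varargs, in current Julia syntax. The types `ToyNormal` and `ToyLinearGaussian` and the function `component_logpdfs` are hypothetical names made up for illustration; they are not taken from this package or from the inference package mentioned above.

```julia
# Hypothetical model types for illustration; neither is an existing package API.
struct ToyNormal
    mu::Float64
    sigma::Float64
end

struct ToyLinearGaussian      # y ~ N(a' x, sigma)
    a::Vector{Float64}
    sigma::Float64
end

normal_logpdf(mu, sigma, y) = -(y - mu)^2 / (2 * sigma^2) - log(sigma) - 0.5 * log(2 * pi)

# One observed variable: logpdf!(r, d, x)
function logpdf!(r::AbstractVector, d::ToyNormal, x::AbstractVector)
    r .= normal_logpdf.(d.mu, d.sigma, x)
    return r
end

# Two observed variables: logpdf!(r, d, x, y), with one column of x per sample
function logpdf!(r::AbstractVector, d::ToyLinearGaussian, x::AbstractMatrix, y::AbstractVector)
    r .= normal_logpdf.(x' * d.a, d.sigma, y)
    return r
end

# Generic code only needs to know that the output comes first; the remaining
# arguments are forwarded untouched. Fills an n-by-K matrix of log-densities,
# one column per component.
function component_logpdfs(components, n::Integer, obs...)
    L = Matrix{Float64}(undef, n, length(components))
    for (k, d) in enumerate(components)
        logpdf!(view(L, :, k), d, obs...)
    end
    return L
end

x = [1.0 0.0 2.0; 0.0 1.0 1.0]   # 2 features, 3 samples
y = [1.0, -1.0, 0.5]

L1 = component_logpdfs([ToyNormal(0.0, 1.0), ToyNormal(1.0, 2.0)], 3, y)
L2 = component_logpdfs([ToyLinearGaussian([1.0, -1.0], 1.0)], 3, x, y)
```

With the output last, as in logpdf!(d, x, r), the generic function would have to split the trailing argument off the varargs before forwarding them.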

Within the scope of this package, I think either way is fine. However, putting the output first, as in logpdf!(r, d, x), makes it possible to enforce consistency across packages from a broader perspective.

johnmyleswhite commented 11 years ago

The varargs case is a very compelling argument. Do we have any distributions currently implemented that use varargs?

I am a big fan of consistency, so I'd like to clean this up. What troubles me is that we already have inconsistencies: rand!(d, A) has the mutating argument at the end, whereas logpdf!(r, d, x, ..) has it at the front.

lindahua commented 11 years ago

This issue was addressed weeks ago, so I am closing it.

johnmyleswhite commented 11 years ago

Ok. I do still wish we could standardize on placing the mutated arguments to functions at the front of the argument list, but this would be a major change to rand!.

JuliaStats / Distributions.jl

Batch and inplace logpdf/pdf #23