MrUrq / LatinHypercubeSampling.jl

Julia package for the creation of optimised Latin Hypercube Sampling Plans

Add periodicity to the objective #8

Closed · eelregit closed this issue 4 years ago

eelregit commented 4 years ago

First of all thanks a lot for the package.

I have used it to optimize a design of 2k points in 7D, and afterwards to optimize a selection of 200 points from it. However, I found that the interior of the subset looks emptier than the skin of the hypercube.

This didn't show up in a 2D toy example I did earlier, so I suspect it has to do with the fact that the skin-to-interior volume ratio grows rapidly with the number of dimensions.

A straightforward fix is to make the hypercube periodic, and always use the closest pair among all images when doing the pair summation.
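
In code, the per-coordinate minimum-image convention might look like this (a minimal sketch, assuming an integer grid with period n; the function name is illustrative, not part of the package):

# Squared minimum-image distance between points x and y on a periodic integer grid.
function min_image_dist2(x, y, n)
    d2 = 0
    for k in eachindex(x)
        delta = abs(x[k] - y[k])
        delta = min(delta, n - delta)  # use the closer of the two periodic images
        d2 += delta^2
    end
    return d2
end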

Below are the plan and the fitness history for the 2k-point design: [plots: plan, fit]

And the subset: [plots: subplan, subfit]

eelregit commented 4 years ago

The subset plan looks more non-uniform. The reason could be that, in the subset, the points are not restricted to take all the values along each dimension.

MrUrq commented 4 years ago

Thank you for the PR. I've had a look, and I think it could be useful to include the periodic way of calculating the sampling plan in this package. In that case I would like to make it an optional keyword argument that people can opt in to, so that we can publish the package without breaking semver; later, in a breaking update, it could be made the standard.

Is this https://www.fce.vutbr.cz/STM/vorechovsky.m/papers/32j.pdf the paper you got the information from?

Looking at your implementation, I don't think it follows exactly what that paper did. I have posted code below which I think matches the paper. Please have a look and see what you think.

function _AudzeEglaisDist(LHC)
    n,d = size(LHC)
    dist = 0.0

    # Squared l-2 norm of distances between all (unique) points
    for i = 2:n
        for j = 1:i-1
            dist_tmp = 0.0
            for k = 1:d
                @inbounds dist_comp = abs(LHC[i,k]-LHC[j,k])
                dist_comp = min(dist_comp,n-1-dist_comp) # take the closer of the direct and wrapped separations
                dist_tmp += dist_comp^2
            end
            @inbounds dist += 1/dist_tmp
        end
    end
    output = 1/dist
    return output
end

For your problem, I think it might look worse than it actually is. When plotting slices of multi-dimensional problems, you might have close neighbours which aren't shown in the plot. 2D problems, by contrast, give a quick and visual way of checking the space-filling while showing all the points. I think the same visualisation problem exists for the full set that you plotted; it is only masked because the number of samples shown is large. You could try recreating the plots with 200 samples, but instead of using a subset, optimize the plan using the regular LHC function (see the sketch below) and check whether the plots look similar.
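
A sketch of that comparison (assuming this package's LHCoptim and subLHCoptim signatures; both return a tuple of the plan and its fitness history, and gens is an arbitrary generation count):

using LatinHypercubeSampling

gens = 1000                                      # illustrative generation count
plan, fit = LHCoptim(2000, 7, gens)              # full 2k-point plan
subplan, subfit = subLHCoptim(plan, 200, gens)   # 200-point subset of the full plan
direct, dirfit = LHCoptim(200, 7, gens)          # 200 points optimized directly

# Plot projections of subplan and direct side by side to compare their space-filling.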

eelregit commented 4 years ago

Keyword argument is a great idea. I would have done that if I spoke Julia :wink:

I hadn't found that paper before submitting this PR; when Google showed it to me afterwards, I assumed they were the same. I didn't read it carefully, but the patterns in Figs. 3-4 resemble the 2k plan I posted above, where the corners look somewhat emptier.

Regarding dist_comp = min(dist_comp,n-1-dist_comp), I think there could be two problems. One is that the first argument can be negative while larger than the second in magnitude. The other is that the period is n instead of n-1: e.g. 1 and 1+n are the same point, right?

Thank you for helping with my problem. The 2D panels are actually projections of the full 7D plan, so all points should appear in each panel. I just found a typo on that line in the PR, and it's fixed now. I will post a plot once I manage to fix the sub-plan problem.

eelregit commented 4 years ago

The sub plan now looks uniform: [plot: subplan]. Interestingly, the fitness score is now of a very different order of magnitude: [plot: subfit]

BTW, would you mind explaining why you added the @inbounds in @inbounds dist += 1/dist_tmp?

MrUrq commented 4 years ago

Yes, you are right, the period is n. To solve the issue of negative distances, an abs has been added to the dist_comp in the code posted above.

Both our implementations give the same answers now. I benchmarked each implementation, yours using round and mine using min, as well as the standard one, and got this:

using BenchmarkTools
X = randomLHC(500,500)

julia> @btime _AudzeEglaisDistMin(X)
  159.086 ms (1 allocation: 16 bytes)
83.53467143366808

julia> @btime _AudzeEglaisDistRound(X)
  193.023 ms (1 allocation: 16 bytes)
83.53467143366808

julia> @btime LatinHypercubeSampling._AudzeEglaisDist(X)
  96.597 ms (1 allocation: 16 bytes)
166.88670943156933
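
The round-based variant replaces the min line with something like the following (a sketch; the exact _AudzeEglaisDistRound body isn't posted in this thread):

delta = LHC[i,k] - LHC[j,k]   # signed separation along dimension k
delta -= n*round(delta/n)     # shift by whole periods to the nearest image
dist_comp = abs(delta)        # distance to the closest periodic image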

If it is fine with you, I would like to use the min implementation: it is a little faster, and it follows the convention of the paper I found, which makes the code easy to compare with the paper. What do you think? Please try both and confirm that you see the same result.

Ah okay, I didn't realize you projected the plots, that is nice! Could you post the code you used to make those plots? It could be useful to include in the documentation for other users to quickly explore their plans.

Thanks for posting the images; this looks much better now. Regarding the fitness value, it makes sense that it has been reduced, since the periodic distance is now used for some pairs of points.

The last @inbounds is leftover from when I recently rewrote the _AudzeEglaisDist function to increase its speed. In this case it does nothing and should be removed (although it doesn't cause a slowdown either).
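
For reference, @inbounds only elides the bounds check on array indexing, so:

@inbounds dist_comp = abs(LHC[i,k]-LHC[j,k])  # bounds checks on LHC[i,k] and LHC[j,k] elided
@inbounds dist += 1/dist_tmp                  # no array indexing here, so the macro is a no-op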

eelregit commented 4 years ago

Yes you are right. I missed the abs(). Your implementation is faster and should definitely be used. Thanks for showing the benchmark.

The plotting script is in python:

#!/usr/bin/env python3

import numpy as np
import matplotlib.pyplot as plt

# fitness

plt.semilogx(np.loadtxt('fit.txt'))
plt.savefig('fit.pdf')

# design

plan = np.loadtxt('plan.txt', unpack=True)

params = 7

fig, axes = plt.subplots(params - 1, params - 1,
        figsize=(4 * params - 4, 4 * params - 4))

for i in range(1, params):
    for j in range(i):
        axes[i-1, j].scatter(* plan[[i, j]], s=0.3)
    for j in range(i, params - 1):
        axes[i-1, j].axis('off')

plt.savefig('plan.pdf')

For the subset it's almost the same, just with bigger dots:

#!/usr/bin/env python3

import numpy as np
import matplotlib.pyplot as plt

# fitness

plt.semilogx(np.loadtxt('subfit.txt'))
plt.savefig('subfit.pdf')

# design

subplan = np.loadtxt('subplan.txt', unpack=True)

params = 7

fig, axes = plt.subplots(params - 1, params - 1,
        figsize=(4 * params - 4, 4 * params - 4))

for i in range(1, params):
    for j in range(i):
        axes[i-1, j].scatter(* subplan[[i, j]], s=1.0)
    for j in range(i, params - 1):
        axes[i-1, j].axis('off')

plt.savefig('subplan.pdf')

You are right: with the range of interaction cut at half the box size, the fitness is much lower. The higher the dimension, the more the long-range part of the force/potential dominates, due to a larger volume factor. In 2D, if we look at the particles in an annulus of width dr around a point of a uniform distribution, an annulus with a smaller radius r contributes more to the objective. In 3D, spherical shells of width dr all contribute about the same to the potential energy. In more than 3 dimensions, farther hyper-shells start to contribute more and dominate. This means that the optimization will prefer large-scale homogeneity over small-scale homogeneity. Maybe this isn't what we wanted? I wonder why people use the Audze-Eglais objective with the fixed power law of -2. The reference above mentions other powers, but did they have benchmarks for them?
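
To put that counting argument in equations (a back-of-the-envelope sketch): for a uniform density in $D$ dimensions, a shell of radius $r$ and width $\mathrm{d}r$ around a point contains $\propto r^{D-1}\,\mathrm{d}r$ neighbours, so with a pair potential $L^{-p}$ the shell's contribution to the objective scales as

$$r^{D-1} \cdot r^{-p}\,\mathrm{d}r = r^{D-1-p}\,\mathrm{d}r .$$

With the Audze-Eglais power $p=2$ this is flat in 3D and grows with $r$ for $D>3$, while $p=D$ gives $r^{-1}\,\mathrm{d}r$, which weights all scales equally in $\log r$.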

MrUrq commented 4 years ago

Thanks for the plotting code, I will see if I can get that added to the documentation soon.

I will work on the implementation, using an optional parameter to enable the periodic calculation of AE, and document it.

Interesting observation. This is beyond my knowledge of sampling plans, but if you find a different way of calculating the objective function, it could be added. An alternative is to allow user-supplied functions for calculating the objective.

eelregit commented 4 years ago

Thanks. Maybe another keyword argument for the power p can be added? There is some description around Eq. 5 in the paper above (see also).

Letting users provide their own objectives is certainly a good idea, maybe in another PR :smile:

For now I am using L^-d, which I think balances the long and short ranges:

@inline function _AudzeEglaisDist(LHC)
    n,d = size(LHC)
    dist = 0.0

    # Squared l-2 norm of distances between all (unique) points
    for i = 2:n
        for j = 1:i-1
            dist_tmp = 0.0
            for k = 1:d
                @inbounds dist_comp = abs(LHC[i,k]-LHC[j,k])
                dist_comp = min(dist_comp,n-dist_comp)
                dist_tmp += dist_comp^2
            end
            dist += 1/dist_tmp^(d/2)  ##### power -d
        end
    end
    output = (n/dist)^(1/d)  ##### average over points, also unify dimension
    return output
end
MrUrq commented 4 years ago

The code is done, and a new version (v1.6.1, https://github.com/JuliaRegistries/General/pull/10849) is available with the periodic_ae and ae_power options. Use it as

LHCoptim(2000,7,gens; periodic_ae=true, ae_power=2)

Large integer powers can cause overflow, leading the Audze-Eglais objective function to fail silently. Use floats for large powers, at the cost of computational speed.
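
For example (illustrative values):

LHCoptim(2000,7,gens; periodic_ae=true, ae_power=2)     # integer power, fastest
LHCoptim(2000,7,gens; periodic_ae=true, ae_power=10.0)  # float power, avoids integer overflow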

If you've got time please try the published version on the full plan and the subset and post the plots. Let me know if this has improved your problem.

I haven't included the normalization of the output as this is something the user could do.

There is currently a performance regression due to the inclusion of the power but I will try to improve it.

eelregit commented 4 years ago

Thanks. There's a difference between the published version and my code above: my code computes the squared Euclidean distance before taking the power ae_power/2, which agrees with the earlier reference. Besides, computing the squared Euclidean distance in integers is faster, because only one floating-point power is then needed per pair of points.
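
Concretely, the inner loop I have in mind looks like this (a sketch using the variable names from the code above, with ae_power as in the new keyword):

dist_tmp = 0                      # integer accumulator for the squared distance
for k = 1:d
    delta = abs(LHC[i,k] - LHC[j,k])
    delta = min(delta, n - delta) # periodic minimum image
    dist_tmp += delta^2           # stays in integer arithmetic
end
dist += 1/dist_tmp^(ae_power/2)   # a single floating-point power per pair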

eelregit commented 4 years ago

BTW, it'd be appreciated if you could include me as a coauthor on the commit. But no worries if it's too late.

MrUrq commented 4 years ago

Thank you for notifying me of the problem. It has been fixed in version 1.6.2. Of course you should be a coauthor; I wasn't aware of that feature! You have now been added as a coauthor on the commit.

eelregit commented 4 years ago

Thanks! :)