PoisotLab / SimpleSDMLayers.jl

Simple layers for species distribution modeling and bioclimatic data
https://docs.ecojulia.org/SimpleSDMLayers.jl/stable/
MIT License
19 stars 2 forks source link

Integration with `MultivariateStats.jl`, enabling PCA/whitening across many SDMLayers #132

Open gottacatchenall opened 3 years ago

gottacatchenall commented 3 years ago

What the pull request does
Implements methods for taking a set of SDMLayers and applying multivariate transformations from JuliaStats/MultivariateStats.jl to do a variety of things: reduce the dimensionality of a data set (PCA, PPCA, KernelPCA), make layers without correlation (Whitening), or the other methods mvstats provides Type of change

Please indicate the relevant option(s)

Checklist

codecov[bot] commented 3 years ago

Codecov Report

Merging #132 (f2b8627) into main (b3b0f32) will increase coverage by 0.95%. The diff coverage is 55.04%.

@@            Coverage Diff             @@
##             main     #132      +/-   ##
==========================================
+ Coverage   54.13%   55.08%   +0.95%     
==========================================
  Files          35       36       +1     
  Lines        1101     1120      +19     
==========================================
+ Hits          596      617      +21     
+ Misses        505      503       -2     
Flag Coverage Δ
unittests 55.08% <55.04%> (+0.95%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/datasets/earthenv/topography.jl 0.00% <0.00%> (ø)
src/datasets/layernames.jl 0.00% <0.00%> (ø)
src/datasets/worldclim/elevation.jl 0.00% <0.00%> (ø)
src/integrations/DataFrames.jl 0.00% <0.00%> (ø)
src/operations/mask.jl 0.00% <0.00%> (ø)
src/operations/sliding.jl 0.00% <0.00%> (ø)
src/pseudoabsences/radius.jl 0.00% <0.00%> (ø)
src/pseudoabsences/randomselection.jl 0.00% <0.00%> (ø)
src/pseudoabsences/surfacerangeenvelope.jl 0.00% <0.00%> (ø)
src/recipes/recipes.jl 12.90% <12.90%> (-0.17%) :arrow_down:
... and 27 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 2e5dc79...f2b8627. Read the comment docs.

tpoisot commented 3 years ago

Can I be a nuisance? I'd like it a lot if we could do something like

W = fit(PCA, stack_of_layers)
smaller_stack_of_layers = transform(W, stack_of_layers)

In some cases, we don't want to reduce the dimensionality as much as we would like to e.g. whiten the data. I'd drop that in a file loaded when MultivariateStats is loaded, using Requires.

tpoisot commented 3 years ago

This makes me realize that we may want to have a new type, like SimpleSDMStack, to store layers that pass the _layers_are_compatible test. This is one we can use in mosaic (even though mosaic works when the overlap is incomplete), but would definitely work here. If we want to be smart about it, we can also have a 3d matrix and an array of names, and from this we extract the layers as required. I can suggest something in another PR.

gottacatchenall commented 3 years ago

overloading fit and transform should be easy enough

gottacatchenall commented 3 years ago

@tpoisot mostly there, current code works for PCA and PPCA but throws errors for KernelPCA and Whitening, although those could be a product of using randomly generated matrices without correlation in the current tests.

also given the function call structure it requires computing the input matrix in both fit and transform, might start to cause performance concerns for big/many layers

do you want to merge this into a different branch to implement SDMStacks?

tpoisot commented 3 years ago

I don't think this needs to have stacks yet - but we can merge, and add the stacks later?

gottacatchenall commented 3 years ago

just pushed docs and a type assert for PCA/PPCA for now, do you merge into a dev branch before merging to main and tagging? @tpoisot

tpoisot commented 3 years ago

So I don't think a type assert is required - Whitening fails sometimes because it requires a specific matrix type.

I'll review and merge/tag, there's no dev branch.

gottacatchenall commented 2 years ago

removed the type assert

tpoisot commented 2 years ago

Can you add a documentation update?

tpoisot commented 2 years ago

there's a _make_input method that doesn't look defined

gottacatchenall commented 2 years ago

looks like i dropped that when shifting to the common_keys = reduce(∩, keys.(layers)) call but didn't add it to fit, should work now

tpoisot commented 2 years ago

Any reason why the plots here https://docs.ecojulia.org/SimpleSDMLayers.jl/previews/PR132/examples/pca/ have values in 0-1?

gottacatchenall commented 2 years ago

right now transform returns map(f->rescale(f, (0,1)), newlayers). primarily did this to emphasize output dimensions are unitless but MVStats doesn't do this so might make sense to remove it

also fit(Whitening, layers) is still failing with non-posdef matrix on the same input data as the doc example

tpoisot commented 2 years ago

Yeah let's not do the rescale, because we might want to backproject at some point

tpoisot commented 2 years ago

also fit(Whitening, layers) is still failing with non-posdef matrix on the same input data as the doc example

this probably "just" need a little explanation that sometimes Whitening giveth, and sometimes it taketh away

gottacatchenall commented 2 years ago

fixed the whitening example (i think), needed a slightly different overload of transform

gottacatchenall commented 2 years ago

actuatlly i might be overcomplicating it a bit

gottacatchenall commented 2 years ago

The implementation of transform could be one function but runs into ambiguity with

transform(f::Whitening, x::AbstractVecOrMat{T} where T) in MultivariateStats transform(proj::T, layers::AbstractVecOrMat{U}, kwargs...) where {T, U<:SimpleSDMLayer} in SimpleSDMLayers

gottacatchenall commented 2 years ago

now implemented via a single function call, the only irritant being in the case that if SimpleSDMLayers and MultivariateStats are both in the namespace, it requires explicit usage of SDM.transform/MV.transform

tpoisot commented 2 years ago

now implemented via a single function call, the only irritant being in the case that if SimpleSDMLayers and MultivariateStats are both in the namespace, it requires explicit usage of SDM.transform/MV.transform

Wait, there's a transform in SDM?

gottacatchenall commented 2 years ago

not at the moment, but implementing a transform method within the SDM namespace (only in the case MVStats is also loaded) is a way to avoid the method ambiguity issue above (which in the long term could probably be solved by a PR within MVStats.jl)

edit: no need to export if it is dependent on MVStats being loaded, it just requires SDMs.transform() which isn't ideal

essentially this is a problem with MVStats.jl dispatch though

gottacatchenall commented 2 years ago

In the long term I think the julianesque solution is for MVStats.jl to define all its (fit, transform) methods on different models (pca, ppca, whitening) around an a single method dispatched on a single abstract type, as opposed to their current definitions which are fit(::PCA, matrix) = ..., fit(::Whitening, matrix) = ..., etc.

tpoisot commented 2 years ago

So can't we temporarily solve this with generated code?

gottacatchenall commented 2 years ago

macros are probably the feature of julia i understand the least, but if that's a neat way to resolve the transform function call ambiguity i'm all ears

tpoisot commented 2 years ago

Let me try

tpoisot commented 2 years ago

OK so I don't think it needed the code generation - I'll check the example page before releasing

gottacatchenall commented 2 years ago

yeah it runs into the same dispatch ambiguity without defining a transform in the SDM namespace

gottacatchenall commented 2 years ago

there's also an inconsistency in how transform is defined across different mv transforms in MVStats.jl which should probably get raised as an issue there

gottacatchenall commented 2 years ago

seems there is an MVStats.jl issue about exactly this issue, not sure how active MVStats.jl development is

tpoisot commented 2 years ago

Mhhhh. Running test doesn't show a method error on my side, are you sure the test suite is complete?

gottacatchenall commented 2 years ago

i think it doesn't fail because i use newlayers for both plots in the docs src. but clearly doesn't work because the plot under the whitening example in the docs is the same as the above plot and should only be 2 maps not 4

gottacatchenall commented 2 years ago

running locally gives me the same dispatch ambiguity

gottacatchenall commented 2 years ago

and i should 100% add a test of Whitening, the issue is how to get a predefined input that it doesn't randomly fail on w/ pos-def matrix problem

gottacatchenall commented 2 years ago

Visited this again, still can't find a simple way around this ambiguity

MethodError: transform(::MultivariateStats.Whitening{Float64}, ::Vector{SimpleSDMResponse{Float64}}) is ambiguous. Candidates:
    transform(f::MultivariateStats.Whitening, x::AbstractVecOrMat) in MultivariateStats at /home/michael/.julia/packages/MultivariateStats/HTpHt/src/whiten.jl:38
    transform(proj::P, layers::AbstractVecOrMat{U}; kwargs...) where {U<:SimpleSDMLayer, P<:Union{MultivariateStats.PCA, MultivariateStats.Whitening}} in SimpleSDMLayers at /home/michael/code/SimpleSDMLayers.jl/src/integrations/MultivariateStats.jl:26
  Possible fix, define
    transform(::P, ::AbstractVecOrMat{U}) where {U<:SimpleSDMLayer, P<:MultivariateStats.Whitening}

The possible fix from the error message is implemented but doesn't actually fix the problem