baggepinnen / SlidingDistancesBase.jl

Defines distance_profile and utilities
MIT License

Questions about normalizers #5

Open ericphanson opened 4 years ago

ericphanson commented 4 years ago

Hi again @baggepinnen

I was wondering what IsoZNormalizer does, and whether the #undefs here are expected:

julia> IsoZNormalizer(rand(5,5), 5)
5×5 IsoZNormalizer{Float64}:
 #undef  0.887362   0.824518  0.57944   0.818286
 #undef  0.795543   0.666981  0.453675  0.620304
 #undef  0.714766   0.091258  0.826908  0.578601
 #undef  0.654904   0.197439  0.784394  0.554519
 #undef  0.0581505  0.613653  0.165785  0.449987

I was reading through https://www.cs.unm.edu/~mueen/DTW.pdf and they say that z-normalization is essential, and in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5668684/ they mention that in the multivariate case, each dimension should be z-normalized separately. Is that what IsoZNormalizer does?

baggepinnen commented 4 years ago

IsoZNormalizer is short for "isometric", which is what it started out as. Now it actually uses a diagonal covariance matrix, i.e., a separate variance for each dimension, so the name is misleading and I should change it.

The undefs are due to advance! not having been called; it adds one point at a time so that, if early stopping is employed, it doesn't do unnecessary work. You can see how it's used in the tests https://github.com/baggepinnen/SlidingDistancesBase.jl/blob/a124a374add252d8e7637da805c66b7f92c49826/test/test_normalizers.jl#L76

or in DTW https://github.com/baggepinnen/DynamicAxisWarping.jl/blob/fc9b65091911707ea4dab54d91bd7d0b3aebf966/src/dtwnn.jl#L201
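For reference, the usage pattern looks roughly like the sketch below. The exact constructor arguments and the signature of advance! are assumptions on my part; the linked test is the authoritative source.

using SlidingDistancesBase

x = rand(5, 100)            # 5-dimensional signal with 100 samples
n = IsoZNormalizer(x, 10)   # normalizer over sliding windows of length 10

# Assumed pattern (see the linked test for the real one): each call to
# advance! ingests one more sample, so the window statistics are only
# updated as far as the search has progressed, enabling early stopping.
advance!(n)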

I can add some docs for the normalizers if you want to add new ones; so far I only have Z and the poorly named IsoZ.

ericphanson commented 4 years ago

Ah okay, thanks for explaining. Some docs would be great! But I can't promise that I'll add new normalizers any time soon, so no worries if it's not a priority.

Right now, I am interested in using normalizers with sparse_distmat. However, it seems like in this context it makes more sense to prenormalize each signal instead of doing it online (like in the dtwnn context). So I'm just doing

using StatsBase

# z-normalize each row (dimension) of X in place: dims=2 fits a separate
# mean and standard deviation per row, across time.
function z_normalize!(X)
    dt = fit(ZScoreTransform, X, dims=2)
    StatsBase.transform!(dt, X)
end

z_normalize!.(y)

before passing y to sparse_distmat.

baggepinnen commented 4 years ago

Yes, if the goal is not to operate on sliding windows of a long sequence, the normalizer types have little benefit and you'd be just as well off normalizing in advance.

The interface for the normalizers got overly complicated, but I couldn't see a straightforward way of improving it so it's left at being complicated :/

Note that sparse_distmat is not super smart, and you might be able to improve upon the performance by clever use of some accelerating data structure. It actually computes all O(N^2) distances, but only stores a small number of them. It does make use of some pruning and stuff like that, but something like a ball tree or a VPTree could potentially allow for even earlier termination or skipping some distance computations entirely.
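As a rough sketch (not the package's code, and assuming all signals are prenormalized and have equal length), one could use NearestNeighbors.jl's BallTree under the Euclidean metric as a cheap prefilter to pick candidate neighbors, and only compute the expensive distance for those candidates:

using NearestNeighbors

# Hypothetical prefilter, not part of SlidingDistancesBase: build a ball tree
# over the flattened signals and use cheap Euclidean k-NN queries to generate
# candidate pairs for the expensive distance computation.
function candidate_pairs(y, k)
    data = reduce(hcat, vec.(y))            # each column is one flattened signal
    tree = BallTree(data)
    idxs, _ = knn(tree, data, k + 1, true)  # k+1 because each point finds itself
    [(i, j) for (i, nbrs) in enumerate(idxs) for j in nbrs if j != i]
end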

It also doesn't use any threading, which should be quite easy to add.
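For example, a naive threaded all-pairs loop (a sketch only, not how sparse_distmat is structured, and assuming the distance function dist is thread-safe) could look like:

# Compute a dense symmetric distance matrix with one thread per outer index.
# Note the triangular inner loop gives uneven work per thread; good enough
# as a starting point.
function pairwise_threaded(dist, y)
    N = length(y)
    D = zeros(N, N)
    Threads.@threads for i in 1:N
        for j in i+1:N
            D[j, i] = D[i, j] = dist(y[i], y[j])
        end
    end
    D
end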