Bersaelor / KDTree

Swift implementation of a k-dimensional binary space partitioning tree.
MIT License
153 stars 26 forks source link

Dynamic datapoint dimensions (fixed per tree) #53

Open AlexDM0 opened 4 years ago

AlexDM0 commented 4 years ago

Hi there! We'd like to use your KDTree classifier for indoor localisation in a smart home. I liked your talk so I was looking into experimenting with this lib.

Our datavectors (per house for instance) have X dimension, but another house may have Y dimensions. The user is in control of adding houses, and he determines how many dimensions a vector has (by having more/less beacons).

We want to generate a tree for each house and the issue then becomes the static let dimensions. We can set the dimensions of all datapoints by Element.dimensions = X. We could keep a number of types beforehand, but keeping this artificial limit on the amount of houses seems a bit ugly. I can't see a way to have a "type-factory" that generates these structs with custom dimensions either.

I could clone your lib and change the 3 instances where you take it from Element and instead take it from the provided element itself. This makes it of course the responsibility of the user to ensure all datapoints have the same dimensions.

What's your take on this?

Bersaelor commented 4 years ago

First , thanks for the interest, it's so great to hear when people are actually using the code!

Mhmm, good question about the dimensions though. So this would make static var dimensions: Int { get } -> var dimensions: Int { get } an instance variable and shouldn't change too many things but opens the issue of potential size mismatches at runtime as you said.

I mean, you can fork the project, apply your proposed changes and then make a Pull Request.

Then we can see how many changes are necessary and what the impact on the codebase and on the potential performance are?

AlexDM0 commented 4 years ago

I'd be happy to make a pull request, but I'm not exactly sure how to properly do it while protecting the user from not putting different dimensions into it.

I'm still pretty into OO design, ( i know..), so from that point of view, hey a tree should just know it's own dimensions. But enums don't do that.

The init can take the first element and use the dimension of that one, we could throw error on init without values, we should loop over all values to check dimensions and throw error on difference. Or we provide a dimensions value in the init, but we can't store it nor check for it in future operations.

The insert and remove operations can also get the dimension from the provided value, or get it as an extra argument.

From what I see, that's it. I don't know how to make it so that we constantly guard the user against user error. I'd do it myself, for my purposes, I'll embrace the terrible crash if I mess up :). That's not friendly for most users though and shouldn't make it into the core.

It will also break everyones code, since the base datapoint protocol needs changing.

If you know of a good way around this I'm open to building it (except full rewrites :p)

EDIT: removed unfinished sentence.

Bersaelor commented 4 years ago

Oh well, I'd say try to open a PR, to see that tests succeed and all and we go from there :)

This way we can discuss the individiual changes in the code much better.

(each node in the tree has an element, so if you add a new node, we can check f the new node has the same dimension, so it shouldn't be to hard to make a good error message).

AlexDM0 commented 4 years ago

I have made a pull request without breaking anything, any thrown errors will make breaking changes. It is not as friendly as could be, but at least it is fully optional.

Bersaelor commented 4 years ago

Mhmm, it looks like the automatic Unit tests aren't running through. I'll probable have to look into that. Unfortunately rather busy this weekend and afk, give me a few days and I'll look into this.

AlexDM0 commented 4 years ago

Sure, no worries. I’ll just continue building my integration after the weekend and we can see how (and if) you want to integrate the result.

Bersaelor commented 4 years ago

Just for understanding, @AlexDM0 , in your use-case, the number of dimensions is very diferrent per house, it's not just that all houses fall into 2D, 3D and 4D categories, or s similarly limited list?

AlexDM0 commented 4 years ago

A house contains a number of broadcasting beacons, which we sell (called Crownstones). Each user can have a different amount of beacons in his house. A user can be invited to join the houses of friends or an office, which can also contain a different number of beacons.

I have 35 Crownstones in my house, so that is 35 dimensions. Another user may have 4, or 10. The more you have, the more accurate the room level localisation is.

We have testing houses with 50+ Crownstones. We could limit the amount of houses to say 20, and keep 20 instances of datapoint definitions, where we can set the dimensions at runtime, but that's a little dirty :)

Bersaelor commented 4 years ago

Mhmm, yes, I see.

So each beacon can then have a Float-value, and a datapoint that you read has a float-distance for each beacon? like

struct Beacon {
    // ... 
    func value() -> Float { .. }
}

let beacons: [Beacon] = [ ... ] // n beacons
var dataPoint = beacons.map { beacon in
    return beacon.value()
}
// dataPoint is now [Float] with size n 

And now for each house you have a Set of Float-vectors where each vector has the same size for that house&app-install?

dataPoints: Set([Float])

Bersaelor commented 4 years ago

I'm wondering whether this is actually a rather frequent use case and whether we should build it more clearly into the type KDTreePoint by making dimensions an instance variable.. got to ponder the implications for a bit.

I know, the dimensionsoverride works, but it feels less clean.

AlexDM0 commented 4 years ago

Int value actually, they are signal strength indicators. I'll first construct a sorted list from all the collected datapoints. So index 0 is beacon X index 1, beacon Y etc. This array goes into each datapoint (with values for missing data) so all have the same dimensions. Each data point has a room label, and this sorted list. Every incoming dataset will be mapped to this sorted array, and queried in the KD tree.

I'll use the distance function with one we have here to get the distance estimate between points.

At least, that's what I had in mind so far. It's an experiment :). We have a number of difference classifiers and I wanted to add this method as a test.

It is indeed an ugly solution, but if we'd do a cleaner version, it would break everyones code since they use the current datapoint protocol.

I wouldnt blame you if you wouldnt merge this :) It is for a usecase that you not really designed for and thats perfectly fine.

Bersaelor commented 4 years ago

It might be that all people have to do is to delete the word static in front of static var dimensions = X, and the rest of the code should just work. Would be a breaking change, but it would increase the number of use-cases KDTree could be applied to.

Will look into it later today, I think there is also a way of annotating the code, so Xcode can make the right suggestion how people can fix their code when the load a new version of the repo.

AlexDM0 commented 4 years ago

Except you lose the guarantee that all datapoints in the tree have the same dimension. This is an important point for the tree.

AlexDM0 commented 4 years ago

You could have a class which gets the dimensions on initialisation, which can wrap your methods, do checks and error reporting and use the dimensionOverride internally. That way users won't see it, its safe, but its not functional anymore... pick your poison :)

Bersaelor commented 4 years ago

I think I can just create a warning in the inserting method. Also, I mean, if you use the overrideDimensions variable wrongly, you'll also get into trouble.

AlexDM0 commented 4 years ago

Yes, but you can document that with a warning. If you use this, its up to you to have your stuff in order. A sort of advanced usage. Most users likely wont need it anyway and they're safe due to the static dimensions.