Bersaelor / KDTree

Swift implementation of a k-dimensional binary space partitioning tree.
MIT License
156 stars 27 forks source link

Known limitations of KDTree? #44

Open robert-dodier opened 5 years ago

robert-dodier commented 5 years ago

Hi, I'm working with about 13,000 points in 34 dimensions. I'm loading the data from a file and then creating a KDTree via KDTree(values: a) where a is my array of 34-element vectors (instances of a class which I created).

Is there any reason to think that shouldn't work? I am getting the following after reading the data and then trying to construct the KDTree:

Execution interrupted. Enter code to recover and continue.
Enter LLDB commands to investigate (type :help for assistance.)
Process 26088 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x7ffeef3ffff8)
    frame #0: 0x00007fff5562dcf8 libcorecrypto.dylib`ccaes_vng_ctr_crypt + 4
libcorecrypto.dylib`ccaes_vng_ctr_crypt:
->  0x7fff5562dcf8 <+4>:  pushq  %r15
    0x7fff5562dcfa <+6>:  pushq  %r14
    0x7fff5562dcfc <+8>:  pushq  %r13
    0x7fff5562dcfe <+10>: pushq  %r12
Target 0: (repl_swift) stopped.

It's not clear to me what's going on -- I assume that it's an out of memory error? or something else? (How would I figure out what kind of error it is?)

What is the most expensive part of constructing a KDTree -- where is the bottleneck? Does KDTree create any auxiliary indices or other data structures which might be large?

Thanks in advance for any insights you can offer.

Robert Dodier

PS. I am working in the Swift repl with Swift 5.0 (current snapshot build). I disabled printing out the values of items declared via :set set print-decls false (so it isn't the case that the KDTree gets constructed and then the repl fails while trying to print a representation of it).

Bersaelor commented 5 years ago

Mhmm, I haven't seen this error before. Unfortunately it is a rather cryptic one and KDTree doesn't even have a dependency on libcorecrypto. Is it possible that REPL fails due to heavy load in KDTree code and then the error occurs during some regular execution in libcrypto?

That said there are the LoadTest and PerformanceTest files where we routinely run the code on test sets of 500000 points to make sure it's still working.

Could you try the same thing you are doing in Swift4 ? Maybe what you discovered is a general bug in Swift 5 , since it's not stable yet?

robert-dodier commented 5 years ago

Hi, thanks for your reply. Looks like the error is not in the crypto library, that was just incidental. The problem appears to be that the program runs out of memory when constructing the KDTree instance.

I ran my example program in Xcode 10.0 + Swift 4.2 and after loading the data (about 14,000 items with 34 dimensions each) the program was using 60 MB. After entering KDTree.init, memory use increased greatly, and Xcode stopped the program when it reach about 1000 MB. At the moment it stopped, KDTree.init was in partitionLomuto, at the place where randomIndex is created -- the random number generator is apparently in libcrypto, so that's connection with the crypto stuff. Of course that's just incidental. It appears that the real problem is that building the KDTree requires a very large amount of memory.

I wonder, what is known about how memory use varies with the number of cases and the number of dimensions in each case? Is there a strategy to limit the amount of memory used? (E.g. limiting the depth of the tree or something like that.)

Bersaelor commented 5 years ago

That is very interesting indeed @robert-dodier , I haven't tested it that much with dim > 10 thus far.

Would you be able to distill your example into a small test-case, that we could add as a new test file to the test suite? Then we can play around with the memory usage?

@hsnetzer implemented the partitionLomuto, maybe we can look at the code to see whether there are cases where we could nudge ARC to release some intermediate variables earlier?

Also, @robert-dodier , could you run a test with the algorithm the way it was before we merged https://github.com/Bersaelor/KDTree/pull/38 . That PR by @hsnetzer improved performance. In my experience CPU usage and memory usage optimizations are often tradeoffs between each other.

hsnetzer commented 5 years ago

Certain data sets cause very deep recursion. Maybe you have lots of duplicate points. If that’s the problem then the earlier sorting algorithm should also fail. All I can think of at the moment. Sorry this is happening in my code.

robert-dodier commented 5 years ago

Hi, thanks for your replies. I will try to post a condensed version of the problem using some made-up data. The data are a little funny in the sense that for many vectors, there are a number of elements in the vector which are equal to 0 -- it would be interesting to know if that changes things. Or perhaps it's just the number of dimensions (i.e., 34). It might be a couple of days before I can return to this topic.

robert-dodier commented 5 years ago

I've attached 3 files that might help with testing the KDTree code to see what is going on. It seems like there are some different possibilities: maybe the number of cases + dimensions is the problem? maybe the large number of identical (zero) values? maybe the small number of distinct nonzero values? I've created these data sets to embody these different characteristics.

Unfortunately I can't devote more time to these questions right now, I might be able to come back to it next month. Perhaps this is useful in some way all the same.

generated_different.csv: same number of cases, same number of dimensions, same min and max values, random values between min and max

generated_similar.csv: same number of cases, same number of dimensions, same min and max values, same proportion of 0 values, random values between min and max

generated_very_similar.csv: same number of cases, same number of dimensions, same min and max values, same proportion of 0 values, random values chosen from the distinct values of each dimension

generated_different.csv.gz generated_similar.csv.gz generated_very_similar.csv.gz