elbamos / largeVis

An implementation of the largeVis algorithm for visualizing large, high-dimensional datasets, for R

meaning of tree failure. #23

Closed sparktsao closed 7 years ago

sparktsao commented 7 years ago

Thank you for providing this great work. I have a question: some datasets lead to a "tree failure" exception in the function "copyHeapToMatrix". What does it mean, and how can I avoid it when preparing the dataset?

**********************************************
terminate called after throwing an instance of 'Rcpp::exception'
  what():  Tree failure.
Aborted
elbamos commented 7 years ago

That's extremely odd; that error-check code is there to test the internal consistency of the implementation. Can you provide your data so I can take a look?


elbamos commented 7 years ago

What it means, essentially, is that during the tree-search part of the neighbor search algorithm, it found zero neighbors for a point. It should not be possible for that to happen. I would really appreciate knowing the details of your dataset and the parameters you were using. I'm guessing you found some sort of edge case, and I should add a check for it.

sparktsao commented 7 years ago

Thank you for the explanation.

I ran largeVis on my dataset (1600 dimensions, ~900K records) successfully. To understand the data better at low dimensionality, I used feature reduction to shrink the 1600 dimensions, and I hit the tree failure whenever the reduced dimensionality was below about 30 (for example, it failed at 2, 13, and 30). I wonder whether this is a corner case when the dimensionality is too small: maybe too many data points fall into the same leaves, which makes the random projection fail?
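Roughly what I am doing, as a simplified sketch (prcomp and the largeVis arguments below are stand-ins for my real pipeline, not the exact code):

library(largeVis)

# x: my data, roughly 900K rows by 1600 columns (one observation per row)
pca <- prcomp(x, rank. = 30)      # reduce 1600 dimensions to 30 (or 13, or 2, ...)
reduced <- t(pca$x)               # transpose: largeVis takes observations as columns

vis <- largeVis(reduced, K = 50)  # K = 50 is only an example value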

Another point: my working machine is not updated to the latest version. I stayed on the Aug 4, 2016 commit (654da27), because it seems to handle more data than the Aug 18, 2016 commit (580b2d2).

Commit 580b2d2 runs out of memory at about 70% of randomProjectionTreeSearch; it seems to use more memory than 654da27, which handles all the data (1600 × 900K) smoothly. Although you advised me to use gcc 4.9.3, which let me build 580b2d2 successfully, I went back to the old version because of the memory issue.

I will try updating to the latest version to see whether the exception appears again. Thank you so much again!

elbamos commented 7 years ago

Can you elaborate on the memory issue? And is it possible to see this data?

The relevant code in the neighbor search hasn't changed in quite some time, so memory usage in that phase should be unchanged. And reducing the dimensionality to ~30 shouldn't affect the tree search at all. (What might affect it are NAs and NaNs, though.)
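If you want to rule that out quickly, a base-R check on the matrix you pass in (called x here) would be something like:

# count non-finite entries (NA, NaN, Inf) before calling largeVis
bad <- which(!is.finite(x), arr.ind = TRUE)
if (nrow(bad) > 0) {
  message(nrow(bad), " non-finite entries found; first few (row, column) positions:")
  print(head(bad))
}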

Thank you for reporting this! I'd really appreciate your help nailing it down.


sparktsao commented 7 years ago

https://github.com/sparktsao/casetreefail
I can reproduce the error message on two different AWS EC2 instances, but strangely it does not reproduce on every run. Maybe there is some random behavior in the function.

elbamos commented 7 years ago

The function does have random behavior as part of the algorithm, but that error should never occur. Thank you for posting the data - I will take a look tonight.


elbamos commented 7 years ago

Wait a sec... Your log seems to show that the current version performs properly; you're only getting the error on the old release, 0.1.5. Is that right?


sparktsao commented 7 years ago

[screenshot] 0.1.6?

elbamos commented 7 years ago

But why not use the current version?


sparktsao commented 7 years ago

And yes, the latest version only outputs a warning message, without the 'tree failure'. I chose to stay on 0.1.6 here because it can handle 1600 × 900K smoothly. The program got "killed" when running the large dataset (1600 × 900K) with the latest version, probably because it ran out of memory. It might not really be an issue, since it could likely be solved by adding memory; sorry, I haven't retried that yet.

elbamos commented 7 years ago

Can you show me the data where the current version died? It should not be less memory efficient at all.


elbamos commented 7 years ago

Actually, one thing that did change after 0.1.6 was the default parameters. So what may be happening is that it's trying to use default settings, probably for tree_threshold, that use more RAM.

The reason for the change is to emulate the settings of the paper authors' reference code.

Try tamping down the tree threshold. They set it way too big for high-dimensional data.
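Something along these lines, for example; the specific numbers below are just a starting point to experiment with, not a recommendation:

# the post-0.1.6 defaults tie tree_threshold to the dimensionality, which gets
# very large on high-dimensional data; a much smaller fixed value is worth trying
vis <- largeVis(x,
                K = 50,
                n_trees = 50,
                tree_threshold = 128,  # starting point; tune down if memory is still tight
                max_iter = 1)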


elbamos commented 7 years ago

@sparktsao I just tried it, and with the default settings, it ran and completed on my machine in less than 3 seconds. It did not take long enough for me to even measure how much RAM was being used. I tried it up to K = 100.

(I do need to adjust that progress bar a bit...)

The reason why you're getting fewer neighbors found than you're looking for, by the way, is that approximately 1/3 of your dataset are duplicates.

> str(test)
 int [1:2, 1:25000] 28538 303513 174704 343275 52760 269921 183379 112205 52388 277515 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:2] "x" "y"
  ..$ : NULL
> test <- data.frame(t(test))
> bob <- duplicated(test)
> sum(bob)
[1] 8295
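
If those duplicates aren't intentional, dropping them before the call avoids that warning; for example (K = 50 is arbitrary here):

> test_unique <- unique(test)                         # drops the 8295 duplicated rows
> vis <- largeVis(t(as.matrix(test_unique)), K = 50)  # observations as columns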

Is there anything else you can do to help me reproduce the issue you're having?

sparktsao commented 7 years ago

The data I prepared is the minimum set with which I can reproduce the tree failure case in build 654da27, not the memory issue. If you are using the latest version of the code, it should be fine, without the tree failure exception or the memory issue.

The default-setting change might explain why I hit the memory issue. I will now try the tree threshold parameter to find a good configuration for my large dataset, and I will report back if I run into memory problems again.

Thanks so much for helping again.

elbamos commented 7 years ago

OK, I'm going to close this issue.

Regarding the tree threshold, I suggest you look at the benchmarks vignette. It includes a detailed discussion of how changing the threshold, the number of trees, and the number of exploration iterations affects performance, memory usage, and accuracy. It is intended to be helpful to folks dealing with issues like yours -- if it doesn't get you to where you need to go, let me know and I'll try to improve it.
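From an R session, that would look something like the following (the vignette name and the exact randomProjectionTreeSearch arguments are from memory, so double-check against the installed docs):

vignette("benchmarks", package = "largeVis")

# then experiment along the axes it covers, e.g. a quick tree_threshold sweep:
for (tt in c(32, 64, 128)) {
  neighbors <- randomProjectionTreeSearch(x, K = 50, n_trees = 50,
                                          tree_threshold = tt, max_iter = 1)
  # compare run time, peak memory, and neighbor quality across settings
}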