elbamos / largeVis

An implementation of the largeVis algorithm for visualizing large, high-dimensional datasets, for R
340 stars 63 forks source link

searchReverse not correctly finding mutual nearest neighbors #57

Closed jlmelville closed 6 years ago

jlmelville commented 6 years ago

Hi. First of all, thank you for your work on the package.

In edgeweights.cpp, I think the searchReverse routine is supposed to find mutual nearest neighbors found during the nearest neighbors routine. Later in the run method, non-mutual nearest neighbors are created with an edge_weight of 0 so that the wij matrix is symmetrized correctly.

I've found that in most cases, searchReverse doesn't identify already existing mutual nearest neighbors. In the getWIJ method, the locations matrix ends up with lots of duplicates, which is why wij has to be constructed with the add_values parameter set to true.

I think the problem is in the inner loop declaration of the searchReverse method:

  void searchReverse(vertexidxtype id) {
    edgeidxtype p, q;
    for (p = head[id]; p >= 0; p = next[p]) {
      for (q = head[id]; q >= 0; q = next[q]) {
        if (edge_to[q] == id) break;
      }
      reverse[p] = q;
    }
  }

I think the inner loop ought to be:

      for (q = head[edge_to[p]]; q >= 0; q = next[q]) {

i.e. we ought to be searching the other end of the edges incident to id. This doesn't seem to affect any of the unit test results.

elbamos commented 6 years ago

You may be right. When I implemented the wij functionality from scratch, it didn’t match the wij’s generated by the reference implementation. In going through the ref implementation code, I was never able to figure out why it would produce the math described in the paper. I ultimately ended up using that part of the reference implementation code with very few modifications, because I felt it was better to match the reference implementation than the paper.

How does changing it in the way you describe affect the quality of the visualizations?

On Jun 2, 2018, at 6:05 PM, James Melville notifications@github.com wrote:

Hi. First of all, thank you for your work on the package.

In edgeweights.cpp, I think the searchReverse routine is supposed to find mutual nearest neighbors found during the nearest neighbors routine. Later in the run method, non-mutual nearest neighbors are created with an edge_weight of 0 so that the wij matrix is symmetrized correctly.

I've found that in most cases, searchReverse doesn't identify already existing mutual nearest neighbors. In the getWIJ method, the locations matrix ends up with lots of duplicates, which is why wij has to be constructed with the add_values parameter set to true.

I think the problem is in the inner loop declaration of the searchReverse method:

void searchReverse(vertexidxtype id) { edgeidxtype p, q; for (p = head[id]; p >= 0; p = next[p]) { for (q = head[id]; q >= 0; q = next[q]) { if (edge_to[q] == id) break; } reverse[p] = q; } } I think the inner loop ought to be:

  for (q = head[edge_to[p]]; q >= 0; q = next[q]) {

i.e. we ought to be searching the other end of the edges incident to id. This doesn't seem to affect any of the unit test results.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

jlmelville commented 6 years ago

FWIW, this seems to have been a bug in the reference implementation: https://github.com/lferry007/LargeVis/commit/919af30bd2d400c822262c348e9c014729b3f446#diff-f7a8edb67de3c04e1ec98f8469655da7

I will get back to you on the effect on the visualization of the change, although the largest datasets I will check will be MNIST and Fashion-MNIST.

jlmelville commented 6 years ago

Actually, seems like this is a bit of a non-issue: in all cases I looked at, the wij matrices are identical. Probably the worst that happens is that edge_weights are repeatedly averaged and the locations matrix used to initialize wij is unnecessarily large, but neither has an overall effect, except a potential increase in runtime and RAM usage during buildWijMatrix, but I've not measured if it's meaningful.

Might still be worth changing to be in line with the reference implementation, but closing anyway.