error is -nan

parashardhapola commented 3 years ago

Hi,

I get no values in the output file. I have pasted the log below. Do you have any idea why it shows nan for the error?

Number of vertices: 3564403
Embedding dimensions: 2
Rescaling parameter λ: 1
Early exag. multiplier α: 50
Maximum iterations: 200
Early exag. iterations: 100
Box side length h: 0.4
Drop edges originating from leaf nodes? 0
Number of processes: 1
35437 out of 3564403 nodes already stochastic
Skipping λ rescaling...
Nested dissection permutation...Permuting matrix
m = 3564403| n = 3564403| nnnz = 67496192
Working with double precision
Iteration 1: error is -nan
Iteration 50: error is -nan (50 iterations in 63.7212 seconds)
Iteration 100: error is -nan (50 iterations in 60.3744 seconds)
Iteration 150: error is -nan (50 iterations in 60.4697 seconds)
Iteration 199: error is -nan (50 iterations in 59.2422 seconds)
 --- Time spent in each module --- 

 Attractive forces: 81.2666 sec [35.8047%] |  Repulsive forces: 145.705 sec [64.1953%]
Saving embedding to: test123.out

Thanks, Parashar

fcdimitr commented 3 years ago

Hello,

I would have to look at the input, but a quick guess is that your graph has isolated nodes. Check whether you have zero row- or column-sums. If that is the case, an easy solution is to remove the empty rows/columns before running SG-t-SNE.

Let me know if this solves the issue, Dimitris

parashardhapola commented 3 years ago

Hi Dimitris,

Thanks for your quick response. I really think that this graph based tSNE can be very useful for the single-cell genomics community. I have tested it on multiple datasets but am having this issue with this particular dataset.

Since the file is rather large, I have deposited it here: https://osf.io/byu4f/. test123.mtx is the KNN graph file I'm trying to load, and test123.ini has the initial embedding. These files are generated using a pipeline through which I have processed multiple other datasets (up to 2M vertices), and those have performed very well without this issue.

I have tested this graph and it has no isolates or disconnected components. Also, the symmetrized matrix (the MTX file is not symmetrized because sgtsnepi does it internally) has no row or column with a zero sum.

I have tried other graphs (without symmetrization) that have zero column sums (no incoming edge, in terms of the graph) and even then they seem to work perfectly.

I'm sorry I'm unable to provide a smaller version of the MTX file, as I have no clue how to find the problematic vertices in this case.

Please let me know what other information I can provide.

Thanks a ton!

Parashar

parashardhapola commented 3 years ago

Here is a quick summary that I generated for a few datasets that I have tested.

    Dataset  nVertices     ZSR  ZSC   ZWE
   pbmc_10K       7399     576    0     0
   pbmc_68K      62238    5333    0     8
immune_600K     728870   35037    0    71
  neuron_1M    1162548   53705    0     9
    moca_2M    1819780   91465    0     7
   fetal_4M    3564746  660268    0  9644

ZSR: Number of rows where sum is 0 (zero indegree)
ZSC: Number of columns where sum is 0 (zero outdegree)
ZWE: Number of edges where weight is 0
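The three counts defined above can be reproduced with a few lines of Python. This is only a sketch: `graph_summary` is a hypothetical helper (not part of sgtsnepi), it assumes the edges are already loaded as 1-based `(row, col, weight)` triples with non-negative weights, and it is not the script used to build the table.

```python
def graph_summary(n, edges):
    """Count zero-sum rows (ZSR), zero-sum columns (ZSC), zero-weight edges (ZWE).

    n     -- number of vertices, numbered 1..n as in a MatrixMarket file
    edges -- iterable of (row, col, weight) triples (hypothetical input format)
    """
    row_sum = [0.0] * (n + 1)
    col_sum = [0.0] * (n + 1)
    zwe = 0
    for i, j, w in edges:
        row_sum[i] += w
        col_sum[j] += w
        if w == 0.0:
            zwe += 1  # an explicitly stored entry whose value is zero

    # With non-negative weights, a zero sum means every stored entry is zero
    # or the row/column has no stored entries at all.
    zsr = sum(1 for v in range(1, n + 1) if row_sum[v] == 0.0)
    zsc = sum(1 for v in range(1, n + 1) if col_sum[v] == 0.0)
    return {"ZSR": zsr, "ZSC": zsc, "ZWE": zwe}
```

Note that a zero-sum row counted here may be either empty or made up only of explicit zeros; the two cases are not distinguished by these counts.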

The last dataset, fetal_4M, is the problematic one. The graph I have shared comes from this dataset, after the removal of 3 small disconnected components. In practice, though, I have found that having multiple disconnected components is usually not a problem for sgtsnepi.

fcdimitr commented 3 years ago

Thank you for sharing all of these details, and for helping us resolve possible bugs in the SG-t-SNE software. I will try to reproduce the error using the data you uploaded. I will let you know whether this is an internal bug, or if there is an issue with the input data.

pitsianis commented 3 years ago

Dear @parashardhapola thank you for reaching out and sorry for the issue. We are working on optimizations that should improve SG-t-SNE even further. I would like to hear more about the applications you are using it for.

ailiop commented 3 years ago

@parashardhapola Thank you for reporting this issue. It is great to hear that you are finding SG-t-SNE useful in your work. I share @pitsianis's interest in learning some more about the applications in which you are using it.

I took a quick look and was able to replicate the issue. It appears your sparse adjacency matrix has 9644 explicit zeros, which in turn lead to a 0.0 / 0.0 --> NaN computation when making the matrix stochastic. You can verify the presence of explicit zeros by running the following in a bash shell:

grep -E '^[0-9]+ [0-9]+ 0\.0$' test123.mtx | wc -l

Removing the explicit zero entries from the .mtx file (and amending the number of nonzeros in its header accordingly) seems to fix the issue:

Number of vertices: 3564403
Embedding dimensions: 2
Rescaling parameter λ: 1
Early exag. multiplier α: 12
Maximum iterations: 1000
Early exag. iterations: 250
Box side length h: 0.7
Drop edges originating from leaf nodes? 0
Number of processes: 16
35437 out of 3564403 nodes already stochastic
Skipping λ rescaling...
Nested dissection permutation...DONE
m = 3564403 | n = 3564403 | nnz = 67478012
Working with double precision
Iteration 1: error is 184.456
Iteration 50: error is 184.456 (50 iterations in 23.5281 seconds)
Iteration 100: error is 184.456 (50 iterations in 26.154 seconds)
Iteration 150: error is 184.456 (50 iterations in 27.6392 seconds)
Iteration 200: error is 184.456 (50 iterations in 29.5949 seconds)
Iteration 250: error is 12.8864 (50 iterations in 25.4882 seconds)
^C

(In the output above, I was using default parameters from the demo and halted the execution manually.)
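The clean-up described above (dropping explicit zero entries and amending the nonzero count in the header) could be scripted roughly as follows. This is a sketch under assumptions: `strip_explicit_zeros` is a hypothetical helper, and the file is assumed to be a MatrixMarket coordinate file with real values, i.e. `%` comment lines, one `rows cols nnz` size line, then one `row col value` entry per line. It makes two passes so the file is never held in memory.

```python
def strip_explicit_zeros(src_path, dst_path):
    """Copy a MatrixMarket coordinate file without its zero-valued entries.

    Returns the number of entries removed. The size line of the copy is
    rewritten so that its nonzero count matches the entries actually kept.
    """
    def is_skippable(line):
        # Comment lines start with '%'; blank lines carry no entry.
        return line.startswith("%") or not line.strip()

    # Pass 1: count the entries that survive the filter.
    kept = 0
    with open(src_path) as src:
        size_seen = False
        for line in src:
            if is_skippable(line):
                continue
            if not size_seen:
                size_seen = True  # first data line is "rows cols nnz"
                continue
            if float(line.split()[2]) != 0.0:
                kept += 1

    # Pass 2: write the header with the amended count, then the kept entries.
    removed = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        size_seen = False
        for line in src:
            if is_skippable(line):
                dst.write(line)
                continue
            if not size_seen:
                rows, cols, _ = line.split()
                dst.write(f"{rows} {cols} {kept}\n")
                size_seen = True
                continue
            if float(line.split()[2]) != 0.0:
                dst.write(line)
            else:
                removed += 1
    return removed
```

For example, `strip_explicit_zeros("test123.mtx", "test123_nozeros.mtx")` would write a filtered copy; the output file name here is made up for illustration.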

pitsianis commented 3 years ago

Great job @ailiop. We should check for this and either provide an error message, or issue a warning and recover. In the latter case, we can simply place all isolated nodes as a constellation (regular polygon vertices).

Side observation: we should also check why the speedup is so small with 16 processes compared to 1. How many physical cores did you use?

parashardhapola commented 3 years ago

Hi @ailiop. Thanks for debugging this. The 0 values in the sparse matrix were indeed the issue. However, I'm still unable to understand why these zero values are not an issue in the other datasets. As you can see in the table I shared earlier, the ZWE column shows multiple zero values for many of the datasets. Any ideas? If it will be helpful then I can share those MTX files as well.

Another point: Simply removing the nodes might not be a great idea. I would rather reset 0 values to a minimum non-zero value in the matrix. What do you think about it? My concern is the removal of vertices might disrupt the index of cells and cause downstream issues. Or should this problem be handled upstream when creating the KNN matrix?
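The alternative raised above, resetting zero weights to the smallest nonzero weight instead of removing anything, might look like the sketch below. `clamp_zero_weights` is a hypothetical helper operating on `(row, col, weight)` triples, and it assumes all weights are non-negative. Whether this is preferable to dropping the zero-weight edges is exactly the open question in this comment.

```python
def clamp_zero_weights(edges):
    """Replace every zero weight with the smallest positive weight present.

    edges -- list of (row, col, weight) triples (hypothetical input format)
    Vertex indices are left untouched, so downstream cell indexing is unchanged.
    """
    positive = [w for _, _, w in edges if w > 0.0]
    if not positive:
        raise ValueError("no positive weight available to use as a floor")
    floor = min(positive)
    return [(i, j, w if w != 0.0 else floor) for i, j, w in edges]
```

Because only values change and no entry is added or removed, the number of stored entries and the MTX header stay as they were.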

Best regards, Parashar

ailiop commented 3 years ago

@parashardhapola Sorry, my initial response was a little hasty. You are correct, a zero-weight matrix element does not in itself lead to NaN values. If I am not mistaken, the necessary condition for this behavior is having an explicit zero-weight element in a zero-sum row.
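To make that condition concrete, here is a small sketch that finds rows which have stored entries but a zero sum, and shows how normalizing such a row yields NaN. Both function names are hypothetical, weights are assumed non-negative, and the code follows the row-wise wording used above; it is not taken from the SG-t-SNE source, so whether the library normalizes by row or by column is not something this example confirms.

```python
import math


def rows_with_only_explicit_zeros(edges):
    """Return the rows that have at least one stored entry but sum to zero."""
    totals = {}
    for i, _, w in edges:
        totals[i] = totals.get(i, 0.0) + w
    # Rows without any stored entry never appear in `totals`, so they are
    # not reported: there is nothing to divide for them.
    return sorted(i for i, total in totals.items() if total == 0.0)


def normalize_rows(edges):
    """Divide each stored weight by its row sum (row-stochastic scaling)."""
    totals = {}
    for i, _, w in edges:
        totals[i] = totals.get(i, 0.0) + w
    # Python raises ZeroDivisionError on 0.0 / 0.0, so the IEEE 754 result
    # that compiled floating-point code would produce (NaN) is spelled out.
    return [(i, j, w / totals[i] if totals[i] != 0.0 else math.nan)
            for i, j, w in edges]
```

A zero-weight edge in a row that also has positive weights simply normalizes to 0, which would be consistent with the other datasets in the table running without NaN despite a nonzero ZWE.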

I agree with you about removing nodes. Indeed, the "fix" I tested yesterday only involved removing matrix elements (i.e., connections), not rows/columns (i.e., nodes). You could consider that as being part of kNN matrix formation. I expect this wouldn't lead to downstream issues, but of course that depends on your downstream application.

That said, I also agree with @pitsianis: this is something we could check for and address in a pre-processing step. We may update the code soon if we can find the time.

ailiop commented 3 years ago

@pitsianis My run was on my laptop, which has an 8-core Intel i7-10875H with hyperthreading. I would caution that I was running all sorts of other programs in the background. I did, however, notice some possible issues with scalability while testing the port to OpenCilk (yet to be merged). I wanted to take a closer look using the Cilkscale tool but haven't found the time yet.

parashardhapola commented 3 years ago

Thanks, @ailiop, and others for your help. I will now close this issue.

@pitsianis: I will drop you an email at the address I found here: http://users.auth.gr/pitsiani/#contact. There I will be able to describe my application in proper detail.

A quick note to others who land on this issue in the future: you can resolve it by removing the edges with zero weight from the matrix. Please see @ailiop's replies above for why zero-weight edges can sometimes lead to this issue.

Best regards, Parashar

fcdimitr / sgtsnepi

error is -nan #4