Closed parashardhapola closed 3 years ago
Hello,
I would have to look at the input, but a quick guess is that your graph has isolated nodes. Check whether you have zero row- or column-sums.
If that is the case, an easy solution is to remove the empty rows/columns before running SG-t-SNE.
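The check suggested here can be sketched as follows. This is a minimal illustration, not part of SG-t-SNE: the function name `drop_isolated_nodes` and the COO-style inputs (`rows`, `cols`, `vals`, 0-based) are assumptions, as is the choice to treat a node as isolated only when both its row sum and its column sum are zero, with non-negative weights.

```python
import numpy as np

def drop_isolated_nodes(rows, cols, vals, n):
    """Remove nodes with zero row sum AND zero column sum from a COO graph.

    Hypothetical helper: assumes 0-based indices, non-negative weights,
    and a square n-by-n adjacency matrix.
    """
    rows = np.asarray(rows)
    cols = np.asarray(cols)
    vals = np.asarray(vals, dtype=float)

    row_sum = np.bincount(rows, weights=vals, minlength=n)
    col_sum = np.bincount(cols, weights=vals, minlength=n)

    # A node is kept if it has any outgoing or incoming weight.
    keep = (row_sum != 0) | (col_sum != 0)
    kept_ids = np.flatnonzero(keep)      # original indices of surviving nodes
    new_index = np.cumsum(keep) - 1      # old index -> compacted index

    # Drop edges that touch a removed node (these can only be zero-weight edges).
    mask = keep[rows] & keep[cols]
    return new_index[rows[mask]], new_index[cols[mask]], vals[mask], kept_ids
```

The returned `kept_ids` array maps each compacted index back to the original node index, so embedding coordinates can be matched to the original vertices afterwards.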
Let me know if this solves the issue, Dimitris
Hi Dimitris,
Thanks for your quick response. I really think that this graph-based t-SNE can be very useful for the single-cell genomics community. I have tested it on multiple datasets, but I am having this issue with one particular dataset.
Since the file is rather large, I have deposited it here: https://osf.io/byu4f/
test123.mtx is the KNN graph file I'm trying to load, and test123.ini has the initial embedding.
These files are generated using a pipeline through which I have processed multiple other datasets (up to 2M vertices) and they have performed very well without this issue.
I have tested this graph, and it has no isolated vertices or disconnected components. Also, the symmetrized matrix (the MTX file is not symmetrized because sgtsnepi does that internally) has no row or column with a zero sum.
I have tried other graphs (without symmetrization) that have zero column sums (no incoming edge, in terms of the graph) and even then they seem to work perfectly.
I'm sorry I'm unable to provide a smaller version of the MTX file, as I have no clue how to find the problematic vertices in this case.
Please let me know what other information I can provide.
Thanks a ton!
Parashar
Here is a quick summary that I generated for a few datasets that I have tested.
Dataset      nVertices  ZSR     ZSC  ZWE
pbmc_10K     7399       576     0    0
pbmc_68K     62238      5333    0    8
immune_600K  728870     35037   0    71
neuron_1M    1162548    53705   0    9
moca_2M      1819780    91465   0    7
fetal_4M     3564746    660268  0    9644
ZSR: Number of rows where sum is 0 (zero indegree)
ZSC: Number of columns where sum is 0 (zero outdegree)
ZWE: Number of edges where weight is 0
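For readers who want to compute the same three counts on their own graph, here is a minimal sketch. It is not the script used to produce the table above: the function name `summarize_mtx` is hypothetical, and it assumes a Matrix Market file in coordinate format with real weights and general (non-symmetric) storage. It streams the file line by line, so the full matrix is never held in memory.

```python
def summarize_mtx(path):
    """Count zero-sum rows, zero-sum columns and zero-weight edges in an .mtx file."""
    with open(path) as fh:
        # Skip the banner and comment lines, which start with '%'.
        line = fh.readline()
        while line.startswith("%"):
            line = fh.readline()
        n_rows, n_cols, _nnz = (int(tok) for tok in line.split())

        row_sum = [0.0] * n_rows
        col_sum = [0.0] * n_cols
        zero_weight_edges = 0

        for line in fh:
            parts = line.split()
            if not parts:
                continue  # tolerate blank lines
            i, j, w = int(parts[0]) - 1, int(parts[1]) - 1, float(parts[2])  # 1-based -> 0-based
            row_sum[i] += w
            col_sum[j] += w
            if w == 0.0:
                zero_weight_edges += 1

    return {
        "nVertices": n_rows,
        "ZSR": sum(1 for s in row_sum if s == 0.0),
        "ZSC": sum(1 for s in col_sum if s == 0.0),
        "ZWE": zero_weight_edges,
    }
```

Calling `summarize_mtx("test123.mtx")` would return a dictionary with the same column names as the table.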
The last dataset, fetal_4M, is the problematic one. The graph I shared comes from this dataset, after the removal of 3 small disconnected components. In practice, though, I have found that having multiple disconnected components is usually not a problem for sgtsnepi.
Thank you for sharing all of these details, and for helping us resolve possible bugs in the SG-t-SNE software. I will try to reproduce the error using the data you uploaded. I will let you know whether this is an internal bug, or if there is an issue with the input data.
Dear @parashardhapola thank you for reaching out and sorry for the issue. We are working on optimizations that should improve SG-t-SNE even further. I would like to hear more about the applications you are using it for.
@parashardhapola Thank you for reporting this issue. It is great to hear that you are finding SG-t-SNE useful in your work. I share @pitsianis's interest in learning some more about the applications in which you are using it.
I took a quick look and was able to replicate the issue. It appears your sparse adjacency matrix has 9644 explicit zeros, which in turn lead to a 0.0 / 0.0 --> NaN computation when making the matrix stochastic. You can verify the presence of explicit zeros by running the following in a bash shell:
grep -E '^[0-9]+ [0-9]+ 0\.0$' test123.mtx | wc -l
Removing the explicit zero entries from the .mtx file (and amending the number of nonzeros in its header accordingly) seems to fix the issue:
Number of vertices: 3564403
Embedding dimensions: 2
Rescaling parameter λ: 1
Early exag. multiplier α: 12
Maximum iterations: 1000
Early exag. iterations: 250
Box side length h: 0.7
Drop edges originating from leaf nodes? 0
Number of processes: 16
35437 out of 3564403 nodes already stochastic
Skipping λ rescaling...
Nested dissection permutation...DONE
m = 3564403 | n = 3564403 | nnz = 67478012
Working with double precision
Iteration 1: error is 184.456
Iteration 50: error is 184.456 (50 iterations in 23.5281 seconds)
Iteration 100: error is 184.456 (50 iterations in 26.154 seconds)
Iteration 150: error is 184.456 (50 iterations in 27.6392 seconds)
Iteration 200: error is 184.456 (50 iterations in 29.5949 seconds)
Iteration 250: error is 12.8864 (50 iterations in 25.4882 seconds)
^C
(In the output above, I was using default parameters from the demo and halted the execution manually.)
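The clean-up step described above (dropping explicit zeros and amending the nonzero count in the header) could be scripted roughly as follows. This is a sketch under assumptions, not the procedure actually used in this thread: the function name `strip_explicit_zeros` and the output file name are hypothetical, and the input is assumed to be a coordinate-format Matrix Market file with one `row col weight` entry per line.

```python
def strip_explicit_zeros(src_path, dst_path):
    """Copy a coordinate-format .mtx file, dropping entries whose weight is 0."""
    # Pass 1: count the entries that will survive, so the header can be amended.
    kept = 0
    with open(src_path) as src:
        size_line_seen = False
        for line in src:
            if line.startswith("%") or not line.strip():
                continue  # banner, comments, blank lines
            if not size_line_seen:
                size_line_seen = True  # the "rows cols nnz" line, not an entry
            elif float(line.split()[2]) != 0.0:
                kept += 1

    # Pass 2: copy comments unchanged, write the corrected size line,
    # then copy only the entries with a nonzero weight.
    with open(src_path) as src, open(dst_path, "w") as dst:
        size_line_seen = False
        for line in src:
            if line.startswith("%"):
                dst.write(line)
            elif not line.strip():
                continue
            elif not size_line_seen:
                n_rows, n_cols, _old_nnz = line.split()
                dst.write(f"{n_rows} {n_cols} {kept}\n")
                size_line_seen = True
            elif float(line.split()[2]) != 0.0:
                dst.write(line)
    return kept
```

For example, `strip_explicit_zeros("test123.mtx", "test123_nozeros.mtx")` writes the filtered copy and returns the new number of stored entries. Only edges are removed, so the matrix dimensions and vertex indices stay the same.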
Great job @ailiop. We should check for this case and either report an error or issue a warning and recover. We could then simply place all isolated nodes as a constellation (regular polygon vertices).
Side observation: we should also check why the speedup is so small with 16 processes compared to 1. How many physical cores did you use?
Hi @ailiop. Thanks for debugging this. The 0 values in the sparse matrix were indeed the issue. However, I still don't understand why these zero values are not an issue in the other datasets. As you can see in the table I shared earlier, the ZWE column shows zero-weight edges in most of the datasets. Any ideas? If it would be helpful, I can share those MTX files as well.
Another point: Simply removing the nodes might not be a great idea. I would rather reset 0 values to a minimum non-zero value in the matrix. What do you think about it? My concern is the removal of vertices might disrupt the index of cells and cause downstream issues. Or should this problem be handled upstream when creating the KNN matrix?
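The alternative proposed here, resetting zero weights to the smallest nonzero weight, might look like the sketch below. It only illustrates the suggestion and is not something SG-t-SNE does: the function name `clamp_zero_weights` is hypothetical, and the weights are assumed to be non-negative and available as a flat array (for example, the value column of the MTX file).

```python
import numpy as np

def clamp_zero_weights(vals):
    """Replace zero weights with the smallest positive weight present."""
    vals = np.asarray(vals, dtype=float).copy()  # leave the caller's array untouched
    positive = vals[vals > 0]
    if positive.size:                            # nothing to clamp to if all weights are 0
        vals[vals == 0] = positive.min()
    return vals
```

This keeps every edge and every vertex index in place, at the cost of assigning a small but nonzero similarity to pairs whose computed weight was 0.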
Best regards, Parashar
@parashardhapola Sorry, my initial response was a little hasty. You are correct, a zero-weight matrix element does not in itself lead to NaN values. If I am not mistaken, the necessary condition for this behavior is having an explicit zero-weight element in a zero-sum row.
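To make the condition concrete, here is a small numerical illustration. It is not the sgtsnepi normalization code: the function name `row_normalize` and the CSR-style inputs are assumptions, and the actual implementation may normalize along the other axis. The point is only that a stored zero in a row whose sum is zero produces 0.0 / 0.0, whereas a row with no stored entries has nothing to divide.

```python
import numpy as np

def row_normalize(indptr, data):
    """Divide each stored entry by the sum of its row (CSR layout)."""
    out = np.array(data, dtype=float)
    for r in range(len(indptr) - 1):
        lo, hi = indptr[r], indptr[r + 1]
        # Suppress the floating-point warning so that 0.0 / 0.0 quietly yields NaN.
        with np.errstate(invalid="ignore", divide="ignore"):
            out[lo:hi] = out[lo:hi] / out[lo:hi].sum()
    return out

# Row 0: entries 0.5 and an explicit 0.0 -> sum is 0.5, no problem.
# Row 1: no stored entries at all        -> nothing is divided.
# Row 2: a single explicit 0.0           -> 0.0 / 0.0 = NaN.
indptr = [0, 2, 2, 3]
data = [0.5, 0.0, 0.0]
print(row_normalize(indptr, data))
```

Only the third row yields a NaN, which would be consistent with the other datasets running fine despite containing some zero-weight edges.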
I agree with you about removing nodes. Indeed, the "fix" I tested yesterday only involved removing matrix elements (i.e., connections), not rows/columns (i.e., nodes). You could consider that as being part of kNN matrix formation. I expect this wouldn't lead to downstream issues, but of course that depends on your downstream application.
That said, I also agree with @pitsianis: this is something we could check for and address in a pre-processing step. We may update the code soon if we can find the time.
@pitsianis My run was on my laptop, which has an 8-core Intel i7-10875H with hyperthreading. I would caution that I was running all sorts of other programs in the background. I did, however, notice some possible issues with scalability while testing the port to OpenCilk (yet to be merged). I wanted to take a closer look using the Cilkscale tool but haven't found the time yet.
Thanks, @ailiop, and others for your help. I will now close this issue.
@pitsianis: I will drop you an email at the address I found here: http://users.auth.gr/pitsiani/#contact There I will be able to describe my application in proper detail.
A quick note to others who land on this issue in the future: you can resolve it by removing edges with zero weights from the matrix. Please see @ailiop's reply above to learn why zero-weight edges can sometimes lead to this issue.
Best regards, Parashar
Hi,
I get no values in the output file. I have pasted the log below. Do you have any idea why it shows nan for the error? Thanks, Parashar