bayesmech / research-gnn-pooling

Research on Hierarchical Pooling for Graph Neural Networks
MIT License

Evaluate the advantages of using Gumbel Softmax in pooling #3

Closed Jai2500 closed 2 years ago

Jai2500 commented 2 years ago

We need to fully understand what we are gaining from sampling. Broadly, it's the ability to have fewer edges while preserving crisp graph structures.

AnimeshSinha1309 commented 2 years ago

Minutes of Meeting

Learnings:

  1. Gumbel Softmax, as a function, takes a categorical distribution as input and outputs a sample from that distribution. It does not care how the categorical distribution was obtained (see the sketch after this list).
  2. Compared to DiffPool, our method is memory efficient, and the goal is to reduce the space complexity. It is a mathematically motivated solution that is almost equal to DiffPool in expectation.
  3. Problem: In DiffPool, when we cluster, the edge list becomes $K \times K$ because there is now an edge between every pair of clusters. Soln: In our case, the number of edges CANNOT exceed the initial count. (Prove this!)
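
A minimal PyTorch sketch of points 1 and 3, using made-up sizes and a toy graph (none of this is the repository's actual code): `torch.nn.functional.gumbel_softmax` only needs the (unnormalised log-)probabilities of the categorical distribution, and with `hard=True` it yields a binary assignment matrix $S$ whose DiffPool-style coarsened adjacency $S^\top A S$ has at most as many non-zero entries as the original edge list.

```python
import torch
import torch.nn.functional as F

N, K = 6, 3                                       # hypothetical sizes: 6 nodes, 3 clusters
logits = torch.randn(N, K, requires_grad=True)    # stand-in for any GNN-produced cluster scores

# hard=True returns a one-hot row per node (straight-through estimator), so S is
# binary in the forward pass while gradients still flow through the soft sample.
S = F.gumbel_softmax(logits, tau=1.0, hard=True)  # (N, K)

# Toy adjacency with 5 directed edges on 6 nodes.
edge_index = torch.tensor([[0, 1, 2, 3, 4],
                           [1, 2, 3, 4, 5]])
A = torch.zeros(N, N)
A[edge_index[0], edge_index[1]] = 1.0

A_pool = S.t() @ A @ S                            # (K, K) pooled cluster adjacency
# Every original edge maps to exactly one (cluster_i, cluster_j) pair, so the
# number of non-zero entries of A_pool cannot exceed the original edge count.
print(A_pool.count_nonzero().item() <= edge_index.shape[1])  # True
```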

Potential Problems:

  1. We will still need an $N \times K$ matrix for the probabilities (the one we apply the softmax to). Potential soln: Use sparse activations or Switch Transformers, or otherwise reduce this $N \times K$ matrix to something smaller.
  2. We will need on the order of $1/p_i^N$ runs to approximate DiffPool in expectation. Thus, for multiple pooling stages, we will need $\frac{1}{p_i^N} \cdot \frac{1}{p_j^K} \cdot \frac{1}{p_k^M} \cdots$ runs if we cluster from $N$ to $K$ to $M$ to $\ldots$ (Prove this!) Potential soln: This may actually be okay, because in some cases $p_i$ will be very small and its contribution barely affects DiffPool anyway, so we may not need that many runs to approximate the true value well as long as we cover the highly influential assignments (see the sketch after this list).
  3. Convergence may be an issue because, to get gradients equivalent to DiffPool's, we may need more runs (will we? since the backward pass is still the softmax, we should be fine?). Could it also interact with the entropy-regularization problem DiffPool has early in training, where all nodes seem to get nearly the same assignment values?
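
A rough sanity-check sketch of the "equal in expectation" claim in point 2 (illustrative only, with made-up sizes and sample budget): averaging many hard Gumbel-Softmax samples recovers the soft `softmax(logits)` assignment that DiffPool would use directly.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, K, runs = 4, 3, 20_000                     # hypothetical sizes and sample budget
logits = torch.randn(N, K)

soft_S = F.softmax(logits, dim=-1)            # DiffPool-style soft assignment
hard_samples = torch.stack(
    [F.gumbel_softmax(logits, tau=1.0, hard=True) for _ in range(runs)]
)
hard_mean = hard_samples.mean(dim=0)          # Monte-Carlo estimate of E[S]

# The gap shrinks as `runs` grows; entries with tiny p_i barely matter, which is
# why a moderate number of runs can already approximate the soft assignment well.
print((soft_S - hard_mean).abs().max())       # small, on the order of 1e-2
```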

Potential Things:

  1. Because we are memory efficient, we can scale up by making the model deeper, which may improve performance. This could have a huge impact.
  2. Because our $S$ matrix is now binary, the adjacency matrix of the clusters will also be binary, so we can save further space by using compact dtypes (see the sketch after this list).
  3. The learnt node features should become crisper (more quantized) instead of being a fuzzy representation of all neighbouring nodes, and the graph structure will also be crisper instead of being a superposition of all possible layouts.
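
An illustrative sketch of point 2, with made-up sizes and a random toy graph (not the project's code): once $S$ is binary, the pooled cluster adjacency can be thresholded and stored as `bool` instead of `float32`, a 4x reduction per entry.

```python
import torch

N, K = 1000, 50                                          # hypothetical graph and cluster sizes
S = torch.zeros(N, K, dtype=torch.bool)
S[torch.arange(N), torch.randint(0, K, (N,))] = True     # binary S: one cluster per node

A = torch.rand(N, N) < 0.01                              # toy binary adjacency (~1% density)
A_pool = (S.t().float() @ A.float() @ S.float()) > 0     # binary (K, K) cluster adjacency

print(A_pool.dtype, A_pool.element_size())               # torch.bool, 1 byte per entry
print(torch.zeros(K, K).element_size())                  # 4 bytes per entry for float32
```
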
AnimeshSinha1309 commented 2 years ago

We are pretty sure that we can go deeper, so closing this issue and proceeding with development.