FenTechSolutions / CausalDiscoveryToolbox

Package for causal inference in graphs and in the pairwise setting. Tools for graph structure recovery and dependency analysis are included.
https://fentechsolutions.github.io/CausalDiscoveryToolbox/html/index.html
MIT License

FSGNN strange matrix multiplication #46

Closed ArnoVel closed 4 years ago

ArnoVel commented 4 years ago

Hello, I played with the FSGNN example available, modifying very few pieces of the code (most of it, especially the NN-related part, is the same as the original). However, after training, the whole thing crashes on a matrix multiplication. You can find a screen capture of the error message (inside the Jupyter notebook) below. [Screenshot from 2019-10-01 22-27-26]

I think you meant to write matrix_results = matrix_results.dot(matrix_results.T) or something like that? (matrix_results = matrix_results @ matrix_results.T works too, I believe); it would make sense, as (2,11) @ (11,2) is a valid matrix multiplication. Maybe A*B is now performing an element-wise product?
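
For illustration, here is a minimal NumPy sketch of my own (not the library's code): with a (2,11) array, the element-wise product against its transpose fails to broadcast, while a true matrix product is well defined.

    import numpy as np

    m = np.random.rand(2, 11)
    m @ m.T       # (2, 11) @ (11, 2) -> (2, 2): valid matrix product
    m.dot(m.T)    # same result via .dot
    # m * m.T     # (2, 11) * (11, 2): broadcasting error, as in the screenshot above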

Regards, Arno V.

ArnoVel commented 4 years ago

Additionally, I found that enabling GPU computation through the snippet you see here, as well as setting SETTINGS.NJOBS=16 as in your example, does not lead to faster training and testing (approx. 10 min for training + testing on my GeForce GTX 1060 6GB + Intel i7-8700 CPU @ 3.20GHz).
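
For reference, my configuration was roughly the following (a sketch, not a copy of the linked snippet; the SETTINGS attribute names are my understanding of cdt.SETTINGS and may differ in your version):

    import torch
    from cdt import SETTINGS

    SETTINGS.GPU = 1 if torch.cuda.is_available() else 0  # enable GPU computation (assumed flag)
    SETTINGS.NJOBS = 16                                    # as in the example notebook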

Would there be an obvious reason as to why this happens? I'm running everything on Ubuntu 18.04 at the moment.

ArnoVel commented 4 years ago

Lastly, is there a reason you have this import

 from .FSGNN import FSGNN_model

at the last line (line 146) of model.py? I'm trying to change everything myself, but for some reason this line still keeps the whole algorithm from running.

Thanks!

ArnoVel commented 4 years ago

Reporting after running the tweaked version: I only have one CPU, and setting NJOBS=16 as in the example had the (un)expected effect of halving the iterations/sec over the whole training run (it went from 16 it/s to 8 it/s). Why would this be expected behavior?

Finally, I am noticing that the iterations/sec on FSGNN are the same as in my past CGNN experiments. I have only a limited understanding of PyTorch performance for shallow nets, especially as a function of the input size when it is relatively small. Do you think it is normal to be stuck at 15 it/s on both CGNN (for TCEP) and FSGNN on Sachs? In both cases, I set train_epochs=1000 and test_epochs=500. On Sachs in the base configuration I expect 3*11 runs total (am I correct? nruns=3 for each node and 11 nodes in the graph) and therefore approximately 33 times 1 min 40 s, about 55 minutes to run the whole thing. On TCEP I have roughly 100 runs, leading to 100 times 1 min 40 s, about 2 h 45 min.
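
For context, the setup I am timing is essentially the following (a hedged sketch: parameter names are taken from the example notebook, and depending on the cdt version they may belong on the constructor or on predict()):

    import cdt
    from cdt.independence.graph import FSGNN

    data, graph = cdt.data.load_dataset('sachs')       # ~7000 samples, 11 variables
    fsgnn = FSGNN(train_epochs=1000, test_epochs=500)   # in some versions these go to predict() instead
    skeleton = fsgnn.predict(data)

    # Rough runtime estimate at ~1 min 40 s per run:
    #   Sachs: 3 runs * 11 nodes = 33 runs -> 33 * 100 s = 3300 s, about 55 min
    #   TCEP:  ~100 pairs                  -> 100 * 100 s = 10000 s, about 2 h 45 min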

Any information would be appreciated. Thanks, A.V.

diviyank commented 4 years ago

Thanks for bringing this up; I will check the implementation of FSGNN, and its performance, in more detail. I think it is linked to the Dataset fed to the models: I tried to make it more general using PyTorch tools, but I think performance took a hit. When on CPU, it is better to keep NJOBS at 1, because PyTorch already parallelizes over the CPU. On GPU, it depends on GPU usage, which you can check with the command nvidia-smi: if GPU usage is not at 100%, you can add more jobs per GPU; otherwise, it can be beneficial not to oversaturate your GPUs with parallel jobs, and thus to reduce NJOBS.
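
Roughly, that advice amounts to something like this (a sketch; the SETTINGS attribute names are assumed, check them against your cdt version):

    import torch
    from cdt import SETTINGS

    if torch.cuda.is_available():
        SETTINGS.GPU = 1    # use the GPU
        SETTINGS.NJOBS = 2  # increase only if nvidia-smi shows the GPU below 100% usage
    else:
        SETTINGS.GPU = 0
        SETTINGS.NJOBS = 1  # PyTorch already parallelizes over CPU cores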

Lastly, is there a reason you have this import

 from .FSGNN import FSGNN_model

at the last line (line 146) of model.py? I'm trying to change everything myself, but for some reason this line still keeps the whole algorithm from running.

Thanks!

This import is there to make the PyTorch model available to users as cdt.independence.graph.model.FSGNN_model.
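
That is, so that the following works for end users (path inferred from the import above):

    from cdt.independence.graph.model import FSGNN_model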

How does this import break your installation?

Finally, I am noticing that the iterations/sec on FSGNN are the same as in my past CGNN experiments. I have only a limited understanding of PyTorch performance for shallow nets, especially as a function of the input size when it is relatively small. Do you think it is normal to be stuck at 15 it/s on both CGNN (for TCEP) and FSGNN on Sachs? In both cases, I set train_epochs=1000 and test_epochs=500. On Sachs in the base configuration I expect 3*11 runs total (am I correct? nruns=3 for each node and 11 nodes in the graph) and therefore approximately 33 times 1 min 40 s, about 55 minutes to run the whole thing. On TCEP I have roughly 100 runs, leading to 100 times 1 min 40 s, about 2 h 45 min.

It seems a bit slow to me... what is your code version? I will investigate.

diviyank commented 4 years ago

Hello again,

After some tests, I obtain similar computational performance, which is quite logical actually: I didn't notice that you were testing on the Sachs dataset, which has ~7000 samples. FSGNN and (C)GNN both use the MMD metric to optimize their loss, which is quadratic in the number of samples. I think you didn't set a batch_size, so the MMD computation is consuming a lot.
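
Something like this should reduce the cost per step (a sketch; the batch_size parameter name is assumed from the cdt API and may differ across versions):

    from cdt.independence.graph import FSGNN

    # Full-batch MMD on Sachs compares ~7000 x 7000 sample pairs per step;
    # a batch of 500 only compares 500 x 500 pairs.
    fsgnn = FSGNN(train_epochs=1000, test_epochs=500, batch_size=500)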

What kind of dataset are you using to obtain your error at the beginning of the issue?

Best, Diviyan

diviyank commented 4 years ago

To come back to the matrix multiplication: the operation should be correct; what is done is the following:
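
Roughly (a reconstruction from the discussion below, not a verbatim quote of the code):

    # matrix_results is an (n, n) array with one dependency score per ordered pair of
    # variables; multiplying it element-wise by its transpose combines the i->j and
    # j->i scores into a single symmetric score.
    matrix_results *= matrix_results.T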

Best, Diviyan

ArnoVel commented 4 years ago

What kind of dataset are you using to obtain your error at the beginning of the issue?

Everything related to FSGNN came from the example I linked, which is the Sachs+FSGNN example. I changed so little about the original code (mostly cosmetic changes, playing around with networkx) that I'm surprised you don't have a similar issue.

What I did was almost as banal as

  1. installing CDT
  2. running the Jupyter notebook you provide
  3. seeing the error I reported

About the correctness of the operation and my specific error: as I understand it, matrix_results should be (11,11); however, it is (2,11) for some reason. On top of that, why would one rather take the product of the scores? I would rather average (arithmetic mean) the two independence scores, resulting in

matrix_results += matrix_results.T

rather than the current

matrix_results *= matrix_results.T

Moreover, the sum operation is consistent with what is usually understood as the symmetrization of a matrix M (which is 0.5*(M+M.T)).
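
A tiny numeric illustration of the two options (my own toy example, not the library's code):

    import numpy as np

    M = np.array([[0.0, 0.9],    # score(A -> B) = 0.9
                  [0.1, 0.0]])   # score(B -> A) = 0.1

    0.5 * (M + M.T)   # mean:    off-diagonal entries become 0.5
    M * M.T           # product: off-diagonal entries become 0.09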

Any clarification on the second point would be appreciated. Thanks!

diviyank commented 4 years ago

I'll try to investigate the first point further.

The second is a design choice: if a variable is needed to generate another in one direction but absolutely not in the other, we can suppose that the two variables are not directly connected in the graph. I think that back then we tried both and kept the best-performing solution, but I'm not sure.
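
For example (toy numbers of my own), a strong dependency score in one direction combined with a near-zero score in the other:

    0.9 * 0.05          # = 0.045 -> the product effectively prunes the edge
    0.5 * (0.9 + 0.05)  # = 0.475 -> the arithmetic mean would keep it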

ArnoVel commented 4 years ago

I'm not sure how 'dependency' is measured in the case of FSGNN; I'll check. But if any notion of directedness were embedded in the algorithm, then having a low score in one direction (A->B) but a high score in the other (B->A) would make sense.

Update on the bug: it somehow stopped occurring. I do not know why, as I just changed the *= operation back and forth...

diviyank commented 4 years ago

With FSGNN there is no notion of directedness of the edges, only bidirectional symmetric dependencies. I wasn't able to reproduce the bug either; I will be closing this issue, don't hesitate to reopen it if it arises again. Best, Diviyan