FenTechSolutions / CausalDiscoveryToolbox

Package for causal inference in graphs and in the pairwise settings. Tools for graph structure recovery and dependencies are included.
https://fentechsolutions.github.io/CausalDiscoveryToolbox/html/index.html
MIT License

NCC code not coherent with the paper? #38

Open ArnoVel opened 4 years ago

ArnoVel commented 4 years ago

Hi, This could well be due to a misunderstanding on my part, but I thought the NCC framework was built from two MLPs, and MLPs typically do not use convolutions. However, the NCC code I found in CDT is the following:

        self.conv = th.nn.Sequential(th.nn.Conv1d(2, n_hiddens, kernel_size),
                                     th.nn.ReLU(),
                                     th.nn.Conv1d(n_hiddens, n_hiddens,
                                                  kernel_size),
                                     th.nn.ReLU())

In addition, the original paper recommends enforcing 1 - NCC(x1, x2) = NCC(x2, x1) (where [x1, x2] is an n-sample of pairs). They do this by using the composite output 0.5 * (1 - NCC(x2, x1) + NCC(x1, x2)). I am not sure which lines are responsible for this, if they exist?
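For reference, here is a minimal sketch of what I mean by the composite output. The `TinyNCC` model below is a hypothetical stand-in, not the CDT implementation; only the `symmetric_ncc` wrapper illustrates the 0.5 * (1 - NCC(x2, x1) + NCC(x1, x2)) construction, which guarantees 1 - NCC(x1, x2) = NCC(x2, x1) by design:

```python
import torch as th

class TinyNCC(th.nn.Module):
    # Hypothetical minimal classifier standing in for the NCC model:
    # embed each (x, y) sample, average the embeddings, score the average.
    def __init__(self, n_hiddens=8):
        super().__init__()
        self.embed = th.nn.Sequential(th.nn.Linear(2, n_hiddens), th.nn.ReLU())
        self.score = th.nn.Linear(n_hiddens, 1)

    def forward(self, pairs):  # pairs: (m, 2) tensor of (x, y) samples
        return self.score(self.embed(pairs).mean(dim=0))

def symmetric_ncc(model, pairs):
    """Composite output 0.5 * (1 - NCC(x2, x1) + NCC(x1, x2))."""
    p_xy = th.sigmoid(model(pairs))                  # NCC(x1, x2)
    p_yx = th.sigmoid(model(pairs.flip(dims=[1])))   # NCC(x2, x1): columns swapped
    return 0.5 * (1 - p_yx + p_xy)
```

Swapping the columns of `pairs` then gives exactly `1 - symmetric_ncc(model, pairs)`, which is the antisymmetry the paper asks for.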

Any help would be greatly appreciated. Regards, Arno V.

diviyank commented 4 years ago

Hi, The code is actually inspired by the author's code; the convolutions are the embedding layers (they are not Linear layers), see the end of p.4: 'φ(xij,yij) is the mean embedding of the empirical distribution 1/mi∑δ(xij,yij)'. I understand that it might be a bit cryptic, and I might also be wrong; we've tested a lot of (MLP-based) configurations and this one seems to be the one that actually learns to classify. Your other remark is correct; I should add the composite output to the model in the next release!
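One way to see the connection (my own sketch, not the CDT code): with `kernel_size=1`, a `Conv1d` over a `(batch, 2, m)` tensor applies the same affine map independently to every sample, i.e. it is exactly a pointwise `Linear` layer; a larger kernel additionally mixes neighbouring samples. A quick check of the `kernel_size=1` equivalence:

```python
import torch as th

m, n_hiddens = 10, 4
conv = th.nn.Conv1d(2, n_hiddens, kernel_size=1)
lin = th.nn.Linear(2, n_hiddens)

# Copy the conv weights into the linear layer so both compute the same map.
with th.no_grad():
    lin.weight.copy_(conv.weight.squeeze(-1))  # (n_hiddens, 2, 1) -> (n_hiddens, 2)
    lin.bias.copy_(conv.bias)

x = th.randn(1, 2, m)                # one batch of m (x, y) pairs, channels first
out_conv = conv(x)                   # (1, n_hiddens, m)
out_lin = lin(x.squeeze(0).t()).t()  # per-sample Linear, transposed back
assert th.allclose(out_conv.squeeze(0), out_lin, atol=1e-6)
```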

Best regards, Diviyan

ArnoVel commented 4 years ago

Hi again, Thanks for the detailed answer. My original question concerning point 1 could be rephrased as follows: why is the embedding network/layer better framed as Conv1d layers rather than the 'basic' combination of linear + nonlinearity? Does this come from experiments, or is it also taken verbatim from the paper?

I do understand the idea of modelling the kernel mean embedding's feature map with an MLP; it is the specific way of building that MLP which I don't understand!

Thanks again, Arno V.

diviyank commented 4 years ago

Hi again, I understand your concern; I had the same one before convincing myself it was correct. I'll look into it next week and get back to you!

ArnoVel commented 4 years ago

Hi, Would it be possible to have some kind of update? Thanks!

ArnoVel commented 4 years ago

Hi again, I'm reading the NCC code more carefully now, and I noticed something odd. You made the choice to use

criterion = th.nn.BCEWithLogitsLoss()

and therefore your network outputs only logits:

        self.conv = th.nn.Sequential(th.nn.Conv1d(2, n_hiddens, kernel_size),
                                     th.nn.ReLU(),
                                     th.nn.Conv1d(n_hiddens, n_hiddens,
                                                  kernel_size),
                                     th.nn.ReLU())
        # self.batch_norm = th.nn.BatchNorm1d(n_hiddens, affine=False)
        self.dense = th.nn.Sequential(th.nn.Linear(n_hiddens, n_hiddens),
                                      th.nn.ReLU(),
                                      th.nn.Linear(n_hiddens, 1)
                                      )

However, in the training loop, the way you compute accuracy is

acc = th.where(th.cat(output, 0).data.cpu() > .5,
               th.ones(len(output)),
               th.zeros(len(output))) - th.cat(labels, 0).data.cpu()

If I'm correct, this means you're thresholding the logits directly, since output is obtained as

[self.model(m.t().unsqueeze(0)) for m in batch]

and is therefore a list of logits... Usually we threshold the probabilities, not the logits, don't we? Just my two cents!

diviyank commented 4 years ago

I've asked the author about it, I hope he gets back to me

ArnoVel commented 4 years ago

Thanks. Additionally, I've been running some experiments, and it seems that learning FCM structure from clouds of points is highly nontrivial. Depending on the FCM, it occurred very often that the training accuracy (on 10k-20k synthetic datasets drawn from some cause, mechanism, and noise distribution) could be as high as 0.97, while the validation accuracy (on held-out datasets from the same distribution, as many as 4000 of them) stayed at random chance!

I would be interested in the author's opinion on this, if you ever get the chance.

siddsuresh97 commented 4 years ago

Hey, Is there any update on this issue?

diviyank commented 4 years ago

Hi, I haven't gotten any response from the original author. I will dig into the algorithm next month to try to close this issue. Feel free to make a pull request if you find a solution.