cmu-phil / tetrad

Repository for the Tetrad Project, www.phil.cmu.edu/tetrad.
GNU General Public License v2.0

Recode LiNGAM. #61

Closed jdramsey closed 5 years ago

jdramsey commented 8 years ago

See Issue #60.

jdramsey commented 8 years ago

First FastICA needs to be recoded and recalibrated. Then we need to get Matlab code for LiNGAM and calibrate a Java version of that to the Matlab code. We need an actual unit test of this so it doesn't break.

jdramsey commented 7 years ago

Kun has a better idea.

jdramsey commented 5 years ago

OK, I am looking at LiNGAM. After some work I'm getting pretty good results orienting a 3-variable collider model, which wasn't happening before. I went back to the original paper and once again tried to put the code in line with that paper. I think I did a better job this time. This is a trace. X2 is the child; the order of the variables here is not randomized; I'll try that next.

Here is the paper.

shimizu06a.pdf

Here is the trace (for the algorithm in Section 4, Algorithm A):

Simulating dataset #1

    W =
            V1       V2       V3
       -0.9724   0.0972  -0.4970
       -0.5890  -0.0349   0.9793
       -0.4109   1.1341   0.7451

    30.992942547104565 [0, 1, 2]
    2.9312993708222947 [0, 2, 1]

    perm1 = [0, 2, 1]

    WTilde before normalization =
            V1       V2       V3
       -0.9724   0.0972  -0.4970
       -0.4109   1.1341   0.7451
       -0.5890  -0.0349   0.9793

    WPrime =
            V1       V2       V3
        1.0000  -0.1000   0.5111
       -0.3623   1.0000   0.6570
       -0.6014  -0.0357   1.0000

    BHat =
            V1       V2       V3
        0.0000   0.1000  -0.5111
        0.3623   0.0000  -0.6570
        0.6014   0.0357   0.0000

    0.4942973056906238 [0, 1, 2]
    0.924698559828673 [0, 2, 1]

    BTilde =
            V1       V2       V3
        0.0000  -0.5111   0.1000
        0.6014   0.0000   0.0357
        0.3623  -0.6570   0.0000

perm2 = [0, 2, 1]

[X1, X2, X3]

    /knowledge
    addtemporal
    1 X1
    2 X3
    3 X2

    forbiddirect

    requiredirect

Returning this graph:

Graph Nodes: X1;X2;X3

Graph Edges:

  1. X1 --> X2
  2. X3 --> X2

Graph Attributes: BIC: 698.522617

Graph Node Attributes: BIC: [X1: 200.177757;X2: 238.222708;X3: 260.122153]

I found some index errors and other such silly errors in the code. (Sorry, this was coded very badly originally, not by me. I had fixed a lot, but apparently not everything.) As you can see, if W is right (that comes from ICA), WTilde is doing the right thing (permuting the rows of W so that the biggest entries are along the diagonal), WPrime is right (normalizing each row by its diagonal element), BHat is right (I - WPrime), and BTilde is right (permuting both rows and columns of BHat so that the sum of squared elements in the lower triangle is largest). So that part of the algorithm is right.
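The post-ICA steps just described can be sketched as follows (illustrative numpy, not Tetrad's Java code; the brute-force permutation search matches the trace above but is exponential, so it is only feasible for small k):

```python
import itertools
import numpy as np

def lingam_permute(W):
    """Sketch of the post-ICA steps (Shimizu et al. 2006, Algorithm A).
    W is the k x k ICA unmixing matrix. Illustrative only."""
    k = W.shape[0]
    # Step 1: WTilde -- permute rows of W to put large entries on the diagonal,
    # by minimizing sum(1 / |w_ii|) over row permutations (the scores in the trace).
    perm1 = min(itertools.permutations(range(k)),
                key=lambda p: np.sum(1.0 / np.abs(W[list(p)].diagonal())))
    WTilde = W[list(perm1)]
    # Step 2: WPrime -- normalize each row by its diagonal element.
    WPrime = WTilde / np.diag(WTilde)[:, None]
    # Step 3: BHat = I - WPrime (zero diagonal).
    BHat = np.eye(k) - WPrime
    # Step 4: BTilde -- permute rows and columns of BHat together so that the
    # sum of squared lower-triangular entries is as large as possible.
    perm2 = max(itertools.permutations(range(k)),
                key=lambda p: np.sum(np.tril(BHat[np.ix_(p, p)]) ** 2))
    BTilde = BHat[np.ix_(perm2, perm2)]
    return BTilde, perm1, perm2
```

Running this on the W in the trace reproduces perm1 = perm2 = [0, 2, 1].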

So then it comes down to ICA. The problem with ICA is that it represents data internally in the "European" way, with variables as rows and cases as columns, whereas Tetrad has that flipped. Some transpositions had been done internally to try to adjust for this. I took all of those out, so it just follows the European convention; I'll document that. To use it you first need to transpose the data, which is fine: transposing doesn't make a copy of the data, so it's not expensive. So now all the matrices should be in the right orientation.
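Concretely, the convention looks like this (a hypothetical numpy illustration of the orientation issue, not Tetrad's code):

```python
import numpy as np

# Tetrad stores data with cases as rows and variables as columns.
rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 3))   # 1000 cases, 3 variables

# The recoded FastICA expects the "European" orientation: variables as rows,
# cases as columns. Transposing gives a view, not a copy, so it's cheap.
data_t = data.T
assert data_t.shape == (3, 1000)
```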

I just tried scrambling the order of the variables in the dataset; it still has pretty good signal but does screw up more often than when the variables are unscrambled. I think that's a problem with ICA. I'll try to think whether there's any bias in ICA.

jdramsey commented 5 years ago

Current performance, compared to FASK.

N = 1000 10 nodes, 10 edges, DAG Linear Fisher simulation using non-Gaussian errors. Columns in data set randomly reordered.

For ICA:

Max iterations = 1000
Deflation algorithm
Log cosh function
Tolerance = 1e-6

Algorithms:

  1. LiNGAM (Linear Non-Gaussian Acyclic Model)
  2. FASK using Fisher Z test

Graphs are being compared to the true DAG.

AVERAGE STATISTICS

All edges

  Sim  Alg  sampleSize  penaltyDiscount    alpha  thresholdAlpha    AP    AR   AHP   AHR  F1All  ExactlyRight     E
    1    1     1000.00             1.00        *               *  0.92  1.00  0.86  0.93   0.80          0.30  2.14
    1    2     1000.00                *  1.00E-3               *  1.00  1.00  0.99  0.99   0.99          0.90  0.01

There's clearly a signal for LiNGAM; it does pretty well. But it's much, much slower, and this is only 10 nodes. It probably can't go much more than 12 nodes. Also, it's not as accurate as FASK.
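For concreteness, the ICA settings listed above correspond roughly to the following configuration (shown here with scikit-learn's FastICA, which exposes analogous knobs; Tetrad's FastICA is a separate Java implementation, so this is only a stand-in):

```python
import numpy as np
from sklearn.decomposition import FastICA

# Toy data mirroring the setup above: non-Gaussian (Laplace) sources, linear mixing.
rng = np.random.default_rng(0)
S = rng.laplace(size=(1000, 3))        # 1000 cases x 3 sources
X = S @ rng.normal(size=(3, 3)).T      # mixed observations

# Deflation algorithm, log cosh contrast, tolerance 1e-6, max 1000 iterations.
ica = FastICA(algorithm="deflation", fun="logcosh", tol=1e-6, max_iter=1000,
              whiten="unit-variance", random_state=0)
S_hat = ica.fit_transform(X)           # estimated sources, cases x components
```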

@cg09

jdramsey commented 5 years ago

New clue. Raising the coefficient range to +/-(0.5, 1.2): now (our) LiNGAM and FASK are very close in performance. So (our) LiNGAM just doesn't do well with smaller coefficients. (Or FASK does worse with larger coefficients.)

10 nodes, 10 edges, random DAG 10 runs averaged Linear Fisher simulation using non-Gaussian errors. Columns in data set randomly reordered.

For ICA:

Max iterations = 1000
Deflation algorithm
Log cosh function
Tolerance = 1e-6

Algorithms:

  1. LiNGAM (Linear Non-Gaussian Acyclic Model)
  2. FASK using Fisher Z test

Graphs are being compared to the true DAG.

AVERAGE STATISTICS

All edges

  Sim  Alg  sampleSize  penaltyDiscount    alpha  thresholdAlpha    AP    AR   AHP   AHR     E
    1    1     1000.00             1.00        *               *  0.85  1.00  0.74  0.86  2.40
    1    2     1000.00                *  1.00E-3               *  0.94  0.91  0.75  0.73  0.01

cg09 commented 5 years ago

Since there is a kind of trade-off between estimated associations and sample sizes, and LiNGAM famously requires large sample sizes for accuracy, this is not surprising.

Clark


jdramsey commented 5 years ago

Here's an improvement. Same conditions; I've added RSkew (which does terribly here, not sure why) and R3 (which soars):

Algorithms:

  1. LiNGAM (Linear Non-Gaussian Acyclic Model)
  2. R3, entropy based pairwise orientation with initial graph from Fast adjacency search (FAS) using Fisher Z test
  3. RSkew with initial graph from Fast adjacency search (FAS) using Fisher Z test
  4. FASK using Fisher Z test

Graphs are being compared to the true DAG.

AVERAGE STATISTICS

All edges

  Sim  Alg  penaltyDiscount  alpha    AP    AR   AHP   AHR        E
    1    1             1.00      *  0.94  1.00  0.92  0.97     1.66
    1    2                *      *  0.97  0.94  0.97  0.94     0.04
    1    3                *      *  0.97  0.94  0.51  0.49  7.00E-3
    1    4                *   0.01  0.93  0.97  0.91  0.95  8.30E-3

Here, LiNGAM does pretty well. I've run this several times and it does pretty well each time. 10 nodes, 10 edges, N = 1000, variables randomly reordered each time, coefficients in +/-(0.2, 0.7), average of 10 runs, so the same problem as before.

I'm not sure everything that's different, but I did change a detail in the data-generating process. Before, standard deviations were (erroneously) being chosen for each sample, randomly. Now I pick a standard deviation and (correctly) apply it to the entire column. I've also played with various ways of doing the LiNGAM procedure--e.g., minimizing the upper triangle versus maximizing the lower triangle (they should be the same!). I've played with the quantity being minimized or maximized, though the suggested c^2 seems to work the best.
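The data-generation fix can be sketched like this (illustrative Python; the uniform range and the Laplace errors are assumptions for the example, not the simulator's actual choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Old (wrong): a fresh standard deviation drawn for every single sample,
# so the error scale varied row by row within one variable's column.
sd_each = rng.uniform(1.0, 3.0, size=n)
e_wrong = rng.laplace(size=n) * sd_each

# New (right): one standard deviation drawn per variable and applied
# to the entire column.
sd = rng.uniform(1.0, 3.0)
e_right = rng.laplace(size=n) * sd
```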

Sometimes LiNGAM does as well as FASK here, but usually it's still a little off. It's possible this is just a problem with running FASK in the 10-node, 10-edge regime; FASK usually gets more than a couple of edges wrong in this case. The order is being fixed, but maybe these mistakes are within the causal order. Maybe the original LiNGAM adjacency search should be explored.

Still, FASK beats LiNGAM somewhat on accuracy and handily on elapsed time.

But--yay! Progress!

Here are some more runs:

  Sim  Alg  penaltyDiscount  alpha    AP    AR   AHP   AHR        E
    1    1             1.00      *  0.94  1.00  0.89  0.94     1.57
    1    2                *      *  1.00  0.99  1.00  0.99     0.04
    1    3                *      *  1.00  0.99  0.48  0.47  8.80E-3
    1    4                *   0.01  0.94  0.99  0.93  0.98  8.00E-3

  Sim  Alg  penaltyDiscount  alpha    AP    AR   AHP   AHR        E
    1    1             1.00      *  0.95  0.99  0.92  0.94     1.58
    1    2                *      *  0.99  0.95  0.99  0.95     0.04
    1    3                *      *  0.99  0.95  0.44  0.42  7.00E-3
    1    4                *   0.01  0.95  0.95  0.88  0.89  7.60E-3

  Sim  Alg  penaltyDiscount  alpha    AP    AR   AHP   AHR        E
    1    1             1.00      *  0.91  0.99  0.82  0.89     1.61
    1    2                *      *  0.98  0.97  0.98  0.97     0.04
    1    3                *      *  0.98  0.97  0.44  0.44  7.30E-3
    1    4                *   0.01  0.91  0.97  0.89  0.95  8.50E-3

  Sim  Alg  penaltyDiscount  alpha    AP    AR   AHP   AHR        E
    1    1             1.00      *  0.95  1.00  0.91  0.96     1.56
    1    2                *      *  1.00  0.98  1.00  0.98     0.04
    1    3                *      *  1.00  0.98  0.51  0.50  7.00E-3
    1    4                *   0.01  0.98  0.98  0.98  0.98  8.60E-3

jdramsey commented 5 years ago

This is the ICA book, in case anyone wants to spend some delightful hours reading by the fire.

https://www.cs.helsinki.fi/u/ahyvarin/papers/bookfinal_ICA.pdf

jdramsey commented 5 years ago

Update. I discovered the alpha ("a") parameter for FastICA! I knew it was there and that it takes values between 1 and 2, and I tried 1 and I tried 2, unsuccessfully, but I didn't try 1.1! Here is 1.1:

AVERAGE STATISTICS

All edges

  Sim  Alg  penaltyDiscount  alpha    AP    AR   AHP   AHR        E
    1    1             1.00      *  0.97  1.00  0.95  0.97     1.67
    1    2                *      *  0.99  1.00  0.99  1.00     0.04
    1    3                *      *  0.99  1.00  0.50  0.50  7.20E-3
    1    4                *   0.01  0.96  1.00  0.94  0.98  8.50E-3

This is stunning! Details. This is N = 1000 as usual, with 10 nodes, 10 edges, DAG, with variables randomly reordered, alpha = 1.1. I tried several runs of this; it almost always comes out great. I can try some more values and come up with a good default for that.

Caveats: (a) This is with the log cosh function (tanh). Using the exp formula in the book decimates accuracy. (b) This is with the deflationary procedure; the parallel procedure needs to be cleaned up--in any case, it's currently way inferior.
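For reference, the two contrast-function derivatives in question, as given in the Hyvärinen et al. book (a sketch; the tunable alpha applies only to the log cosh case):

```python
import numpy as np

def g_logcosh(y, alpha=1.1):
    # Derivative of G(y) = (1/alpha) * log(cosh(alpha * y)): the "log cosh"
    # nonlinearity -- tanh, with the tunable alpha in [1, 2].
    return np.tanh(alpha * y)

def g_exp(y):
    # Derivative of G(y) = -exp(-y**2 / 2): the "exp" nonlinearity.
    return y * np.exp(-y ** 2 / 2.0)
```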

jdramsey commented 5 years ago

Sorry, the algorithms for that last table were the same as before. Here it is again with the algorithms:

Algorithms:

  1. LiNGAM (Linear Non-Gaussian Acyclic Model)
  2. R3, entropy based pairwise orientation with initial graph from Fast adjacency search (FAS) using Fisher Z test
  3. RSkew with initial graph from Fast adjacency search (FAS) using Fisher Z test
  4. FASK using Fisher Z test

Graphs are being compared to the true DAG.

AVERAGE STATISTICS

All edges

  Sim  Alg  penaltyDiscount  alpha    AP    AR   AHP   AHR        E
    1    1             1.00      *  0.97  1.00  0.95  0.97     1.67
    1    2                *      *  0.99  1.00  0.99  1.00     0.04
    1    3                *      *  0.99  1.00  0.50  0.50  7.20E-3
    1    4                *   0.01  0.96  1.00  0.94  0.98  8.50E-3

So now LiNGAM, FASK, and R3 are pretty much on par (Yay!!!). RSkew sucks (Boo!). Not sure why.

cg09 commented 5 years ago

Try it with different non-Gaussian noises?


jdramsey commented 5 years ago

Update: The EXP function (g(y) = y exp(-y^2 / 2)) now works as promised, at least this once. I have:

(a) Taken the alpha out of the formula, as suggested by the book.

(b) Increased the maximum number of iterations to 1500.

(c) Decreased the tolerance to 1e-4. (It was 1e-2.)

Same setup as last report. N = 1000 as usual, with 10 nodes, 10 edges, DAG, with variables randomly reordered, alpha = 1.1.

Algorithms:

  1. LiNGAM (Linear Non-Gaussian Acyclic Model)
  2. R3, entropy based pairwise orientation with initial graph from Fast adjacency search (FAS) using Fisher Z test
  3. RSkew with initial graph from Fast adjacency search (FAS) using Fisher Z test
  4. FASK using Fisher Z test

Graphs are being compared to the true DAG.

AVERAGE STATISTICS

All edges

  Sim  Alg  penaltyDiscount  alpha    AP    AR   AHP   AHR        E
    1    1             1.00      *  0.98  0.98  0.94  0.94     1.67
    1    2                *      *  1.00  0.94  1.00  0.94     0.04
    1    3                *      *  1.00  0.94  0.54  0.52  5.90E-3
    1    4                *   0.01  0.99  0.96  0.99  0.96  7.10E-3

This is stunning (as usual; I'm getting used to being stunned). EXP has to date been recalcitrant.

I will expose these parameters when I get a chance.

More runs:

  Sim  Alg  penaltyDiscount  alpha    AP    AR   AHP   AHR        E
    1    1             1.00      *  0.95  1.00  0.90  0.95     1.65
    1    2                *      *  0.98  0.99  0.98  0.99     0.04
    1    3                *      *  0.98  0.99  0.48  0.49  7.80E-3
    1    4                *   0.01  0.91  0.99  0.91  0.99  9.20E-3

  Sim  Alg  penaltyDiscount  alpha    AP    AR   AHP   AHR        E
    1    1             1.00      *  0.97  1.00  0.95  0.98     1.66
    1    2                *      *  0.98  0.96  0.98  0.96     0.04
    1    3                *      *  0.98  0.96  0.59  0.58  7.20E-3
    1    4                *   0.01  0.96  0.98  0.94  0.95  7.60E-3

This looks reliable.

jdramsey commented 5 years ago

Update: The PARALLEL code for the log cosh case is now working as well, so I got rid of the separate code for the exp case and just used the g function from the book, which switches between "log cosh" (tanh(y)) and "exp" (y exp(-y^2/2)) and so covers both cases.

Here's the log cosh case:

  Sim  Alg  penaltyDiscount  alpha    AP    AR   AHP   AHR        E
    1    1             1.00      *  0.99  1.00  0.99  1.00     1.48
    1    2                *      *  0.99  0.93  0.99  0.93     0.04
    1    3                *      *  0.99  0.93  0.53  0.50  8.40E-3
    1    4                *   0.01  0.91  0.97  0.89  0.95  8.50E-3

Here's the "exp" case:

All edges

  Sim  Alg  penaltyDiscount  alpha    AP    AR   AHP   AHR        E
    1    1             1.00      *  0.97  1.00  0.97  1.00     1.47
    1    2                *      *  0.97  0.96  0.97  0.96     0.04
    1    3                *      *  0.97  0.96  0.53  0.52  6.70E-3
    1    4                *   0.01  0.91  0.96  0.90  0.95  8.40E-3

As usual, LiNGAM is the first algorithm. So these are working now, blessing of blessings. This means all of the options for this implementation of ICA are now working. Nothing was done here except to adjust the parameters as indicated earlier. Tolerance is 1e-6 (I guess that's a little lower); max iterations is 2000; a is 1.1, as before.

I did make a change to LiNGAM. There are two optimizations over permutations. The first tries to maximize the absolute coefficients along the diagonal by permuting the rows of W. That was being done by minimizing the sum of 1/|c| over the diagonal elements c; I now just directly maximize the sum of |c|. The second optimization tries to make the upper triangle as small as possible; there, again, I just maximize the sum of |c| over the lower-triangular elements. This seems to work better. So far I see no point in doing it the original way. I might change my mind on that; it's easy to change back.
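The two diagonal criteria can be sketched like this (illustrative Python, not Tetrad's code; on a well-conditioned W they typically agree, as they do on the W from the earlier trace):

```python
import itertools
import numpy as np

def diag_perm(W, criterion="max_abs"):
    """Pick the row permutation putting large entries on W's diagonal (sketch).
    'min_recip' is the original criterion (minimize sum of 1/|w_ii|);
    'max_abs' is the simpler variant described above (maximize sum of |w_ii|)."""
    perms = itertools.permutations(range(W.shape[0]))
    if criterion == "min_recip":
        return min(perms, key=lambda p: np.sum(1.0 / np.abs(W[list(p)].diagonal())))
    return max(perms, key=lambda p: np.sum(np.abs(W[list(p)].diagonal())))
```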

jdramsey commented 5 years ago

OK, endgame for this. We have a choice of deflation or parallel, and a choice of log cosh (tanh) or exp. Just playing around with it, with a low tolerance, parallel with exp seems (??) to be just about perfect, so I'll choose that. That also has the advantage of taking the alpha parameter out of the interface, so the user doesn't have to tune it. I am configuring the algorithm for these choices.
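The chosen configuration corresponds roughly to the following (again illustrated with scikit-learn's FastICA as a stand-in for the Java implementation; the data here are toy assumptions):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
S = rng.laplace(size=(1000, 3))        # non-Gaussian sources
X = S @ rng.normal(size=(3, 3)).T      # mixed observations

# The endgame choices: parallel (symmetric) updates, the "exp" contrast
# (which has no alpha to tune), and a low tolerance.
ica = FastICA(algorithm="parallel", fun="exp", tol=1e-6, max_iter=2000,
              whiten="unit-variance", random_state=0)
S_hat = ica.fit_transform(X)
```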

I think it's fine to expose the max iterations and tolerance parameters to the users; they have good defaults, but the user may want to adjust those.

That leaves the penalty discount parameter. That's being exposed because we're using FGES for the pruning step. It seems to work fine. It works so well that I'm not going to code up the original pruning step unless someone complains for principled reasons.

I think that's it. LiNGAM is now fixed, SFAIK.

jdramsey commented 5 years ago

I'm treating the issue with RSkew as a separate, unconnected problem. @cg09 In any case, RSkew hasn't changed; I've noticed it is somewhat less general than the other non-Gaussian algorithms.

jdramsey commented 5 years ago

Also, @cg09 I don't think the above low accuracies with LiNGAM (which I think are fixed) had to do with low sample size. When you said that, the first thing I tried was raising the sample size; it didn't work. I think there were some coding mistakes, and the parameter choices were unfortunate.

jdramsey commented 5 years ago

Yay, we can finally close this issue! It's in development and has been checked!