cmu-phil / tetrad

Repository for the Tetrad Project, www.phil.cmu.edu/tetrad.
GNU General Public License v2.0

Make test/score that will work for algebraically defined nonlinear models. #1669

Open uvnikgupta opened 1 year ago

uvnikgupta commented 1 year ago

Loading the attached csv throws the following exception:

Infer demiliter for file: 20_nodes_normal.csv
Exception in thread "AWT-EventQueue-0" java.lang.NoSuchMethodError: java.nio.ByteBuffer.clear()Ljava/nio/ByteBuffer;
    at edu.pitt.dbmi.data.reader.util.TextFileUtils.inferDelimiter(TextFileUtils.java:135)
    at edu.cmu.tetradapp.editor.LoadDataSettings.getInferredDelimiter(LoadDataSettings.java:882)
    at edu.cmu.tetradapp.editor.LoadDataSettings.basicSettings(LoadDataSettings.java:503)
    at edu.cmu.tetradapp.editor.LoadDataDialog.showDataLoaderDialog(LoadDataDialog.java:165)
    at edu.cmu.tetradapp.editor.LoadDataAction.actionPerformed(LoadDataAction.java:91)
    at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2022)
    at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2348)
    at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:402)
    at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:259)
    at javax.swing.AbstractButton.doClick(AbstractButton.java:376)
    at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:842)
    at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:886)
    at java.awt.Component.processMouseEvent(Component.java:6539)
    at javax.swing.JComponent.processMouseEvent(JComponent.java:3324)
    at java.awt.Component.processEvent(Component.java:6304)
    at java.awt.Container.processEvent(Container.java:2239)
    at java.awt.Component.dispatchEventImpl(Component.java:4889)
    at java.awt.Container.dispatchEventImpl(Container.java:2297)
    at java.awt.Component.dispatchEvent(Component.java:4711)
    at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4904)
    at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4535)
    at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4476)
    at java.awt.Container.dispatchEventImpl(Container.java:2283)
    at java.awt.Window.dispatchEventImpl(Window.java:2746)
    at java.awt.Component.dispatchEvent(Component.java:4711)
    at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:760)
    at java.awt.EventQueue.access$500(EventQueue.java:97)
    at java.awt.EventQueue$3.run(EventQueue.java:709)
    at java.awt.EventQueue$3.run(EventQueue.java:703)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:74)
    at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:84)
    at java.awt.EventQueue$4.run(EventQueue.java:733)
    at java.awt.EventQueue$4.run(EventQueue.java:731)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:74)
    at java.awt.EventQueue.dispatchEvent(EventQueue.java:730)
    at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:205)
    at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:116)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:105)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:93)
    at java.awt.EventDispatchThread.run(EventDispatchThread.java:82)

20_nodes_normal.csv

jdramsey commented 1 year ago

Actually your file didn't come through; you may need to zip it before attaching it (I've found)...

jdramsey commented 1 year ago

One second I found your link...

jdramsey commented 1 year ago

Ah. It's not a covariance matrix. You can load it as tabular data--see the picture I took.

[Screenshot 2023-07-26 at 3 26 00 PM]

jdramsey commented 1 year ago

Hold on, sorry, you didn't actually say it was a covariance matrix. But huh, it loads for me..... can you tell me more about how you're trying to load it?

kvb2univpitt commented 1 year ago

@uvnikgupta What version of Java are you using?

uvnikgupta commented 1 year ago

Java version:
openjdk version "1.8.0_332"
OpenJDK Runtime Environment (Temurin)(build 1.8.0_332-b09)
OpenJDK 64-Bit Server VM (Temurin)(build 25.332-b09, mixed mode)

I am launching the jar using: java -Xmx2G -jar tetrad-gui-7.4.0-launch.jar

[image] [image] [image]

jdramsey commented 1 year ago

Thanks for the update. Sorry, I was multitasking yesterday. This is a bug we know about (thanks @kvb2univpitt). The issue (if you want to know) is that the ByteBuffer class changed after 1.8: from Java 9 on, methods like clear() and position() are overridden to return ByteBuffer rather than Buffer, so a jar compiled against a newer JDK looks for a method signature that 1.8 doesn't have. It's this bug:

https://www.morling.dev/blog/bytebuffer-and-the-dreaded-nosuchmethoderror/

except in your case it's the clear() method that's the problem and not the position() method. You're using OpenJDK 1.8, I'm guessing on a Linux box? (Actually can you confirm that?) What I'll do (sorry, just trying different things here) is the casting they suggest in the article, to see if it will work in OpenJDK 1.8 for me. (It needs to work both for 1.8 and for > 1.8, unfortunately, which is the issue.) Unfortunately I'm on a Mac at the moment and the only JDK 1.8 I can get anymore is Amazon's, and it's not a problem there. When I get back home today I'll try installing OpenJDK 1.8 on my Windows laptop (I think I can still do that, though I can no longer get it from M$) and test it there. But really what I need to do is test it on Linux, using OpenJDK 1.8, and I don't have a Linux box currently.

If I made you a version (or maybe two versions) to test, would you be willing to try them out on your machine? That would help a lot.

uvnikgupta commented 1 year ago

@jdramsey, Thanks a lot for explaining the issue. I am using Windows 10. Yes, I am OK to try the test versions.

jdramsey commented 1 year ago

Awesome--Let me grab the Mac version now and test it, and then I can download the Windows one later and test it there. Fingers crossed! We (well @kvb2univpitt) were thinking of rewriting that section of code without using ByteBuffer, but hopefully this fixes it without that effort.

jdramsey commented 1 year ago

Actually they're not providing any Mac options--it's in their selector but you only get Windows options in the list. I'm at the office right now but can do this later when I get home; my Windows laptop is there.

I just tested it using Amazon's Corretto 1.8 on Mac and it works there, though I suspect Amazon may have gone in and fixed the issue internally.

jdramsey commented 1 year ago

Oh hold on, they did have it! It's just that their dropdown was broken; I had to select "all" and then the Mac options showed up. I tested it--it works! That gives me some confidence that it will work on Windows as well using a Windows 1.8 download from this site, but I can test it later.

kvb2univpitt commented 1 year ago

The problem goes away if you use Java 11 and above.

jdramsey commented 1 year ago

@kvb2univpitt I am motivated to figure it out because we have users who are not in a position to grab a newer version of Java. I may have figured it out though--I'll let you know! I'm going to test it now on Windows.

uvnikgupta commented 1 year ago

I am one of those in that group :)

kvb2univpitt commented 1 year ago

@jdramsey We definitely need to get rid of the ByteBuffer. By "we" I mean "me".

jdramsey commented 1 year ago

@uvnikgupta @kvb2univpitt Could you both try to break this version? I.e., launch it, try to load a dataset...

https://s01.oss.sonatype.org/content/repositories/snapshots/io/github/cmu-phil/tetrad-gui/7.4.0-SNAPSHOT/tetrad-gui-7.4.0-20230728.001143-5-launch.jar

If it works I will tell you what I did.

uvnikgupta commented 1 year ago

Sure. On it :)

Tried different datasets and it seems to work pretty fine now 👍 Thanks for the quick fix

uvnikgupta commented 1 year ago

Tried a few more and data loading + Search works flawlessly. The only issue now is that the resulting graph is nowhere close to the actual graph :( I guess that is the state of the existing discovery algorithms due to the nature of the problem.

jdramsey commented 1 year ago

I'm very curious what experience Kevin has. I compiled this under Corretto 1.8 and have no trouble running under 1.8 or 11 on my Mac, so if you have no trouble on Windows, I'll try under 11 under Windows.

Not sure what to say about the content. Maybe if you tell me the general nature of the problem and what you've tried I could comment?

uvnikgupta commented 1 year ago

I am loading the data and connecting to the search box. Then executing search using different algorithms. Finally comparing the result with the actual DAG. The data and the actual DAG are attached for your reference: 20_nodes_normal.csv [image]

BTW, I encountered a Null pointer issue when I tried to use the "Regression" [image]

cg09 commented 1 year ago

Are these Gaussian variables? With what sample size?

On Thu, Jul 27, 2023 at 9:28 PM kelearin @.***> wrote:

I am loading the data and connecting to the search box. Then executing search using different algorithms. Finally comparing the result with the actual DAG. The data and the actual DAG is attached for your reference 20_nodes_normal.csv https://github.com/cmu-phil/tetrad/files/12189230/20_nodes_normal.csv [image: image] https://user-images.githubusercontent.com/20485662/256699118-c585c8fe-048a-4e90-bbe4-969c12ddf0b8.png

BTW, I encountered a Null pointer issue when I tried to use the "Regression" [image: image] https://user-images.githubusercontent.com/20485662/256699297-6ae9f1fb-2b24-46e3-ae23-fff1296432f0.png

uvnikgupta commented 1 year ago

I am not able to attach my data generator .py file, so below are the formulae:

"A1": "0.0",
"A2": "0.0",
"A3": "0.0",
"A4": "0.0",
"A5": "0.0",
"A6": "0.0",
"A7": "0.0",
"A8": "0.0",
"B1": 'data_2["A1"]**2',
"B2": 'data_2["A1"]',
"C2": 'np.sqrt(np.abs(data_2["B1"]))',
"C3": 'data_2["B1"] * data_2["B2"]',
"D2": 'data_2["C2"]**2 + data_2["C3"] - data_2["A2"]**2',
"C4": 'data_2["B2"]**3',
"D3": 'np.sqrt(np.abs(data_2["C4"]))',
"B3": 'data_2["A4"]**2 + data_2["A5"]',
"C1": 'data_2["B3"]**2',
"D1": 'np.round(np.mod(1000*data_2["C1"], 10), 3)',
"E1": 'np.abs(data_2["A3"])**2/(data_2["D1"] + .001)',
"F1": '2*data_2["D2"] + data_2["D3"] - data_2["E1"]*data_2["A6"] + 8*data_2["A7"]/data_2["A8"]'

I add np.random.normal(loc=5, scale=1, size=self.size) to each of the variables above.
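For readers who want to reproduce data of this shape without the original script, here is a minimal sketch of the 20 formulae above. It is not the author's DataGenerator class: it uses only the Python standard library instead of numpy/pandas, the function names `generate_row` and `generate` are hypothetical, and it assumes the N(5, 1) noise is added after each formula is evaluated. The exponent and product operators are written explicitly as ** and *; that reading is an assumption wherever the formulae do not show an operator.

```python
import math
import random

ROOTS = ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8"]


def generate_row(rng):
    """Draw one sample; every variable gets its own N(5, 1) noise term."""

    def noise():
        return rng.gauss(5.0, 1.0)

    v = {name: 0.0 + noise() for name in ROOTS}  # roots are pure noise
    v["B1"] = v["A1"] ** 2 + noise()
    v["B2"] = v["A1"] + noise()
    v["C2"] = math.sqrt(abs(v["B1"])) + noise()
    v["C3"] = v["B1"] * v["B2"] + noise()
    v["D2"] = v["C2"] ** 2 + v["C3"] - v["A2"] ** 2 + noise()
    v["C4"] = v["B2"] ** 3 + noise()
    v["D3"] = math.sqrt(abs(v["C4"])) + noise()
    v["B3"] = v["A4"] ** 2 + v["A5"] + noise()
    v["C1"] = v["B3"] ** 2 + noise()
    # Python's % with a positive divisor is non-negative, like np.mod.
    v["D1"] = round((1000 * v["C1"]) % 10, 3) + noise()
    v["E1"] = abs(v["A3"]) ** 2 / (v["D1"] + 0.001) + noise()
    v["F1"] = (2 * v["D2"] + v["D3"] - v["E1"] * v["A6"]
               + 8 * v["A7"] / v["A8"] + noise())
    return v


def generate(n, seed=0):
    """Return n independent rows as dicts keyed by variable name."""
    rng = random.Random(seed)
    return [generate_row(rng) for _ in range(n)]
```

Calling `generate(1000)` gives a list of dicts that can be written out with `csv.DictWriter` and loaded into Tetrad as tabular data.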

jdramsey commented 1 year ago

They are not terribly Gaussian. By the way @uvnikgupta if you'd like to switch to email I'm happy. @cg09 if you load up the data that was sent in the version of Tetrad given above and use the Plot Matrix tool you can see the distributions of the variables.

uvnikgupta commented 1 year ago

yes, I can share my data generation python code then. Please DM me at

jdramsey commented 1 year ago

That's what I thought--nonlinear algebraic functions generated them...You know we were just thinking of how to incorporate this sort of nonlinear additivity into a fast score...

cg09 commented 1 year ago

What sort of "non-linear algebraic" functions?

uvnikgupta commented 1 year ago

The formula simplifies to: 2*D2 + D3 - E1*A6 + 8*A7/A8

I hope you received the Python scripts I shared. You can generate data of any size with that script. Just modify the size parameter in the instantiations of the DataGenerator class under if __name__ == "__main__", create a data folder and run "python data_generator.py". Of course you have to pip install pandas, numpy and scipy.

On Fri, Jul 28, 2023 at 12:31 AM Joseph Ramsey @.***> wrote:

b1 = a2^2

==> ln(b1) = 2 ln(a2)

b2 = a1

Singularity; you'll need to remove one of the two columns or teach your algorithm to deal with it. But you can't use regression here in any form. (This is why the regression check is failing, above, BTW.)

c2 = sqrt(abs(b1))

==> Hmmm... you need to check a symmetric function here of b1 to find the dependency.

c3 = b1 * b2

==>ln(c3) = ln(b1) + ln(b2)

c2^2 + c3 - a2^2

==> Logging won't help here for the entire function! But logging c2 and logging b2 separately would help if you knew to do that! Hmmm...

c4 = b2^3

==> ln(c4) = 3 * ln(b2)...no problem.

sqrt(|c4|)

==> Another symmetric function.

b3 = a4^2 + a5

==> Logging a4 separately would have helped.

c1 = b3^2

==> Logging solves this.

"D1": 'np.round(np.mod(1000data_2["C1"], 10), 3)',

Not sure how to describe this one in words yet, I'll come back to it.

==> NO HELP HERE! You need to resort to a generalized score I think!!! Ugh, slow!!!

"E1": 'np.abs(data_2["A3"])**2/(data_2["D1"] + .001)',

abs(a3)^2 / (d1 + 0.001).

==> Heuristically I would still log this :-) 2 * ln(abs(a3)) - ln(d1 + 0.001)

"F1": '2data_2["D2"] + data_2["D3"] - data_2["E1"]data_2["A6"] + 8data_2["A7"]/data_2["A8"]'

2 * d2 + d3 - .... what is that? e1 a6?? + 8 a7 / a8? I have to check what concatenating variables in Python does... string concatenation?????!

==> I still have no idea what this even means yet, lol!!! :-D
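The "logging solves this" cases above can be checked numerically. The sketch below is an illustration only: it assumes strictly positive inputs and leaves out the additive noise term, so the log-linear relation is exact here, whereas in the noisy data it would only hold approximately. The helper names are hypothetical.

```python
import math
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def log_slope_of_cube(n=1000, seed=0):
    """c4 = b2^3 becomes ln(c4) = 3 * ln(b2), a line with slope 3."""
    rng = random.Random(seed)
    b2 = [rng.uniform(0.5, 5.0) for _ in range(n)]
    c4 = [b ** 3 for b in b2]
    return ols_slope([math.log(b) for b in b2], [math.log(c) for c in c4])


def max_product_residual(n=1000, seed=0):
    """c3 = b1 * b2 becomes ln(c3) = ln(b1) + ln(b2); return the worst gap."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(n):
        b1 = rng.uniform(0.5, 5.0)
        b2 = rng.uniform(0.5, 5.0)
        c3 = b1 * b2
        worst = max(worst, abs(math.log(c3) - math.log(b1) - math.log(b2)))
    return worst
```

The same trick does not apply to sums such as c2^2 + c3 - a2^2, since the logarithm of a sum does not split into a sum of logarithms.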

jdramsey commented 1 year ago

Sorry I haven't gotten back to you--we're all at the UAI conference here in Pittsburgh. I thought about the 1.8 issue and think the thing to do is to publish a separate version compiled under 1.8. I'm going to try to get this done today.

uvnikgupta commented 1 year ago

Yes, I was starting to wonder :) The agenda for the UAI conference sounds really cool. I have never attended any of its conferences but I can imagine the energy in that environment. I hope I am able to attend some day.

Coming back to the topic, I already have your working version for 1.8 so I am not really waiting for an official release. I am now more interested in figuring out why the algorithms are not performing well and how to tweak the data or the algorithm parameters to reproduce most of the DAG, if not fully.

Regards Uvnik

jdramsey commented 1 year ago

Sorry for the delay--we had a couple of dissertation defenses in the last week. Getting back to this.

I need to look at your Python code more carefully to see what assumptions are being honored. It wasn't clear to me on my first gander.

We had made a nonlinear simulator using Gaussian processes (and additive simulation) and GRaSP/BOSS did pretty well on that, but when we looked at the distributions, all of the functions had linear trends. It's been noticed in the past (I can get you a reference) that linear Gaussian scores like LG BIC tend to do OK whenever there are linear trends, and besides this, GRaSP/BOSS tend to do OK under a rather significant weakening of the faithfulness assumption, so some "sins" can be forgiven by the procedure. What I know will give the procedure difficulty are the square and absolute value functions you use, which give dependencies but not because of linear trends. I'm wondering if you took those out how well the algorithms would do?
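A small sketch of the point about linear trends, using the Pearson correlation as a rough stand-in for "what a linear Gaussian score can see". This is an illustration under simple assumptions (one Gaussian parent, no added noise), not anything Tetrad computes; `pearson` and `linear_trend` are hypothetical helper names.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def linear_trend(transform, mean, n=20000, seed=0):
    """Correlation between X ~ N(mean, 1) and transform(X)."""
    rng = random.Random(seed)
    xs = [rng.gauss(mean, 1.0) for _ in range(n)]
    return pearson(xs, [transform(x) for x in xs])
```

With the parent centred at zero, `linear_trend(lambda x: x * x, 0.0)` and `linear_trend(abs, 0.0)` come out close to zero even though the child is a deterministic function of the parent. With the parent centred at 5, the square is almost linear over the sampled range, so how much a square or absolute value hurts depends on where the inputs sit.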

jdramsey commented 1 year ago

@uvnikgupta Wondering, have you had a chance to look at this?

cg09 commented 1 year ago

You are beset with new Tetrad problems. Sorry.

jdramsey commented 1 year ago

Oh, I'm just trying to review outstanding issues and see what needs to be done. This particular issue involves trying to generalize to more algebraic functional forms for larger models, something I'm interested in and thinking of how to do.

jdramsey commented 1 year ago

I mean we do have the KCI general independence test, but it won't scale far enough for the problems suggested here. Also, it would be good to have a general score, and we've never implemented Biwei's general score in Tetrad, but Biwei's score won't handle these problems; there are too many variables, and the sample sizes are too large. I've been thinking about scores that are more general than LG but perhaps not completely general, which could handle a variety of distributions (but perhaps not all) and might be fast. I ask everyone I talk to whether they can think of such scores but no takers so far. I agree though it would be nice to have and a contribution to the literature.

cg09 commented 1 year ago

I am at a conference on ecology and causality. They are all about identifying unmeasured intermediate variables between input and output, but have no clue how to do it. I have data on turtles and soon to have data on penguins, I think.

Clark

jdramsey commented 1 year ago

Interesting....

jdramsey commented 1 year ago

@uvnikgupta Sorry, I ended up with so many things to do at the beginning of term that I was losing track of them in my head. Let me write this one down so I can work on it some.

(I made a long to-do list recently and ordered it in terms of priorities. I think this is going to help.)

jdramsey commented 1 year ago

@uvnikgupta Let me characterize the problem this way. Is there a test/score that could be used that would recover at least approximately the correct DAG when the data are generated with simple combinations of functions? What combinations can work and which can't?

Is that fair?

jdramsey commented 1 year ago

@uvnikgupta Perhaps one of us should look to see if there's any literature on this already.

uvnikgupta commented 1 year ago

@jdramsey Sorry, I am not sure if I understand your question completely. Are we trying to find a score that would compare a set of equations to the generated DAG? If yes, then I do not understand why. The reason being that if I know the equations, I can already create the original DAG and then use scores like SHD to compare the generated vs the original graph.
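For reference, a minimal sketch of the structural Hamming distance (SHD) mentioned above, for two DAGs given as sets of (parent, child) pairs. Conventions differ between papers; this version counts a missing edge, an extra edge, and a reversed edge as one error each, which is an assumption rather than the definition used by any particular tool.

```python
def shd(true_edges, est_edges):
    """Structural Hamming distance between two DAGs.

    Each graph is an iterable of (parent, child) tuples. Missing, extra,
    and reversed edges each count as one error.
    """
    true_edges = set(true_edges)
    est_edges = set(est_edges)
    true_skeleton = {frozenset(edge) for edge in true_edges}
    est_skeleton = {frozenset(edge) for edge in est_edges}
    # Adjacencies present in exactly one of the two graphs.
    adjacency_errors = len(true_skeleton ^ est_skeleton)
    # Shared adjacencies whose direction is flipped in the estimate.
    reversals = sum(1 for (a, b) in est_edges
                    if (b, a) in true_edges and (a, b) not in true_edges)
    return adjacency_errors + reversals


# A fragment of the true graph implied by the formulae in this thread.
TRUE_FRAGMENT = {("A1", "B1"), ("A1", "B2"), ("B1", "C2"),
                 ("B1", "C3"), ("B2", "C3")}
```

For example, an estimate that drops B1 -> C2, adds C2 -> C3, and reverses A1 -> B2 is at distance 3 from `TRUE_FRAGMENT`.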

jdramsey commented 1 year ago

@uvnikgupta That is, does anyone have a strategy for searching a dataset with > 20 variables where the variables are generated by an SEM with the kinds of functions you're using? Also, with the sample sizes you have in mind?

You could use a general test like KCI, but it won't scale that far.
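To give a feel for why kernel methods struggle at scale, here is a sketch of HSIC, a simpler unconditional kernel dependence measure. It is swapped in purely for illustration: it is not KCI and not Tetrad's implementation, and the Gaussian kernel with bandwidth 1 is an arbitrary choice. Even this simple version builds two n-by-n matrices, so time and memory grow quadratically with the sample size, and a conditional test needs further matrix operations on top of that.

```python
import math
import random


def centered_gram(values, bandwidth=1.0):
    """Gaussian kernel matrix of a 1-D sample, doubly centred (H K H)."""
    n = len(values)
    gram = [[math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))
             for b in values] for a in values]
    row_means = [sum(row) / n for row in gram]
    grand_mean = sum(row_means) / n
    # The Gram matrix is symmetric, so column means equal row means.
    return [[gram[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]


def hsic(xs, ys):
    """Biased HSIC estimate, trace(K H L H) / n^2."""
    n = len(xs)
    k = centered_gram(xs)
    l = centered_gram(ys)
    return sum(k[i][j] * l[i][j] for i in range(n) for j in range(n)) / (n * n)


def demo(n=200, seed=0):
    """HSIC of X against X^2 and against an unrelated Gaussian sample."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    squared = [x * x for x in xs]
    unrelated = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return hsic(xs, squared), hsic(xs, unrelated)
```

`demo()` returns the statistic for the dependent pair first; it should be clearly larger than the one for the unrelated pair, even though X and X^2 have almost no linear correlation when X is centred at zero.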