Make test/score that will work for algebraically defined nonlinear models.

cmu-phil / tetrad

Repository for the Tetrad Project, www.phil.cmu.edu/tetrad.

GNU General Public License v2.0


Make test/score that will work for algebraically defined nonlinear models. #1669

Open uvnikgupta opened 1 year ago

uvnikgupta commented 1 year ago

Loading the attached csv throws the following exception:

Infer demiliter for file: 20_nodes_normal.csv
Exception in thread "AWT-EventQueue-0" java.lang.NoSuchMethodError: java.nio.ByteBuffer.clear()Ljava/nio/ByteBuffer;
    at edu.pitt.dbmi.data.reader.util.TextFileUtils.inferDelimiter(TextFileUtils.java:135)
    at edu.cmu.tetradapp.editor.LoadDataSettings.getInferredDelimiter(LoadDataSettings.java:882)
    at edu.cmu.tetradapp.editor.LoadDataSettings.basicSettings(LoadDataSettings.java:503)
    at edu.cmu.tetradapp.editor.LoadDataDialog.showDataLoaderDialog(LoadDataDialog.java:165)
    at edu.cmu.tetradapp.editor.LoadDataAction.actionPerformed(LoadDataAction.java:91)
    at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2022)
    at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2348)
    at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:402)
    at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:259)
    at javax.swing.AbstractButton.doClick(AbstractButton.java:376)
    at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:842)
    at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:886)
    at java.awt.Component.processMouseEvent(Component.java:6539)
    at javax.swing.JComponent.processMouseEvent(JComponent.java:3324)
    at java.awt.Component.processEvent(Component.java:6304)
    at java.awt.Container.processEvent(Container.java:2239)
    at java.awt.Component.dispatchEventImpl(Component.java:4889)
    at java.awt.Container.dispatchEventImpl(Container.java:2297)
    at java.awt.Component.dispatchEvent(Component.java:4711)
    at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4904)
    at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4535)
    at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4476)
    at java.awt.Container.dispatchEventImpl(Container.java:2283)
    at java.awt.Window.dispatchEventImpl(Window.java:2746)
    at java.awt.Component.dispatchEvent(Component.java:4711)
    at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:760)
    at java.awt.EventQueue.access$500(EventQueue.java:97)
    at java.awt.EventQueue$3.run(EventQueue.java:709)
    at java.awt.EventQueue$3.run(EventQueue.java:703)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:74)
    at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:84)
    at java.awt.EventQueue$4.run(EventQueue.java:733)
    at java.awt.EventQueue$4.run(EventQueue.java:731)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:74)
    at java.awt.EventQueue.dispatchEvent(EventQueue.java:730)
    at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:205)
    at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:116)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:105)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:93)
    at java.awt.EventDispatchThread.run(EventDispatchThread.java:82)
20_nodes_normal.csv


    
            
            
                jdramsey
                commented
                 1 year ago            
            
                Actually your file didn't come through; you may need to zip it before attaching it (I've found)...
            
        
            
            
                jdramsey
                commented
                 1 year ago            
            
                One second I found your link...
            
        
            
            
                jdramsey
                commented
                 1 year ago            
            
                Ah. It's not a covariance matrix. You can load it as tabular data--see the picture I took.
            
        
            
            
                jdramsey
                commented
                 1 year ago            
            
                Hold on, sorry, you didn't actually say it was a covariance matrix. But huh, it loads for me..... can you tell me more about how you're trying to load it?
            
        
            
            
                kvb2univpitt
                commented
                 1 year ago            
            
                @uvnikgupta What version of Java are you using?
            
        
            
            
                uvnikgupta
                commented
                 1 year ago            
            
                Java version:
openjdk version "1.8.0_332"
OpenJDK Runtime Environment (Temurin)(build 1.8.0_332-b09)
OpenJDK 64-Bit Server VM (Temurin)(build 25.332-b09, mixed mode)
I am launching the jar using :
 java -Xmx2G -jar tetrad-gui-7.4.0-launch.jar


            
        
            
            
                jdramsey
                commented
                 1 year ago            
            
                Thanks for the update. Sorry, I was multitasking yesterday. This is a bug we know about (thanks @kvb2univpitt). The issue (if you want to know) is that Oracle changed the implementation of the ByteBuffer class so that it's incompatible between version 1.8 and versions > 1.8. It's this bug:
https://www.morling.dev/blog/bytebuffer-and-the-dreaded-nosuchmethoderror/
except in your case it's the clear() method that's the problem and not the position() method. You're using OpenJDK 1.8, I'm guessing on a Linux box? (Actually can you confirm that?) What I'll do (sorry just trying different things here) is the casting they suggest in the article to see if it will work in OpenJDK1.8 for me. (It needs to work both for 1.8 and for > 1.8 unfortunately, which is the issue.) Unfortunately I'm on a Mac at the moment and the only JDK 1.8 I can get anymore is Amazon's, and it's not a problem there. When I get back home today I'll try installing OpenJDK 1.8 on my Windows laptop (I think I can still do that, though I can no longer get it from M$) and test it there. But really what I need to do is test it on Linux, using OpenJDK 1.8, and I don't have a Linux box currently.
If I made you a version (or maybe two versions) to test, would you be willing to try them out on your machine? That would help a lot.
            
        
            
            
                uvnikgupta
                commented
                 1 year ago            
            
                @jdramsey, Thanks a lot for explaining the issue.
I am using Windows 10.
Yes, I am ok to try the test versions 
            
        
            
            
                uvnikgupta
                commented
                 1 year ago            
            
                Maybe you could get OpenJDK 8 from here: https://www.openlogic.com/openjdk-downloads?field_java_parent_version_target_id=416&field_operating_system_target_id=436&field_architecture_target_id=391&field_java_package_target_id=396
            
        
            
            
                jdramsey
                commented
                 1 year ago            
            
                Awesome--Let me grab the Mac version now and test it, and then I can download the Windows one later and test it there. Fingers crossed! We (well @kvb2univpitt) were thinking of rewriting that section of code without using ByteBuffer, but hopefully this fixes it without that effort.
            
        
            
            
                jdramsey
                commented
                 1 year ago            
            
                Actually they're not providing any Mac options--it's in their selector but you only get Windows options in the list. I'm at the office right now but can do this later when I get home; my Windows laptop is there.
I just tested it using Amazon's Corretto 1.8 on Mac and it works there, though I suspect Amazon may have gone in and fixed the issue internally.
            
        
            
            
                jdramsey
                commented
                 1 year ago            
            
                Oh hold on, they did have it! It's just that their dropdown was broken; I had to select "all" and then the Mac options showed up. I tested it--it works! That gives me some confidence that it will work on Windows as well using a Windows 1.8 download from this site, but I can test it later.
            
        
            
            
                kvb2univpitt
                commented
                 1 year ago            
            
                The problem goes away if you use Java 11 and above.
            
        
            
            
                jdramsey
                commented
                 1 year ago            
            
                @kvb2univpitt I am motivated to figure it out because we have users who are not in a position to grab a newer version of Java. I may have figured it out though--I'll let you know! I'm going to test it now on Windows.
            
        
            
            
                uvnikgupta
                commented
                 1 year ago            
            
                
@kvb2univpitt I am motivated to figure it out because we have users who are not in a position to grab a newer version of Java. I may have figured it out though--I'll let you know! I'm going to test it now on Windows.

I am one of those in that group :)
            
        
            
            
                kvb2univpitt
                commented
                 1 year ago            
            
                @jdramsey We definitely need to get rid of the ByteBuffer.  By "we" I mean "me".
            
        
            
            
                jdramsey
                commented
                 1 year ago            
            
                @uvnikgupta @kvb2univpitt Could you both try to break this version? I.e., launch it, try to load a dataset...
https://s01.oss.sonatype.org/content/repositories/snapshots/io/github/cmu-phil/tetrad-gui/7.4.0-SNAPSHOT/tetrad-gui-7.4.0-20230728.001143-5-launch.jar
If it works I will tell you what I did.
            
        
            
            
                uvnikgupta
                commented
                 1 year ago            
            
                
@uvnikgupta @kvb2univpitt Could you both try to break this version? I.e., launch it, try to load a dataset...
https://s01.oss.sonatype.org/content/repositories/snapshots/io/github/cmu-phil/tetrad-gui/7.4.0-SNAPSHOT/tetrad-gui-7.4.0-20230728.001143-5-launch.jar
If it works I will tell you what I did.

Sure. On it :)
Tried different datasets and it seems to work pretty fine now 👍
Thanks for the quick fix
            
        
            
            
                uvnikgupta
                commented
                 1 year ago            
            
                Tried a few more and data loading + Search works flawlessly. The only issue now is that the resulting graph is nowhere close to the actual graph :( I guess that is the state of the existing discovery algorithms due to the nature of the problem.
            
        
            
            
                jdramsey
                commented
                 1 year ago            
            
                I'm very curious what experience Kevin has. I compiled this under Corretto 1.8 and have no trouble running under 1.8 or 11 on my Mac, so if you have no trouble on Windows, I'll try under 11 under Windows.
Not sure what to say about the content. Maybe if you tell me the general nature of the problem and what you've tried I could comment?
            
        
            
            
                uvnikgupta
                commented
                 1 year ago            
            
                I am loading the data and connecting it to the search box, then executing search using different algorithms, and finally comparing the result with the actual DAG. The data and the actual DAG are attached for your reference.
20_nodes_normal.csv

BTW, I encountered a Null pointer issue when I tried to use the "Regression"
            
        
            
            
                cg09
                commented
                 1 year ago            
            
                Are these Gaussian variables?  With what sample size?
On Thu, Jul 27, 2023 at 9:28 PM kelearin @.***> wrote:

I am loading the data and connecting to the search box. Then executing
search using different algorithms. Finally comparing the result with the
actual DAG. The data and the actual DAG is attached for your reference
20_nodes_normal.csv
https://github.com/cmu-phil/tetrad/files/12189230/20_nodes_normal.csv
[image: image]
https://user-images.githubusercontent.com/20485662/256699118-c585c8fe-048a-4e90-bbe4-969c12ddf0b8.png
BTW, I encountered a Null pointer issue when I tried to use the
"Regression"
[image: image]
https://user-images.githubusercontent.com/20485662/256699297-6ae9f1fb-2b24-46e3-ae23-fff1296432f0.png
            
        
            
            
                uvnikgupta
                commented
                 1 year ago            
            
                I'm not able to attach my data generator .py file, so the formulae are below:

"A1": "0.0",
"A2": "0.0",
"A3": "0.0",
"A4": "0.0",
"A5": "0.0",
"A6": "0.0",
"A7": "0.0",
"A8": "0.0",
"B1": 'data_2["A1"]**2',
"B2": 'data_2["A1"]',
"C2": 'np.sqrt(np.abs(data_2["B1"]))',
"C3": 'data_2["B1"] * data_2["B2"]',
"D2": 'data_2["C2"]**2 + data_2["C3"] - data_2["A2"]**2',
"C4": 'data_2["B2"]**3',
"D3": 'np.sqrt(np.abs(data_2["C4"]))',
"B3": 'data_2["A4"]**2 + data_2["A5"]',
"C1": 'data_2["B3"]**2',
"D1": 'np.round(np.mod(1000*data_2["C1"], 10), 3)',
"E1": 'np.abs(data_2["A3"])**2/(data_2["D1"] + .001)',
"F1": '2*data_2["D2"] + data_2["D3"] - data_2["E1"]*data_2["A6"] + 8*data_2["A7"]/data_2["A8"]'

I add np.random.normal(loc=5, scale=1, size=self.size) to each of the variables above
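
For readers without the original script, here is a minimal standard-library sketch of how data could be generated from formulae like the ones above. It is not the DataGenerator class mentioned in this thread. It assumes that the operators that appear to have been lost in rendering are `**` (power) and `*` (product), that each variable is its formula value plus an independent N(5, 1) draw, and that children are computed from the noisy parent values. The function names `generate_row` and `generate` are hypothetical.

```python
import math
import random


def generate_row(rng):
    """Return one sample as a dict of the 20 variables.

    Assumption: each variable equals its formula value plus an independent
    N(5, 1) draw, and children are computed from the noisy parent values.
    """
    def noise():
        return rng.gauss(5.0, 1.0)

    v = {}
    # Root variables: formula value 0.0 plus noise.
    for i in range(1, 9):
        v[f"A{i}"] = 0.0 + noise()

    # Children, listed in an order where every parent is already defined.
    v["B1"] = v["A1"] ** 2 + noise()
    v["B2"] = v["A1"] + noise()
    v["B3"] = v["A4"] ** 2 + v["A5"] + noise()
    v["C1"] = v["B3"] ** 2 + noise()
    v["C2"] = math.sqrt(abs(v["B1"])) + noise()
    v["C3"] = v["B1"] * v["B2"] + noise()
    v["C4"] = v["B2"] ** 3 + noise()
    v["D1"] = round((1000 * v["C1"]) % 10, 3) + noise()
    v["D2"] = v["C2"] ** 2 + v["C3"] - v["A2"] ** 2 + noise()
    v["D3"] = math.sqrt(abs(v["C4"])) + noise()
    v["E1"] = abs(v["A3"]) ** 2 / (v["D1"] + 0.001) + noise()
    v["F1"] = (2 * v["D2"] + v["D3"] - v["E1"] * v["A6"]
               + 8 * v["A7"] / v["A8"] + noise())
    return v


def generate(n, seed=0):
    """Generate n independent samples with a reproducible seed."""
    rng = random.Random(seed)
    return [generate_row(rng) for _ in range(n)]


rows = generate(1000, seed=0)
```

Each row can be written out with `csv.DictWriter` to obtain a tabular file that Tetrad can load.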
            

        

            
            
                jdramsey
                commented
                 1 year ago            
            
                They are not terribly Gaussian. By the way @uvnikgupta if you'd like to switch to email I'm happy. @cg09 if you load up the data that was sent in the version of Tetrad given above and use the Plot Matrix tool you can see the distributions of the variables.
            
        
            
            
                uvnikgupta
                commented
                 1 year ago            
            
                
They are not terribly Gaussian. By the way @uvnikgupta if you'd like to switch to email I'm happy. @cg09 if you load up the data that was sent in the version of Tetrad given above and use the Plot Matrix tool you can see the distributions of the variables.

yes, I can share my data generation python code then. Please DM me at 
            
        
            
            
                jdramsey
                commented
                 1 year ago            
            
                That's what I thought--nonlinear algebraic functions generated them...You know we were just thinking of how to incorporate this sort of nonlinear additivity into a fast score... 
            
        
            
            
                cg09
                commented
                 1 year ago            
            
                What sort of "non-linear algebraic" functions?
            
        
            
            
                uvnikgupta
                commented
                 1 year ago            
            
                The formula simplifies to: 2*D2 + D3 - E1*A6 + 8*A7/A8
I hope you received the Python scripts I shared. You can generate data of
any size with that script. Just modify the size parameter in the
instantiations of the DataGenerator class under if __name__ == "__main__",
create a data folder and run "python data_generator.py". Of course you have
to pip install pandas, numpy and scipy.
On Fri, Jul 28, 2023 at 12:31 AM Joseph Ramsey @.***>
wrote:

b1 = a1^2
==> ln(b1) = 2 * ln(a1)
b2 = a1
Singularity, you'll need to remove one of two columns or teach your
algorithm to deal with it. But you can't use regression here in any form.
(This is why the regression check is failing, above, BTW).
c2 = sqrt(abs(b1))
==> Hmmm... you need to check a symmetric function here of b1 to find the
dependency.
c3 = b1 * b2
==>ln(c3) = ln(b1) + ln(b2)
c2^2 + c3 - a2^2
==> Logging won't help here for the entire function! But logging c2 and
logging b2 separately would help if you knew to do that! Hmmm...
c4 = b2^3
==> ln(c4) = 3 * ln(b2)...no problem.
sqrt(|c4|)
==> Another symmetric function.
b3 = a4^2 + a5
==> Logging a4 separately would have helped.
c1 = b3^2
==> Logging solves this.
"D1": 'np.round(np.mod(1000data_2["C1"], 10), 3)',
Not sure how to describe this one in words yet, I'll come back to it.
==> NO HELP HERE! You need to resort to a generalized score I think!!!
Ugh, slow!!!
"E1": 'np.abs(data_2["A3"])**2/(data_2["D1"] + .001)',
abs(a3)^2 / (d1 + 0.001).
==> Heuristically I would still log this :-) 2 * ln(abs(a3)) - ln(d1 + 0.001)
"F1": '2data_2["D2"] + data_2["D3"] - data_2["E1"]data_2["A6"] +
8data_2["A7"]/data_2["A8"]'
2 * d2 + d3 - .... what is that? e1 a6?? + 8 a7 / a8? I have to check what
concatenating variables in Python does... string concatenation?????!
==> I still have no idea what this even means yet, lol!!! :-D
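
The logging idea in the notes above can be illustrated with a small sketch. It takes the power relation c4 = b2^3 and checks that regressing ln(c4) on ln(b2) gives a slope near 3. The positive uniform range for b2 and the multiplicative noise are assumptions made so that the logged model is exactly linear; the data in this thread use additive N(5, 1) noise, so real results would be less clean. The helper `ols_slope` is hypothetical.

```python
import math
import random


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(0)
# Assumed: a strictly positive parent, so that the logarithm is defined.
b2 = [rng.uniform(1.0, 10.0) for _ in range(2000)]
# Assumed: multiplicative noise, which becomes additive after logging.
c4 = [x ** 3 * math.exp(rng.gauss(0.0, 0.1)) for x in b2]

log_slope = ols_slope([math.log(x) for x in b2], [math.log(y) for y in c4])
print(f"slope of ln(c4) on ln(b2): {log_slope:.3f}")
```

After the transform, the exponent shows up as an ordinary regression coefficient, which is the property a linear Gaussian score can exploit.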
            
        
            
            
                jdramsey
                commented
                 1 year ago            
            
                Sorry I haven't gotten back to you--we're all at the UAI conference here in Pittsburgh. I thought about the 1.8 issue and think the thing to do is to publish a separate version compiled under 1.8. I'm going to try to get this done today.
            
        
            
            
                uvnikgupta
                commented
                 1 year ago            
            
                Yes, I was starting to wonder :)
The agenda for the UAI conference sounds really cool. I have never attended
any of its conferences but I can imagine the energy in that environment. I
hope I am able to attend some day.
Coming back to the topic, I already have your working version for 1.8 so I
am not really waiting for an official release. I am now more interested in
figuring out why the algorithms are not performing well and how to tweak
the data or the algorithm parameters to reproduce most of the DAG, if not
fully.
Regards
Uvnik
            
        
            
            
                jdramsey
                commented
                 1 year ago            
            
                Sorry for the delay--we had a couple of dissertation defenses in the last week. Getting back to this.
I need to look at your Python code more carefully to see what assumptions are being honored. It wasn't clear to me on my first gander.
We had made a nonlinear simulator using Gaussian processes (and additive simulation) and GRaSP/BOSS did pretty well on that, but when we looked at the distributions, all of the functions had linear trends. It's been noticed in the past (I can get you a reference) that linear Gaussian scores like LG BIC tend to do OK whenever there are linear trends, and besides this, GRaSP/BOSS tend to do OK under a rather significant weakening of the faithfulness assumption, so some "sins" can be forgiven by the procedure. What I know will give the procedure difficulty are the square and absolute value functions you use, which give dependencies but not because of linear trends. I'm wondering, if you took those out, how well would the algorithms do?
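
To illustrate the point about dependence without a linear trend, here is a small standard-library sketch. It assumes a zero-mean Gaussian parent x and a child y = x^2 + noise, which is a hypothetical setup rather than the generator discussed in this thread. The Pearson correlation between x and y comes out near zero even though y depends strongly on x, while the correlation between x^2 and y is large.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
n = 20000
# Assumed: parent centered at zero, so x**2 is symmetric in x.
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [xi ** 2 + rng.gauss(0.0, 1.0) for xi in x]

r_linear = pearson(x, y)                       # linear trend of y in x
r_square = pearson([xi ** 2 for xi in x], y)   # trend of y in x**2
print(f"corr(x, y) = {r_linear:.3f}, corr(x^2, y) = {r_square:.3f}")
```

Note that if the parent is centered well away from zero, x^2 is nearly monotone over the observed range and a linear trend largely remains, so how much a linear Gaussian score loses depends on where the variables are centered.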
            
        
            
            
                jdramsey
                commented
                 1 year ago            
            
                @uvnikgupta Wondering, have you had a chance to look at this?
            
        
            
            
                cg09
                commented
                 1 year ago            
            
                You are beset with new Tetrad problems.  Sorry.
            
        
            
            
                jdramsey
                commented
                 1 year ago            
            
                Oh, I'm just trying to review outstanding issues and see what needs to be done. This particular issue involves trying to generalize to more algebraic functional forms for larger models, something I'm interested in and thinking of how to do.
            
        
            
            
                jdramsey
                commented
                 1 year ago            
            
                I mean we do have the KCI general independence test, but it won't scale far enough for the problems suggested here. Also, it would be good to have a general score, and we've never implemented Biwei's general score in Tetrad, but Biwei's score won't handle these problems; there are too many variables, and the sample sizes are too large. I've been thinking about scores that are more general than LG but perhaps not completely general, which could handle a variety of distributions (but perhaps not all) and might be fast. I ask everyone I talk to whether they can think of such scores but no takers so far. I agree though it would be nice to have and a contribution to the literature.
            
        
            
            
                cg09
                commented
                 1 year ago            
            
                I am at a conference on ecology and causality. They are all about
identifying unmeasured intermediate variables between input and output, but
have no clue how to do it.  I have data on turtles and soon to have data on
penguins, I think.
Clark
            
        
            
            
                jdramsey
                commented
                 1 year ago            
            
                Interesting....
            
        
            
            
                jdramsey
                commented
                 1 year ago            
            
                @uvnikgupta Sorry, I ended up with so many things to do at the beginning of term that I was losing track of them in my head. Let me write this one down so I can work on it some.
(I made a long to-do list recently and ordered it in terms of priorities. I think this is going to help.)
            
        
            
            
                jdramsey
                commented
                 1 year ago            
            
                @uvnikgupta Let me characterize the problem this way. Is there a test/score that could be used that would recover at least approximately the correct DAG when the data are generated with simple combinations of functions? What combinations can work and which can't?
Is that fair?
            
        
            
            
                jdramsey
                commented
                 1 year ago            
            
                @uvnikgupta Perhaps one of us should look to see if there's any literature on this already.
            
        
            
            
                uvnikgupta
                commented
                 1 year ago            
            
                @jdramsey Sorry, I am not sure I understand your question completely. Are we trying to find a score that would compare a set of equations to the generated DAG? If yes, then I do not understand why. The reason is that if I know the equations, I can already create the original DAG and then use scores like SHD to compare the generated vs. the original graph.
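
For reference, a minimal sketch of a structural Hamming distance (SHD) between a true DAG and an estimated one, each given as a set of (parent, child) pairs. This is one common variant, assumed here for illustration: it counts one for each adjacency present in only one graph and one for each shared adjacency with a different direction. It is not Tetrad's implementation, and it does not handle the undirected edges of a CPDAG. The example edges are illustrative only.

```python
def shd(true_edges, est_edges):
    """Structural Hamming distance between two directed graphs.

    Counts one for every adjacency present in only one graph, plus one
    for every shared adjacency whose direction differs.
    """
    true_edges = set(true_edges)
    est_edges = set(est_edges)
    true_skel = {frozenset(e) for e in true_edges}
    est_skel = {frozenset(e) for e in est_edges}
    adjacency_errors = len(true_skel ^ est_skel)
    shared = true_skel & est_skel
    orientation_errors = sum(
        1 for e in true_edges if frozenset(e) in shared and e not in est_edges
    )
    return adjacency_errors + orientation_errors


true_dag = {("A1", "B1"), ("A1", "B2"), ("B1", "C2")}
estimated = {("A1", "B1"), ("B2", "A1"), ("A3", "C2")}
distance = shd(true_dag, estimated)
# 3: one reversed edge, one missing adjacency, one extra adjacency
print(distance)
```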
            
        
            
            
                jdramsey
                commented
                 1 year ago            
            
                @uvnikgupta That is, does anyone have a strategy for searching a dataset with > 20 variables where the variables are generated by an SEM with the kinds of functions you're using? Also, with the sample sizes you have in mind?
You could use a general test like KCI, but it won't scale that far.
            
        
    
    
            

    
        