bd2kccd / causal-cmd

how to set missing-marker? #68

Closed williamty closed 1 year ago

williamty commented 2 years ago

I ran into an error today:

D:\program>java -jar causal-cmd-1.3.0-jar-with-dependencies.jar --algorithm pc-all --data-type mixed --dataset ./w.csv --delimiter comma --numCategories 33 --test dg-lr-test --missing-marker *
Running version 1.3.0 but the latest version is 1.2.0.  To disable checking use the skip-latest option.
edu.pitt.dbmi.causal.cmd.ValidationException
        at edu.pitt.dbmi.causal.cmd.data.DataValidations.validateTabularData(DataValidations.java:121)
        at edu.pitt.dbmi.causal.cmd.data.DataValidations.validate(DataValidations.java:69)
        at edu.pitt.dbmi.causal.cmd.CausalCmdApplication.runTetrad(CausalCmdApplication.java:128)
        at edu.pitt.dbmi.causal.cmd.CausalCmdApplication.main(CausalCmdApplication.java:105)

I couldn't find any documentation for missing-marker, so I checked the source code, but found no clue there either. How do I set the missing marker on the command line?

jdramsey commented 2 years ago

I guess I'd be curious what @kvb2univpitt says on this, but the default is *; can you just search and replace?
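
A minimal sketch of the search-and-replace idea, assuming the file is a plain comma-delimited CSV and that the marker currently in the data (shown here as the hypothetical value `NA`) occupies a whole cell. The function name and parameters are illustrative and are not part of causal-cmd or Tetrad.

```python
import csv
import io


def replace_missing_marker(src, dst, old_marker, new_marker="*", delimiter=","):
    """Copy rows from src to dst, swapping cells equal to old_marker for new_marker.

    src and dst are open text-file objects. Returns the number of cells replaced.
    Only whole-cell matches are replaced, so values that merely contain the
    marker text are left alone.
    """
    reader = csv.reader(src, delimiter=delimiter)
    writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
    replaced = 0
    for row in reader:
        out = []
        for cell in row:
            if cell.strip() == old_marker:
                out.append(new_marker)
                replaced += 1
            else:
                out.append(cell)
        writer.writerow(out)
    return replaced


# Usage with in-memory text; real files would be opened with newline="".
source = io.StringIO("a,b\n1,NA\nNA,2\n")
target = io.StringIO()
count = replace_missing_marker(source, target, "NA")
```

Matching whole cells rather than substrings avoids corrupting legitimate values, which matters when the marker is a short string.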

williamty commented 2 years ago

Yes, I used *. Maybe the problem is that, unlike Tetrad, causal-cmd can't read mixed data with a missing marker, because I found this in the log file: `Line 120, column 56: Non-continuous number .`

jdramsey commented 2 years ago

Interestingly, the Tetrad app has no trouble reading in mixed data with missing values. But the default should be *. I wonder whether it will read the data if you simply don't specify --missing-marker, or if you put the * in single or double quotes?

Also, a caveat: the app has moved to Tetrad version 7.1.0, which has mega bug fixes (all known bugs fixed), but causal-cmd hasn't moved yet. It will soon.

jdramsey commented 2 years ago

Sorry for my ignorance on the inner workings of causal-cmd; it's not a project I've worked on. I've mainly been working on the Tetrad core code and the Tetrad app. But I'm sure we can sort this out. Perhaps my suggestions will work.

williamty commented 2 years ago

Well, I've tried Tetrad 7.1.0, but Tetrad always runs in the foreground, and I can't do anything else while it is running. So I turned to causal-cmd.

I used --missing-marker * with neither single nor double quotes. I think it just can't read in missing data.

jdramsey commented 2 years ago

Interesting. I don't suppose you can separately send me some sample data? I could check it out inside of Tetrad...

williamty commented 2 years ago

I'm sorry. The dataset is secret. :-(

jdramsey commented 2 years ago

Let me make you one then. Can you load this? The discrete variables are all trinary. testdata.txt

williamty commented 2 years ago

I got the same error message.

D:\program>java -jar causal-cmd-1.3.0-jar-with-dependencies.jar --algorithm pc-all --data-type mixed --dataset ./testdata.csv --delimiter comma --numCategories 3 --test dg-lr-test
Running version 1.3.0 but the latest version is 1.2.0.  To disable checking use the skip-latest option.
edu.pitt.dbmi.causal.cmd.ValidationException
        at edu.pitt.dbmi.causal.cmd.data.DataValidations.validateTabularData(DataValidations.java:121)
        at edu.pitt.dbmi.causal.cmd.data.DataValidations.validate(DataValidations.java:69)
        at edu.pitt.dbmi.causal.cmd.CausalCmdApplication.runTetrad(CausalCmdApplication.java:128)
        at edu.pitt.dbmi.causal.cmd.CausalCmdApplication.main(CausalCmdApplication.java:105)

I converted the dataset to CSV format. testdata.csv

jdramsey commented 2 years ago

Well, like I said, for a causal-cmd release based on 7.1.0 we'll have to wait for Kevin. When you say you have to wait while the 7.1.0 app is running the algorithm, do you mean it blocks you from running other apps? It doesn't for me.

jdramsey commented 2 years ago

Also, what are the dimensions of your dataset? I saw you had a variable with 33 categories, which makes me think it's a social dataset...

williamty commented 2 years ago

Yes, Tetrad 7.1.0 blocked me. There are about 60 variables in the dataset, and the discrete variables have at most 31 categories. It's epidemiological data. So I turned back to constraint-based algorithms instead of neural network algorithms.

jdramsey commented 1 year ago

@williamty Sorry, I've been working on another project (Tetrad) for a while now and am only now coming back to some of these causal-cmd issues. 31 categories is a lot for a discrete model; the chi-square tables being built for variables of that dimension are enormous. Are there counts for all of the categories for these variables?
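
One way to answer the question about per-category counts, sketched under the assumption that the data is a comma-delimited CSV with a header row and `*` as the missing marker. The helper names are hypothetical. The second function only multiplies the numbers of levels to show how quickly a contingency table grows; it is not how Tetrad builds its tables.

```python
import csv
import io
from collections import Counter


def category_counts(src, column, missing_marker="*", delimiter=","):
    """Count how often each category appears in one column, skipping missing cells."""
    reader = csv.DictReader(src, delimiter=delimiter)
    counts = Counter()
    for row in reader:
        value = row[column]
        if value != missing_marker:
            counts[value] += 1
    return counts


def contingency_cells(*levels):
    """Number of cells in a contingency table over variables with the given level counts."""
    cells = 1
    for n in levels:
        cells *= n
    return cells


# Two 31-level variables give 31 * 31 cells; conditioning on a third
# 31-level variable multiplies that by 31 again.
pairwise = contingency_cells(31, 31)
conditional = contingency_cells(31, 31, 31)
```

With a few hundred or a few thousand rows, most cells of a table that large will be empty or nearly empty, which is why sparse categories are worth checking before running a discrete test.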

This is a standard problem; I may try to come up with a standard solution to it at some point.

kvb2univpitt commented 1 year ago

Sorry for the extremely late reply. For some reason, I just saw this.

To specify a missing marker in causal-cmd, please use the --missing-marker parameter. For example, --missing-marker * would treat * as the missing marker.
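
In POSIX shells an unquoted `*` is expanded to the file names in the current directory before the program sees it, so the marker may need quoting there. Whether anything similar played a role in the Windows session earlier in this thread is not established. As a sketch that sidesteps shell expansion, the command can be assembled as an argument list in Python; the flags and jar name are copied from the command posted above, and the helper name is hypothetical.

```python
import subprocess


def build_causal_cmd(jar, dataset, missing_marker="*"):
    """Assemble the causal-cmd invocation as a list so no shell interprets the marker."""
    return [
        "java", "-jar", jar,
        "--algorithm", "pc-all",
        "--data-type", "mixed",
        "--dataset", dataset,
        "--delimiter", "comma",
        "--numCategories", "33",
        "--test", "dg-lr-test",
        "--missing-marker", missing_marker,
    ]


def run_causal_cmd(jar, dataset, missing_marker="*"):
    """Run the command without shell=True; the marker is passed as a single argument."""
    return subprocess.run(build_causal_cmd(jar, dataset, missing_marker), check=False)


cmd = build_causal_cmd("causal-cmd-1.3.0-jar-with-dependencies.jar", "./w.csv")
```

Calling `run_causal_cmd` requires Java and the jar to be present; `build_causal_cmd` alone can be used to inspect the arguments first.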

jdramsey commented 1 year ago

Thanks, @kvb2univpitt --Joe

jdramsey commented 1 year ago

Looks like this is a stale issue, so I'm closing it. @williamty, let us know if you're still interested in this, and sorry for not getting back to it; I've been very busy. High-dimensional discrete variables are of interest to me, though; perhaps at some point I could target the problem and come up with a good solution to it.