bd2kccd / causal-cmd


Issue with fges algorithm #73

Open Zarmas opened 2 years ago

Zarmas commented 2 years ago

Hello,

I am trying to conduct a causal inference study, and I am using the following command:

java -Xmx128G -jar causal-cmd-1.4.1-SNAPSHOT-jar-with-dependencies.jar --data-type continuous --delimiter tab --parallelized --json-graph --algorithm fges --dataset dataset.txt --score sem-bic-score --penaltyDiscount 10.0 --maxDegree 20

on a 14-thread system with 128 GB of RAM. The full dataset contains approximately 20,000 continuous variables and 3,000 samples.

The process gets killed after running for approximately 14 minutes. Even with smaller samples the process gets killed. From the top command it seems that memory is close to 100% before it stops running. In the log file, the last entry is the following:

Initializing effect edges: 21000

I tried dividing the dataset, and the smallest data size it could run was 2000 variables and 100 samples. In the paper accompanying the causal-cmd executable (Ramsey et al. 2017), it is stated that it can be used for a million variables and more, but it seems impossible to use it on more than 2000 variables on an above-average system. What can I do in order to use it for the dataset I want, with the system I currently have?

Thank you in advance, George

kvb2univpitt commented 2 years ago

Hello. Can you please provide the log file? What is the error message?

Zarmas commented 2 years ago

There is no error message, the command line just prints "Killed" along with the command I used.

On the causal-cmd.log, the messages stop at the following:

2022-08-03 09:17:31 INFO edu.pitt.dbmi.causal.cmd.data.DataValidations:166 - Start data validation on file dataset.txt.
2022-08-03 09:17:50 INFO edu.pitt.dbmi.causal.cmd.data.DataValidations:166 - End data validation on file dataset.txt.
2022-08-03 09:17:50 INFO edu.pitt.dbmi.causal.cmd.data.DataValidations:109 - There are 3000 cases and 20000 variables.
2022-08-03 09:17:50 INFO edu.pitt.dbmi.causal.cmd.data.DataFiles:166 - Start reading in file dataset.txt.
2022-08-03 09:18:10 INFO edu.pitt.dbmi.causal.cmd.data.DataFiles:166 - Finished reading in file dataset.txt.
2022-08-03 09:18:10 INFO edu.pitt.dbmi.causal.cmd.data.DataFiles:166 - File dataset.txt contains 3000 cases, 20000 variables.

And in the fges_*.txt, the last messages are the following:

Start search: Wed, August 03, 2022 09:18:11 AM
Initializing effect edges: 1000
Initializing effect edges: 2000
Initializing effect edges: 3000
Initializing effect edges: 4000
Initializing effect edges: 5000

jdramsey commented 2 years ago

Is 'verbose' on? I.e., --verbose? No, I don't see it; I think it's off by default. It's possible the algorithm is running at this point but not printing anything?

jdramsey commented 2 years ago

Also, I discovered some issues in my previous pipeline for doing large models, which I need to think about.

@kvb2univpitt Is --faithfulnessAssumed false by default? It would help if it were false here. (I think that's the flag).

kvb2univpitt commented 2 years ago

@jdramsey Sorry, I missed your question. Yes, faithfulnessAssumed is set to false by default.

jdramsey commented 2 years ago

@kvb2univpitt @Zarmas Hmmm...... OK I need to think more. BTW the laptop I'm currently using is just 16 GB so it's a little hard for me to comment well, but I'm hopefully getting a 64 GB laptop soon and I'll try it when I get it. Also there's another different sort of algorithm I just thought of yesterday that I'd like to try on the larger laptop that would definitely benefit from parallelization. It's based on this paper that we just got published in UAI 2022:

https://arxiv.org/abs/2206.05421

though it does assume that you're interested in a particular variable (or list of variables). Would that happen to be true in your case, @Zarmas ? Or did you need causal structure for all 20,000 variables?

Zarmas commented 2 years ago

I need the causal structure for all the variables, or to run it individually for all 20,000 variables, if that can run fast.

jdramsey commented 2 years ago

Interestingly, just to note, I recently went back to look at an old algorithm of mine from years ago called MBFS; I just renamed it to PC-MB for clarity. It's very fast for finding the Markov blanket, or the causes and effects, of a variable. In fact, if the model is not too dense it's very competitive.

If you don't mind my being curious, what sort of application do you have where you actually need to know the causal structure over all 20,000 nodes? I can't currently think of one, which is why I've been going back to thinking about these partial algorithms the last few days. But maybe you can convince me that it's worthwhile finding the whole causal structure, so that I will be motivated to work on the problem for you.

jdramsey commented 2 years ago

(I ask this despite having spent a lot of time trying to find the causal structure over many nodes like that.)

jdramsey commented 2 years ago

Actually, strategies for doing 20,000 genes individually in parallel are occurring to me, but if I'm going to do this work for you, you should send me emails so we can work out the details. I'm thinking that if you have a big machine it might be doable, using a couple of strategies.

jdramsey commented 1 year ago

@Zarmas I take it this is a stale issue, so I'll close it. If you're still interested, you can re-open the issue.

jdramsey commented 1 year ago

@Zarmas Another comment--what I realized was that this was a problem specifically for running parallelized FGES in a context where more than one copy of FGES was also being run in parallel--something we hadn't previously tried to do. But it came up again recently, so I looked back at this issue and realized it was the same problem. :-D

Zarmas commented 1 year ago

Hello again, I tried using the fges algorithm in the new version, on a dataset that previously ran in version 1.4.1 with no problems. The command I used was the following:

java -Xmx24G -jar causal-cmd-1.7.0-SNAPSHOT-jar-with-dependencies.jar --data-type continuous --delimiter tab --parallelized --json-graph --algorithm fges --dataset dataset.txt --score sem-bic-score --penaltyDiscount 10.0 --maxDegree 20

And this time I get a new error. I wonder why a dataset that was small enough to be processed in the old version can no longer be supported. The error message I got was the following:

Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:172)
at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:538)
at java.base/java.lang.StringBuffer.append(StringBuffer.java:317)
at java.base/java.io.StringWriter.write(StringWriter.java:106)
at com.google.gson.stream.JsonWriter.newline(JsonWriter.java:654)
at com.google.gson.stream.JsonWriter.beforeName(JsonWriter.java:669)
at com.google.gson.stream.JsonWriter.writeDeferredName(JsonWriter.java:401)
at com.google.gson.stream.JsonWriter.value(JsonWriter.java:417)
at com.google.gson.internal.bind.TypeAdapters$EnumTypeAdapter.write(TypeAdapters.java:916)
at com.google.gson.internal.bind.TypeAdapters$EnumTypeAdapter.write(TypeAdapters.java:859)
at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:70)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.write(ReflectiveTypeAdapterFactory.java:196)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.write(ReflectiveTypeAdapterFactory.java:368)
at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:70)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.write(ReflectiveTypeAdapterFactory.java:196)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.write(ReflectiveTypeAdapterFactory.java:368)
at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:70)
at com.google.gson.internal.bind.CollectionTypeAdapterFactory$Adapter.write(CollectionTypeAdapterFactory.java:97)
at com.google.gson.internal.bind.CollectionTypeAdapterFactory$Adapter.write(CollectionTypeAdapterFactory.java:61)
at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:70)
at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.write(MapTypeAdapterFactory.java:207)
at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.write(MapTypeAdapterFactory.java:144)
at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:70)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.write(ReflectiveTypeAdapterFactory.java:196)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.write(ReflectiveTypeAdapterFactory.java:368)
at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:70)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.write(ReflectiveTypeAdapterFactory.java:196)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.write(ReflectiveTypeAdapterFactory.java:368)
at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:70)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.write(ReflectiveTypeAdapterFactory.java:196)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.write(ReflectiveTypeAdapterFactory.java:368)

kvb2univpitt commented 1 year ago

The out-of-memory issue is caused by outputting the graph in JSON format. See issue https://github.com/bd2kccd/causal-cmd/issues/89.
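
For example, your earlier command with the --json-graph flag simply removed (all other flags unchanged) avoids writing the JSON output:

java -Xmx24G -jar causal-cmd-1.7.0-SNAPSHOT-jar-with-dependencies.jar --data-type continuous --delimiter tab --parallelized --algorithm fges --dataset dataset.txt --score sem-bic-score --penaltyDiscount 10.0 --maxDegree 20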

Zarmas commented 1 year ago

I am still having the original problem of not being able to run because of RAM limitations, even without the json graph.

jdramsey commented 1 year ago

Can you give me the dimensions of your problem again? I'm seriously not seeing any problems like this on my end, though maybe I've got the dimensions wrong. Also how much RAM are you using? I know there was a problem with the JSON graphs (that we're looking into).

jdramsey commented 1 year ago

I've got 64 GB of memory on my machine--is that more than you're using?

jdramsey commented 1 year ago

Also, is your problem continuous? Discrete? Mixed?

Zarmas commented 1 year ago

It is a continuous problem, and I ran it on a 128 GB RAM machine. The dimensions are approximately 20,000 variables and 3,000 samples. You can see details in the original message of this issue.

jdramsey commented 1 year ago

That's what I thought. OK, I'll try it again later.

jdramsey commented 1 year ago

OK, I just added a large FGES simulation module to our (new) py-tetrad project and got it up to 5000 variables by 3000 samples, average degree of 2. I guess I'll try to keep increasing the size of the problem incrementally to see how large I can make it.

BTW when you say you're doing 20,000 variables, I have gotta know, are you trying to run a genome? (Maybe we talked about this before; many thoughts have gone through my mind since then.) If so, could you limit the variables given to FGES by identifying some target variables and removing all nodes not unconditionally dependent on any of those target variables?
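
For instance, something like this rough pandas sketch is the kind of pre-filtering I mean (plain pandas, not py-tetrad; the target names and the 0.1 cutoff are just placeholders, and marginal |correlation| stands in for a proper unconditional-dependence test):

import pandas as pd

# Hypothetical sketch: drop every column that shows no marginal dependence on
# any of a few target variables, using |correlation| as a crude screen.
df = pd.read_csv("dataset.txt", sep="\t")
targets = ["GENE_A", "GENE_B"]                 # placeholder target names
keep = pd.Series(False, index=df.columns)
for t in targets:
    keep |= df.corrwith(df[t]).abs() > 0.1     # marginal |corr| with this target
df.loc[:, keep].to_csv("dataset_reduced.txt", sep="\t", index=False)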

Here's my simulation:

https://github.com/cmu-phil/py-tetrad/blob/main/examples/run_large_fges.py

Result:

Statistics:

numMeasures = Extra column for numMeasures
sampleSize = Extra column for sampleSize
AP = Adjacency Precision
AR = Adjacency Recall
AHP = Arrowhead precision
AHR = Arrowhead recall
E-CPU = Elapsed CPU Time in Seconds

Graphs are being compared to the true DAG.
All statistics are individually summarized over 1 runs using the indicated statistic.

AVERAGE VALUE

All edges

  Alg  numMeasures  sampleSize    AP    AR   AHP   AHR   E-CPU
    1      5000.00     3000.00  0.87  0.92  0.82  0.69  220.2

This is on MacOS Ventura, M1-Max chip.

jdramsey commented 1 year ago

Parallelized, the search speeds up a little, but not much:

  Alg  numMeasures  sampleSize    AP    AR   AHP   AHR   E-CPU
    1      5000.00     3000.00  0.87  0.91  0.82  0.70  171.55

It's interesting that you were unable to go beyond 2000 variables with sample size 100:

Tried dividing the set and the smallest data size it could run was 2000 variables and 100 samples.

Now my interest is piqued... why can I do bigger datasets than you can? I haven't scaled to 10,000 or 20,000 variables yet, but I've already gone further than you could... why? Any ideas? I don't currently have an Ubuntu machine to try this on, but could you tell me, when you type

java --version

what does it print out?

By the way, it's taking far longer for my code to simulate a large sample like this than to actually run FGES on it. I used to parallelize the simulation code; I think I took that out after the million variable paper, but maybe I can put that back in.

Hold it, another question. Do you have any idea how dense your model is? I'm assuming very sparse in the above, overall.

jdramsey commented 1 year ago

OK, here's what I get for 10000 variables, 3000 samples, average degree 2:

Algorithms:

1. FGES (Fast Greedy Equivalence Search) using Sem BIC Score

Statistics:

numMeasures = Extra column for numMeasures
sampleSize = Extra column for sampleSize
AP = Adjacency Precision
AR = Adjacency Recall
AHP = Arrowhead precision
AHR = Arrowhead recall
E-CPU = Elapsed CPU Time in Seconds

Graphs are being compared to the true DAG.
All statistics are individually summarized over 1 runs using the indicated statistic.

AVERAGE VALUE

All edges

  Alg  numMeasures  sampleSize    AP    AR   AHP   AHR   E-CPU
    1     10000.00     3000.00  0.77  0.92  0.72  0.75  755.67

This used this Python JPype module: https://github.com/cmu-phil/py-tetrad/blob/main/examples/run_large_fges.py

(The contents of that may change--I may go ahead and try 20,000 variables overnight...though I've got to get back to work at the moment.)

In any case, I hope this gives you some ideas. I think it's either a platform issue you're up against, or else your model is just very dense. I think FGES should be able to run on the problem given enough memory and time.

Oh--it could also be a data loading issue--I didn't load data here but simulated it.
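
If it helps, here's a quick way to sanity-check the load outside of causal-cmd (assuming pandas is available; the path is just your dataset file):

import pandas as pd

# Confirm the dimensions of the file and roughly how much memory the raw
# matrix occupies once loaded; also count missing values, which causal-cmd
# would have to handle.
df = pd.read_csv("dataset.txt", sep="\t")
print(df.shape)                                      # expect roughly (3000, 20000)
print(df.memory_usage(deep=True).sum() / 1e9, "GB")  # in-memory size of the frame
print(df.isna().sum().sum(), "missing values")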

Here's the JVM I was using for this:

openjdk version "11.0.9.1" 2020-11-04 LTS
OpenJDK Runtime Environment Zulu11.43+55-CA (build 11.0.9.1+1-LTS)
OpenJDK 64-Bit Server VM Zulu11.43+55-CA (build 11.0.9.1+1-LTS, mixed mode)

Zarmas commented 1 year ago

They are the same Java versions I used in the fges-mb issue: 15.0.2, 18.0.2, and 11.0.18. The model is dense; did you set a low max degree when running the algorithm?

jdramsey commented 1 year ago

I did. Check the code.

jdramsey commented 1 year ago

Feel free to play with that script by the way.

jdramsey commented 1 year ago

By the way, "average degree 2" simply means that the total number of edges in the graph is about equal to the total number of nodes (the average degree is 2E/V, where E is the number of edges and V the number of nodes). Locally, it could be quite dense.

jdramsey commented 1 year ago

Which is why I'm quite curious what kind of data you're analyzing.

jdramsey commented 1 year ago

Here's another run I did with m = 10000, N = 3000, avg degree = 2, penalty discount = 8. Good precision.

AVERAGE VALUE

All edges

  Alg  numMeasures  sampleSize    AP    AR   AHP   AHR   E-CPU
    1     10000.00     3000.00  1.00  0.83  0.99  0.55  988.25

jdramsey commented 1 year ago

I find that FGES does really well in the large, sparse case. For dense cases, precision can fall. We have other algorithms that do really well in the dense regime (GRaSP, BOSS, etc.) but don't scale as well. Actually, we will have a version of BOSS that we've scaled accurately to 1000 nodes with an average degree of 10, which is not bad; that will come out in our next version, I think. But in your case, to use it, you'd have to do some variable selection; m = 20000 is too much. I don't know of any other algorithms that scale that well in the dense regime and are correct for linear, Gaussian data; perhaps for some specialized types of data you might know of some.

Zarmas commented 1 year ago

It seems to run better with a low max degree, but for 2000 variables and 500 samples it seems to have run for too long (about 4 hours), and checking the machine, for most of the time only one CPU was being used, so I am not sure it was running parallelized. It seems that the parallelized option has been removed and the default is to not use parallelization. This was the command I used:

java -Xmx24G -jar causal-cmd-1.7.0-SNAPSHOT-jar-with-dependencies.jar --data-type continuous --default --delimiter tab --algorithm fges --dataset data2000x500.txt --score sem-bic-score --penaltyDiscount 10.0 --maxDegree 3

I am wondering if I did something wrong in the command and it didn't run parallelized.

jdramsey commented 1 year ago

With an average degree of 6, m = 2000, n = 500, my script returns in a CPU elapsed time of 28 seconds. I am trying to figure out what's going on with your machine or with your dataset--you'll have to fiddle with the script I wrote and see if you can make it take that long. Unfortunately, I can't help with that part of the project; all I've got is my simulated data and my machine.

I could simulate some data of that size and try causal-cmd on it. I've got some things I need to do, and I need to run to campus for a faculty meeting, so that may have to wait until tomorrow.

Statistics:

numMeasures = Extra column for numMeasures
sampleSize = Extra column for sampleSize
AP = Adjacency Precision
AR = Adjacency Recall
AHP = Arrowhead precision
AHR = Arrowhead recall
E-CPU = Elapsed CPU Time in Seconds

Graphs are being compared to the true DAG.
All statistics are individually summarized over 1 runs using the indicated statistic.

AVERAGE VALUE

All edges

  Alg  numMeasures  sampleSize    AP    AR   AHP   AHR  E-CPU
    1      2000.00      500.00  1.00  0.53  0.99  0.47  28.21

jdramsey commented 1 year ago

Actually, here, I did it real quick. This ran in 31 seconds on my machine. Maybe it's because I used the --faithfulnessAssumed flag, which is fine for large, sparse models. (Sparse for 2000 variables can still mean a fairly high average degree, since density = avg degree / (# vars - 1); an average degree of 6 over 2000 variables is a density of only 6/1999, about 0.003.)

java -Xmx24G -jar causal-cmd-1.6.1-jar-with-dependencies.jar --data-type continuous --default --delimiter comma --algorithm fges --dataset mydata.csv --score sem-bic-score --penaltyDiscount 10.0 --verbose --faithfulnessAssumed

This dataset was 2000 x 500, avg degree 6. I generated it using this script, with those parameter settings:

https://github.com/cmu-phil/py-tetrad/blob/main/examples/simulating_data.py

jdramsey commented 1 year ago

Yeah, it was the --faithfulnessAssumed parameter that did it, I'm sure.

jdramsey commented 1 year ago

Sorry, I got caught up with several projects. I wanted to comment on parallelization for FGES. AFAIK, there are only two places where you can take advantage of parallelization. One is in the initial step, where you determine the "effect graph"--i.e., which single edges, out of all possible single edges, are advantageous to add to the graph. That step is highly parallelizable and should engage all processors. Then a very serial part of the algorithm proceeds, adding edges one at a time. This is serial in that each edge addition requires the previous one to have been added. Each individual addition is parallelizable in that you choose the best edge from all remaining candidates, and that choice can be parallelized. But the overall algorithm can only be made partially parallel. This is a limitation of the algorithm.
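
Here's a little structural illustration of what I mean (a toy sketch with made-up scores, not Tetrad's actual code): the effect-edges pass is parallel over all variable pairs, while the forward phase is a serial loop in which only the scoring of the remaining candidates can be farmed out to threads.

from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
import random

random.seed(0)
variables = range(30)                        # hypothetical toy problem
pairs = list(combinations(variables, 2))

def score_delta(edge):                       # stand-in for a BIC score improvement
    return random.random() - 0.5

with ThreadPoolExecutor() as pool:
    # Step 1: fully parallel initialization of the effect edges.
    effect_edges = [e for e, d in zip(pairs, pool.map(score_delta, pairs)) if d > 0]

    # Step 2: serial forward phase -- each addition depends on the graph so far,
    # so only the per-step search for the best edge is parallelized.
    graph = set()
    for _ in range(20):                      # add at most 20 edges in this toy
        candidates = [e for e in effect_edges if e not in graph]
        scored = list(zip(pool.map(score_delta, candidates), candidates))
        best_delta, best_edge = max(scored)
        if best_delta <= 0:
            break
        graph.add(best_edge)

print(len(graph), "edges added")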

jdramsey commented 1 year ago

We've been concentrating on scaling lately to denser graphs; we've only recently started to think again about scaling to graphs with many variables. Ideally, you'd want algorithms that can do both at once--we are thinking about that. Here's a paper we published for denser graphs:

Lam, W. Y., Andrews, B., & Ramsey, J. (2022, August). Greedy relaxations of the sparsest permutation algorithm. In Uncertainty in Artificial Intelligence (pp. 1052-1062). PMLR.

We're working on another algorithm like GRaSP that scales to a larger number of variables, but we still need to scale it to 2000 for a dense graph. It's a hard problem; I don't know of any other algorithm in the literature that even tries to do that. For sparse graphs, FGES is a good option; in that sense, we've just not been interested in making better algorithms to scale to large numbers of variables for sparse graphs; there is already a good algorithm for that. Anyway, it's something we're thinking about.

jdramsey commented 1 year ago

@Zarmas OK, I finally took the time today to run an example for FGES with 20,000 variables. I used py-tetrad with this script:

https://github.com/cmu-phil/py-tetrad/blob/main/pytetrad/algcomparison_large_fges.py

with the larger parameter settings. This used 20,000 variables with an average degree of 6, N = 500; you can vary the parameters to suit. The simulation took 15 minutes; the algorithm took 51 minutes to run, with this result:

Alg  numMeasures  sample-size    AP    AR   AHP   AHR    E-CPU
  1     20000.00      500.00  1.00  0.53  0.99  0.46  3067.78

AP is adjacency precision; AHP is arrowhead precision; these are essentially perfect, so there's no false information in the resulting graph. The recall could be a lot higher, for two possible reasons. First, the sample size is just 500; if the sample size were greater than that, the recall would come up. Also, the range for coefficient values was (-1, 1) with no interval taken out around zero; weak coefficients could make it challenging to find edges.
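
As a rough illustration of the second point (plain numpy, not py-tetrad; the coefficient values are arbitrary): at N = 500, a single linear edge whose coefficient is close to zero is nearly invisible to a correlation test, which is the kind of edge the recall is losing.

import numpy as np

# The Fisher-z statistic for a single linear edge scales with the coefficient,
# so edges with coefficients near zero are easy to miss at this sample size.
rng = np.random.default_rng(0)
n = 500
for coef in (0.05, 0.3, 0.8):                    # arbitrary edge strengths
    x = rng.normal(size=n)
    y = coef * x + rng.normal(size=n)
    r = np.corrcoef(x, y)[0, 1]
    z = np.sqrt(n - 3) * np.arctanh(r)           # Fisher z statistic for corr = 0
    print(f"coef = {coef:.2f}   sample corr = {r:.3f}   z = {z:.1f}")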

So it can be done. I did set the Faithfulness parameter to True, as I suggested earlier.

Zarmas commented 1 year ago

I tried using py-tetrad on a small dataset with a Python script similar to the one you shared, but without the comparison functions; it looks like this:

df = pd.read_csv("resources/dataset.txt", sep="\t")
data = tr.pandas_to_tetrad(df)

score = ts.SemBicScore(data)

score.setPenaltyDiscount(8)
score.setStructurePrior(0)

fges_graph = search.fges(score)
print('FGES', fges_graph)

I get results without a problem, but I can't figure out how to set parameters like parallelized or maxDegree without the parameters module, the way I set the penalty discount with score.setPenaltyDiscount; I couldn't find other parameter-setting functions when I looked into it further.