bd2kccd / causal-cmd


Score for covariance data type #38

Closed rodrigo-borges closed 5 years ago

rodrigo-borges commented 5 years ago

My only option for generating a structure from a .csv of over 40 GB is to calculate its covariance matrix (a few hundred rows by a few hundred columns) and use that as input.
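For reference, the covariance matrix of a file that is too large for memory can be accumulated in a single streaming pass. A minimal sketch using pandas (the function name and chunk size are illustrative, not part of causal-cmd):

```python
import numpy as np
import pandas as pd

def streaming_covariance(csv_path, chunksize=100_000):
    """Accumulate column sums and cross-products chunk by chunk so the
    full CSV never has to fit in memory, then assemble the unbiased
    sample covariance matrix at the end."""
    n = 0
    s = None   # running column sums
    ss = None  # running sum of outer products (X'X)
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        x = chunk.to_numpy(dtype=float)
        n += x.shape[0]
        if s is None:
            s = x.sum(axis=0)
            ss = x.T @ x
        else:
            s += x.sum(axis=0)
            ss += x.T @ x
    mean = s / n
    # unbiased sample covariance: (X'X - n * mean mean') / (n - 1)
    return (ss - n * np.outer(mean, mean)) / (n - 1)
```

The result matches what you would get from computing the covariance of the full matrix in one shot, so peak memory is bounded by the chunk size rather than the file size.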

I have used the sem-bic score so far, but my data is actually discrete and I would like to try some other score functions. I get an error with every option other than sem-bic and sem-bic-deterministic:

Score 'bdeu' is invalid for data-type 'covariance'.

Back when I changed the input type, I confirmed that the structures generated from the .csv and from the covariance matrix are the same. So why does this change prevent me from using some of the scoring functions?

Is there a workaround for this? My problem with sem-bic is that when I further analyse the strength of the edges (using pyAgrum), I find a lot of weak connections, and increasing the penalty reduces the number of both strong and weak edges.

kvb2univpitt commented 5 years ago

@jdramsey I'll tag Joe on this. Maybe he can explain this better.

jdramsey commented 5 years ago

So one issue is that bdeu assumes you have discrete data, and covariance data is for the continuous case. Can you discretize it?
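If the underlying variables really are continuous, an equal-frequency binning is one common way to discretize before scoring with bdeu. A minimal sketch (the function name and bin count are illustrative):

```python
import numpy as np

def discretize_equal_frequency(x, n_bins=3):
    """Replace each continuous column of the 2-D array x with integer
    bin labels 0..n_bins-1, using quantile (equal-frequency) cut points
    so every bin receives roughly the same number of rows."""
    out = np.empty(x.shape, dtype=int)
    for j in range(x.shape[1]):
        # interior quantile edges, e.g. the 1/3 and 2/3 quantiles for 3 bins
        edges = np.quantile(x[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
        out[:, j] = np.searchsorted(edges, x[:, j], side="right")
    return out
```

Equal-frequency bins avoid the empty-category problem that equal-width bins can produce on skewed columns.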

jdramsey commented 5 years ago

Oh wait, are you calculating a covariance matrix from discrete data? Could you just give it the discrete data instead?

rodrigo-borges commented 5 years ago

@jdramsey Because my input file is too big (a .csv of over 40 GB), Java throws an OutOfMemoryError when I try that:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at edu.pitt.dbmi.data.reader.tabular.AbstractDiscreteTabularDataFileReader.extractAndEncodeData(AbstractDiscreteTabularDataFileReader.java:50)
    at edu.pitt.dbmi.data.reader.tabular.VerticalDiscreteTabularDataReader.readInDataFromFile(VerticalDiscreteTabularDataReader.java:47)
    at edu.pitt.dbmi.data.reader.tabular.AbstractTabularDataFileReader.readInData(AbstractTabularDataFileReader.java:52)
    at edu.pitt.dbmi.causal.cmd.tetrad.TetradUtils.getTabularData(TetradUtils.java:293)
    at edu.pitt.dbmi.causal.cmd.tetrad.TetradUtils.getDataModels(TetradUtils.java:197)
    at edu.pitt.dbmi.causal.cmd.tetrad.TetradAlgorithmRunner.runAlgorithm(TetradAlgorithmRunner.java:71)
    at edu.pitt.dbmi.causal.cmd.tetrad.TetradRunner.runTetrad(TetradRunner.java:67)
    at edu.pitt.dbmi.causal.cmd.CausalCmdApplication.main(CausalCmdApplication.java:83)

I started using a covariance matrix because of this issue, where the problem was the same. Is there an alternative solution for discrete data?
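One possible workaround until the file fits in heap (just a preprocessing sketch, not a causal-cmd feature; the function name and `keep_prob` value are illustrative): stream the big CSV once and keep a random subsample of rows small enough for the discrete reader to load.

```python
import random

def sample_csv_rows(src, dst, keep_prob=0.1, seed=0):
    """Stream the large CSV once, copying the header and keeping each
    data row independently with probability keep_prob, so the output
    file is roughly keep_prob times the size of the input."""
    rng = random.Random(seed)
    with open(src) as fin, open(dst, "w") as fout:
        fout.write(next(fin))          # always keep the header row
        for line in fin:
            if rng.random() < keep_prob:
                fout.write(line)
```

With a few hundred columns, even a modest subsample may be enough for the discrete scores, and the sampling is unbiased with respect to row order.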