bd2kccd / py-causal


Is it parallel? #74

Open marchezinixd opened 6 years ago

marchezinixd commented 6 years ago

I'm running it on a really large dataset and checking resource usage. Apparently it is using only one core. Is it possible to make it use all cores so it runs faster?

chirayukong commented 6 years ago

It should be. Which algorithm are you running? I'll take a look.

marchezinixd commented 6 years ago

The parameters I used were: FGES, Sem-Bic, Sem-Bic, Penalty: 100

I have 4 cores, and it is using 100% of one but none of the others. The dataset has 105 features and 2.6 million rows. Memory is fine: it is using 14 GB and I have 32 GB in total.

jdramsey commented 6 years ago

Just one note: the part of FGES that parallelizes the best is the initial (usually most time-consuming) part. After that, there is a period where the parallelization isn't quite as good. You might for sanity's sake check to see if you're using more than one core when you first call the process.
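A rough way to do that sanity check from Python is to compare CPU time with wall-clock time around the call. This is only a sketch: `cpu_to_wall_ratio` is a hypothetical helper, not part of py-causal, and it assumes the JVM runs inside the Python process so that its threads are counted in the process CPU time. A ratio near 1 suggests one busy core; a ratio well above 1 suggests several cores were busy.

```python
import time


def cpu_to_wall_ratio(fn, *args, **kwargs):
    """Run fn and return (result, cpu_seconds / wall_seconds).

    time.process_time() sums CPU time over all threads of this process,
    so a ratio well above 1.0 means more than one core was busy.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args, **kwargs)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    ratio = cpu_used / wall_used if wall_used > 0 else 0.0
    return result, ratio


if __name__ == "__main__":
    # Stand-in workload; replace with the actual search call being measured.
    _, ratio = cpu_to_wall_ratio(sum, range(5_000_000))
    print(f"CPU/wall ratio: {ratio:.2f}")
```

Wrapping only the first part of the search separately from the rest would show whether the initial phase uses more cores than the later phase.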

Joe

marchezinixd commented 6 years ago

Well, I checked it. I reduced the penalty to 25 and ran it again. The attached image shows how it behaves. Basically, there was a peak of a few seconds that used all cores. The second graph shows that it uses just one core at a time, but cycles through all the cores in sequence. performance

chirayukong commented 6 years ago

Try it with the causal-cmd CLI; its distribution is attached. Run it with java -Xmx14G -jar causal-cmd-0.4.0-SNAPSHOT-jar-with-dependencies.jar --algorithm fges --data-type <discrete|continuous> --delimiter <comma|tab> --dataset <your_dataset> --score sem-bic --test sem-bic --penaltyDiscount 100 --json-graph. More about causal-cmd. causal-cmd-0.4.0-SNAPSHOT-distribution.zip

marchezinixd commented 6 years ago

Hello @chirayukong, sorry for taking so long to answer; I was having trouble with the dataset and with handling it at full size. I did the test: with the command line it ran in ~5 minutes and behaved as shown in the attached image. With Python it took ~1 hour and behaved the same as in the previous images. Apparently the new jar performs better and parallelizes more than the Python one. Is it possible to update pycausal?

screenshot from 2018-09-25 16-45-24 CMD test

chirayukong commented 6 years ago

The jar file is updated. Please try it. @marchezinixd

marchezinixd commented 6 years ago

The beginning was a little different, but it still follows the same old pattern. While the jar ran in 4 minutes, the Python version has been running for 20 minutes and does not seem close to finishing. Apparently it is a Python problem; maybe the way it handles parallelism? python

chirayukong commented 6 years ago

Maybe it's a problem in the javabridge library, which I don't know how to fix. You can run it with causal-cmd and load the JSON result back into Python.
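That workaround could look roughly like the sketch below. The flags are the ones from the command earlier in this thread; the helper names are made up for illustration, and the location and layout of the JSON file that causal-cmd writes are assumptions, so the output path is passed in explicitly and the result is returned as a plain dict without interpreting it.

```python
import json
import subprocess


def build_causal_cmd(jar_path, dataset_path, data_type="continuous",
                     delimiter="comma", penalty_discount=100, max_heap="14G"):
    """Build the causal-cmd command line using the flags quoted in this thread."""
    return [
        "java", f"-Xmx{max_heap}", "-jar", jar_path,
        "--algorithm", "fges",
        "--data-type", data_type,
        "--delimiter", delimiter,
        "--dataset", dataset_path,
        "--score", "sem-bic",
        "--test", "sem-bic",
        "--penaltyDiscount", str(penalty_discount),
        "--json-graph",
    ]


def run_causal_cmd(command):
    """Run causal-cmd as a separate process; raises if it exits with an error."""
    subprocess.run(command, check=True)


def load_json_graph(json_path):
    """Load the JSON graph file written by causal-cmd.

    Assumption: json_path points at the file produced by --json-graph;
    check the causal-cmd output directory for the real file name.
    """
    with open(json_path, encoding="utf-8") as handle:
        return json.load(handle)
```

Because the search then runs in its own JVM process, the Python side only pays for starting the process and parsing the result.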

chirayukong commented 6 years ago

This is the latest one. causal-cmd-0.4.0-SNAPSHOT-distribution.zip

marchezinixd commented 6 years ago

Well, I'll do that for now. I'll leave the issue open in case you have any ideas on how to solve the Python problem. Thank you.