jmafoster1 closed this 1 year ago
Sounds like an interesting problem to delve into!
Would you mind sharing the setup you are using that takes a while to run?
It's in a repo called causal-test-adequacy, in case-study-3. If you just run python run_causal_tests.py --json_path causal_tests.json --dag_path dag.dot --data_path data_500_random_age.csv, it'll sit there for about 5 minutes before you see any output at all.
Which version of the CTF are you using with this repo?
Just what's on main right now
Part of the problem is lines 50-70 of data_collector.py, where we're doing all the Z3 stuff. It's very slow for large volumes of data, and it seems to be happening (several times?) for every causal test case. Really it should happen just once, and it could be skipped altogether for scenarios that have no constraints.
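A minimal sketch of the "filter once, skip if unconstrained" idea. It assumes plain Python predicates standing in for the Z3 checks; `ScenarioDataCache`, `valid_rows` and `filter_runs` are hypothetical names for illustration, not anything that exists in the CTF.

```python
from typing import Callable, Dict, List, Optional, Sequence

Row = Dict[str, float]
Constraint = Callable[[Row], bool]  # stand-in for a Z3 constraint check


class ScenarioDataCache:
    """Hypothetical helper: filter rows against constraints at most once."""

    def __init__(self, rows: Sequence[Row], constraints: Sequence[Constraint]):
        self._rows: List[Row] = list(rows)
        self._constraints: List[Constraint] = list(constraints)
        self._filtered: Optional[List[Row]] = None
        self.filter_runs = 0  # how many times the expensive filter actually ran

    def valid_rows(self) -> List[Row]:
        # No constraints: nothing to check, so skip the filtering step entirely.
        if not self._constraints:
            return self._rows
        # First call pays for the filtering; later calls reuse the result.
        if self._filtered is None:
            self.filter_runs += 1
            self._filtered = [
                row for row in self._rows
                if all(check(row) for check in self._constraints)
            ]
        return self._filtered
```

Every causal test case would then ask the cache for `valid_rows()` rather than re-checking the constraints itself, so the cost is paid once per scenario instead of once (or more) per test.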
I've also noticed that causal tests now seem to be taking up my entire CPU, like all 20 threads. I did a quick search for multiprocessing and can't find any in the CTF at all, so I'm confused as to how this comes about...
Is this something that definitely was not happening before? If so, it could be down to an update to NumPy or similar, since certain libraries can use multiple threads without explicit use of the multiprocessing library:
libraries that perform computationally heavy tasks like numpy, scipy and pytorch utilise C-based implementations under the hood, allowing the use of multiple cores.
https://towardsdatascience.com/demystifying-python-multiprocessing-and-multithreading-9b62f9875a27#:~:text=Python%20is%20NOT%20a%20single,the%20use%20of%20multiple%20cores.
I'm not sure how long it's been going on, but it seems like a new thing. I just profiled a run with yappi and it looks like execute_test is being called over 200 times, which is suspicious...
name ncall tsub ttot tavg
..:103 CausalTestEngine.execute_test 201 0.003219 30.99038 0.154181
Ah scratch that. That was my bootstrapping for the test adequacy
Interestingly, though, yappi seems to think there's only one thread in the application, even though it's maxing out all 20 of my CPU cores
I'm not sure how yappi counts the threads. But if the parallelism is happening at a lower level, like in the C implementations of NumPy etc., then I believe there will still only be 1 Python thread object
OK, the problem is with statsmodels. I found this out by using threadpoolctl to limit various areas of the code to a single thread until I worked out where the problem was. Useful link: https://github.com/statsmodels/statsmodels/issues/2914
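For anyone wanting to repeat the bisection, here is a rough sketch of the threadpoolctl approach. It uses a NumPy least-squares fit as a stand-in for the statsmodels estimator call (an assumption for illustration; `fit_least_squares` is a hypothetical helper, not CTF code). Wrapping one region at a time in `threadpool_limits(limits=1)` and watching CPU usage shows which region is fanning out across cores.

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits


def fit_least_squares(X, y):
    """Hypothetical stand-in for an estimator; lstsq calls into LAPACK/BLAS."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
true_coef = np.arange(1.0, 6.0)
y = X @ true_coef  # noise-free, so the fit should recover true_coef

# Inside the context manager, supported native thread pools
# (BLAS, OpenMP) are capped at one thread each.
with threadpool_limits(limits=1):
    coef_single = fit_least_squares(X, y)
    pools_inside = threadpool_info()  # one dict per detected native library

# Outside the block the previous thread limits are restored.
coef_default = fit_least_squares(X, y)

for pool in pools_inside:
    print(pool["user_api"], pool["num_threads"])
```

If CPU usage drops to one core only when a particular region is wrapped, that region is the one spawning the native threads.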
Having run the CTF again on the Poisson process data, there's a tiny CPU spike there too. It's much more noticeable here because the data files are much bigger and the estimators are being called for longer to estimate test adequacy. So the multi-core usage was always there; it's just far more visible now because of what I'm doing.
With the JSON frontend, when you run it with a large amount of data, it seems to just sit there for a while before it starts showing the output of any tests. I think this is in some way related to #206, but the waiting seems to be proportional to the amount of data in the CSV file, so there's probably something else going on. When I run the estimators in a separate Python file with just statsmodels.ols, it runs almost instantly, even with very large amounts of data. @cwild-UoS, please could you take a look to see what's happening if you get the chance?