fernandoPalluzzi / SEMgraph

Causal Structure Learning and Network Analysis with Structural Equation Modeling.
GNU General Public License v3.0

About parallelization #22

Open aldosc opened 1 year ago

aldosc commented 1 year ago

Hi, I've started some network analysis using SEMgraph, and I wanted to know how parallelization works. Is it already enabled by default when running a function (e.g., SEMace)? Is there a parameter for setting the number of cores?

Thanks!

fernandoPalluzzi commented 1 year ago

Hi, SEMgraph does not currently support parallelization for all its functions. Sometimes, heuristics are a reasonable and faster solution. I'll briefly explain both cases.

Generally, inference functions, such as SEMrun and SEMace, use fitting heuristics (e.g., RICF or Gaussian Graphical Modeling). Bootstrap is often used in combination with heuristics to generate robust estimates for standard errors. Heuristics are generally activated automatically beyond a certain network size, which can be controlled with the 'limit' argument (see the SEMrun documentation). Bootstrap is enabled or disabled through either the 'n_rep' (see SEMrun, SEMgsa) or the 'boot' (SEMace) argument.
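For readers unfamiliar with the bootstrap idea mentioned above, here is a minimal, generic sketch in Python of how resampling yields a standard error for an estimate. It is not SEMgraph's implementation: the functions `bootstrap_se` and `ols_slope`, the toy data, and the reuse of the name `n_rep` for the number of resamples are all assumptions made for illustration.

```python
import random
import statistics


def ols_slope(x, y):
    """Least-squares slope of y on x (a stand-in for any fitted coefficient)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def bootstrap_se(x, y, estimator, n_rep=500, seed=0):
    """Standard deviation of the estimator over n_rep resamples of the paired data."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_rep):
        # Resample observation indices with replacement, keeping (x, y) pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(estimator([x[i] for i in idx], [y[i] for i in idx]))
    return statistics.stdev(estimates)


# Hypothetical toy data: y depends linearly on x (true slope 0.5) plus noise.
data_rng = random.Random(42)
x = [data_rng.gauss(0, 1) for _ in range(200)]
y = [0.5 * a + data_rng.gauss(0, 1) for a in x]

slope = ols_slope(x, y)
se = bootstrap_se(x, y, ols_slope)
print(f"slope = {slope:.3f}, bootstrap SE = {se:.3f}")
```

Each resample is independent of the others, which is why the cost grows linearly with the number of replicates.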

Parallelization settings are instead used for learning-related functions (SEMbap, modelSearch, and weightGraph). For the time being, the user can only determine whether parallelization is enabled, using the 'limit' argument. By default, all the available cores will be used. In my experience, this has never caused memory consumption or system-related issues.

Hope this helps! Just contact me again if you need more info.

Best,

Fernando

aldosc commented 1 year ago

Hi Fernando,

Thank you for your quick reply. I understand what you mentioned in your message. I asked because I'm currently running SEMace and wondering whether it can be parallelized. Based on my network and experimental data, it seems it will take some time to complete the task (see attachment). That's how it looks after 1.5 hours.

Best,

Aldo

[Attachment: Screenshot_20220901-130418_Outlook]

fernandoPalluzzi commented 1 year ago

Hi Aldo,

I see... this is a huge network! SEMace finds over 1M directed paths. Unfortunately, the only solution I see for the time being is to reduce your initial network. If your aim is to find relevant (perturbed) source-sink paths, I suggest weighting your network and extracting a Steiner tree (or any minimum-cost tree) from it. You should then be able to apply SEMace without major problems. Another alternative would be clustering.
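To illustrate why the number of directed paths grows so quickly with network size, here is a minimal Python sketch that counts source-to-sink paths in a directed acyclic graph by dynamic programming over a topological order. It is a generic illustration, not how SEMace enumerates paths: `count_paths`, `layered_dag`, and the toy layered graph are hypothetical names and data introduced for this example.

```python
from collections import defaultdict, deque


def count_paths(edges, sources, sinks):
    """Count directed paths from any source to any sink in a DAG."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))

    # paths_to[v] = number of directed paths that start at a source and end at v.
    paths_to = {v: (1 if v in sources else 0) for v in nodes}

    # Kahn's algorithm: process each node only after all of its parents.
    queue = deque(v for v in nodes if indeg[v] == 0)
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        for v in succ[u]:
            paths_to[v] += paths_to[u]
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if visited != len(nodes):
        raise ValueError("graph contains a directed cycle")
    return sum(paths_to.get(v, 0) for v in sinks)


def layered_dag(width, depth):
    """Toy DAG: source 's', `depth` fully connected layers of `width` nodes, sink 't'."""
    edges = [("s", (0, j)) for j in range(width)]
    for layer in range(depth - 1):
        edges += [((layer, i), (layer + 1, j))
                  for i in range(width) for j in range(width)]
    edges += [((depth - 1, j), "t") for j in range(width)]
    return edges


# 42 nodes and 152 edges, yet width ** depth source-sink paths.
n_paths = count_paths(layered_dag(4, 10), {"s"}, {"t"})
print(n_paths)  # prints 1048576
```

Counting is cheap because it never enumerates the paths, whereas estimating an effect for each path is not. A tree has at most one path between any two nodes, which is why reducing the network to a tree collapses the number of paths to evaluate.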

If you want a fully unbiased analysis (starting from your input network), you definitely need parallelization, but I do not have a solution right away. I need some time to find one and test it. I'll get back to you as soon as possible.