ahmedmalaa / Symbolic-Metamodeling

Codebase for "Demystifying Black-box Models with Symbolic Metamodels", NeurIPS 2019.

Improving the performance of the code through parallelization or use of GPU #2

Closed tomvars closed 3 years ago

tomvars commented 4 years ago

Hi Ahmed,

Thank you for making your code open-source and simple to use. We're using this to explain some black-box models and it works as you would expect. The problem is that the code runs on a single CPU thread, and takes a very long time to run. Is there any existing approach to getting faster training? Are there any plans to improve performance?

Best, Tom

ahmedmalaa commented 4 years ago

Hi tomvars,

Could you please let me know the dimensionality of your feature space?

There are plans for a more efficient implementation, but I am afraid no updates are scheduled soon as I am busy with other projects. However, I may be able to help you speed up your experiment if you provide more details on the data dimensions.

Thanks.

tomvars commented 4 years ago

Hi Ahmed,

At the moment we're using 27 features. How does the algorithm's complexity grow with feature size K?

I should also mention this is for a COVID-19 research project.

Best, Tom

ahmedmalaa commented 4 years ago

Hi Tom,

I see. 27 features is a lot: the algorithm searches for interaction terms between subsets of features, so the number of terms grows as (num_features choose 2) just to cover pairwise interactions. With 27 features, that means the final expression will have 27 choose 2 = 351 G functions, which is why it is slow. The alternative Kolmogorov representation also grows exponentially with the number of features. I think our approach is most useful when you have only a few features that interact in a complex way, so that you can get a precise mathematical equation describing their interaction.
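
As a quick check of the arithmetic above, the following sketch counts the pairwise interaction terms for a few feature counts. The helper name `num_pairwise_terms` is made up for illustration and is not part of the Symbolic-Metamodeling codebase; it only evaluates the binomial coefficient.

```python
from math import comb


def num_pairwise_terms(num_features: int) -> int:
    """Number of unordered feature pairs, i.e. (num_features choose 2)."""
    return comb(num_features, 2)


# Compare a small, a moderate, and the 27-feature setting from this thread.
for k in (5, 10, 27):
    print(k, num_pairwise_terms(k))
# prints:
# 5 10
# 10 45
# 27 351
```

Dropping from 27 to 10 features cuts the pairwise count from 351 to 45, which is the reason a reduced feature set is so much cheaper.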

To proceed with your analysis, I suggest you first do feature selection so that only relevant features go into your model. If there are fewer than 10 relevant features, the algorithm should run in a reasonable amount of time. In my experience, most medical data sets are dominated by only a few strong predictors, so feature selection as a pre-processing step may give you a thinner data set.
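
One possible way to do such a pre-processing step is sketched below. It is only an illustration under stated assumptions: the ranking criterion (absolute Pearson correlation between each feature and the black-box output) is a simple stand-in chosen here, not a method recommended in this thread, and it can miss features that matter only through interactions. The function names and the synthetic data are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def top_k_features(X, y, k):
    """Indices of the k columns of X with the largest |correlation| with y."""
    num_features = len(X[0])
    scores = []
    for j in range(num_features):
        column = [row[j] for row in X]
        scores.append((abs(pearson(column, y)), j))
    scores.sort(reverse=True)
    return sorted(j for _, j in scores[:k])


# Synthetic stand-in for a 27-feature data set in which only
# features 0, 5 and 12 drive the (hypothetical) black-box output y.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(27)] for _ in range(500)]
y = [3.0 * r[0] - 2.5 * r[5] + 2.0 * r[12] + rng.gauss(0, 0.1) for r in X]

selected = top_k_features(X, y, k=3)
X_reduced = [[row[j] for j in selected] for row in X]  # thinner data set
print(selected, len(X_reduced[0]))
```

The reduced matrix `X_reduced` would then be what gets passed on to the metamodeling step instead of the full 27-column data.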

Thanks for working on COVID-19! Wish you good luck with your analysis!