JanMarcoRuizdeVargas / clustercausal

The Repository supporting my Master's Thesis at TUM.
GNU Affero General Public License v3.0

A question about CDAG #2

creamiracle opened 1 year ago · Status: Open

creamiracle commented 1 year ago

Hey there, I have read your awesome work and have a question about using it on real data.

Do I need to specify the relations between the clusters before discovery? I mean, if I have a dataset with several columns, for example 30,000 rows × 50 columns, and I can separate the columns into several kinds, such as position columns, ID columns, etc., do I then need to figure out the relations between the kinds (clusters) before I can get a discovery result (a DAG) from the data? Am I correct?

Thanks.

JanMarcoRuizdeVargas commented 1 year ago

Hey there, I appreciate your interest!

Yes, the idea is to specify a C-DAG, where each vertex is a cluster containing several variables. You only specify relationships between clusters, while making no assumptions about relationships within them. To be precise: you forbid edges between disconnected clusters, and if you connect C1 -> C2, you forbid edges of the kind v1 <- v2 where v1 is in C1 and v2 is in C2. If you would like me to elaborate, please let me know.
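To illustrate the constraint logic (a minimal plain-Python sketch with hypothetical cluster and variable names, not my package's API):

```python
# Hypothetical clusters and one cluster edge C1 -> C2; C3 is disconnected.
clusters = {
    "C1": ["x1", "x2"],
    "C2": ["y1", "y2"],
    "C3": ["z1"],
}
cluster_edges = {("C1", "C2")}

# Derive the forbidden variable-level edges from the C-DAG rules above.
forbidden = set()
for a in clusters:
    for b in clusters:
        if a == b:
            continue  # no assumptions within a cluster
        if (a, b) in cluster_edges:
            # C_a -> C_b forbids v_a <- v_b (edges pointing back into C_a)
            forbidden |= {(vb, va) for va in clusters[a] for vb in clusters[b]}
        elif (b, a) not in cluster_edges:
            # disconnected clusters: no edges between them at all
            forbidden |= {(va, vb) for va in clusters[a] for vb in clusters[b]}

print(sorted(forbidden))
```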

May I ask what kind of data you are interested in trying my method on? I am very curious about it, and it might give me some ideas for the thesis, as well as adaptations one could make to the C-DAG approach to make it more convenient for practical applications.

Best regards, Jan Marco

creamiracle commented 12 months ago

Hey, thanks for replying.

I'm working on some real delivery data with columns such as location, price, delivery time, etc. There are 120 columns, of which 24 are discrete and the rest continuous. My goal is to discover the relations between these columns in order to understand why the prediction results are poor when I use the same features in machine-learning models. I have tried CMU's Tetrad and some other packages, but the results are still not good enough. Do you have any advice for getting a better result?

To introduce the problem again: there is a 30000 × 120 dataset that is used for a machine-learning prediction model, but the predictions are poor. I want to use causal discovery to find the relations between the features and the target, and then try to explain why the model performs badly. The most important part is the causal discovery.

Thanks : )


JanMarcoRuizdeVargas commented 11 months ago

That sounds like an interesting, albeit challenging, problem! While I don't think I or my package can solve your problem completely, let me give you some ideas and my two cents. :)

If you are concerned with pure prediction accuracy, I am not sure a causal graph of your variables would help you much: prediction accuracy can be improved in both the causal and the anticausal direction (see "Elements of Causal Inference", Peters, Janzing & Schölkopf, chapter on machine learning and causality). Consider looking closely at what your model does, changing the model type, or doing hyperparameter tuning with validation data.
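For example, a minimal tuning sketch (scikit-learn with synthetic stand-in data; the model type and the parameter values are placeholders, not a recommendation for your data):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for your tabular data.
X, y = make_regression(n_samples=2000, n_features=30, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Compare a few configurations on held-out validation data.
for max_depth in (3, 5, 10, None):
    model = RandomForestRegressor(max_depth=max_depth, random_state=0)
    model.fit(X_train, y_train)
    print(f"max_depth={max_depth}: validation R^2 = {model.score(X_val, y_val):.3f}")
```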

Your problem also sounds like a feature-selection problem where you want to eliminate columns that are not useful; take a look at feature selection: https://en.wikipedia.org/wiki/Feature_selection . In addition, it sounds like you are working with a tabular dataset. If you are currently using a deep neural network, switching to gradient-boosted trees such as XGBoost could help, as they are state of the art on tabular data (https://arxiv.org/abs/2110.01889).
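A rough sketch of both ideas combined (synthetic stand-in data whose shape mimics your 30000 × 120 setting; the number of selected features and the model parameters are placeholders):

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 30000 x 120 delivery dataset.
X, y = make_regression(n_samples=30000, n_features=120, n_informative=15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Feature selection: keep the 30 columns with highest estimated mutual information.
selector = SelectKBest(mutual_info_regression, k=30).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_val_sel = selector.transform(X_val)

# Gradient-boosted trees as a strong tabular baseline.
model = xgb.XGBRegressor(n_estimators=300, max_depth=5, learning_rate=0.1)
model.fit(X_train_sel, y_train)
print("validation R^2:", model.score(X_val_sel, y_val))
```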

Related to that, if you want explainability of why your model predicts what it predicts, for tabular data you could use LIME (https://homes.cs.washington.edu/~marcotcr/blog/lime/), which can partially uncover what your black-box model is doing. Other explainable-AI (XAI) techniques are also worth considering; Shapley values come to mind.
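For instance, with the shap package (this snippet assumes an already-fitted tree model and validation data, like `model` and `X_val_sel` from the XGBoost sketch above):

```python
import shap

# TreeExplainer computes Shapley values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val_sel)  # per-row, per-feature attributions

# Global summary of which features drive the predictions.
shap.summary_plot(shap_values, X_val_sel)
```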

In general, causal discovery is a hard task, especially when you are working with 120 variables: to my knowledge, no current method can handle such dimensionality with good accuracy (especially when the true SCM is not sparse, you suffer greatly from the superexponential search space). Side note: most methods need either purely discrete or purely continuous variables; if you have mixed data, consider discretizing the continuous variables or dropping/recoding the discrete ones. As a good introduction to causal discovery I recommend this paper: https://arxiv.org/abs/2303.15027 . It could guide you towards a workable solution.
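If you do try it, here is a small sketch with causal-learn (synthetic stand-in data for a 10-variable subset; I discretize everything into quantile bins so a single discrete independence test applies, which is one of the two options mentioned above):

```python
import numpy as np
from causallearn.search.ConstraintBased.PC import pc
from sklearn.preprocessing import KBinsDiscretizer

# Stand-in for a small subset of your columns.
rng = np.random.default_rng(0)
data = rng.normal(size=(30000, 10))

# Discretize continuous columns into 5 quantile-based levels.
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
data_disc = disc.fit_transform(data).astype(int)

# PC with a chi-squared independence test for discrete data.
cg = pc(data_disc, alpha=0.05, indep_test="chisq")
print(cg.G)  # estimated causal graph
```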

My package could help in the following way: if you know a priori that your data groups into clusters and you can accurately specify the relationships between those clusters, you can use a cluster causal discovery algorithm. For example, group all location, price, and delivery-speed variables into one cluster each, and specify a C-DAG like this: location -> price, location -> delivery, but no relationship between price and delivery. Then run Cluster PC.
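Purely as a hypothetical sketch of that specification (all variable names below are placeholders, and the constructor/function names are illustrative, not my actual interface, which is still changing; check the repository for the real API):

```python
# Placeholder cluster specification for the delivery example above.
clusters = {
    "location": ["lat", "lon", "region"],
    "price": ["base_price", "discount"],
    "delivery": ["delivery_time", "delay"],
}
cluster_edges = [("location", "price"), ("location", "delivery")]
# No edge between "price" and "delivery": all edges between their variables are forbidden.

# Hypothetical calls -- the real clustercausal API may look different:
# cdag = ClusterDAG(clusters, cluster_edges)
# result = cluster_pc(cdag, data)
```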

Be aware, though, that my algorithms are highly experimental, not extensively tested or verified, and still changing as I update my code. Currently only Cluster PC works, and it has the drawback of not allowing latent confounding between variables, which is probably present in your case. Additionally, it might take very (very) long: 120 variables is a lot, and if the true graph is not sparse you are in trouble (you will see this if the algorithm does not delete edges quickly). The causal discovery tools I know of in Python are causal-learn, gCastle, and the Causal Discovery Toolbox.

If I were you, I would begin with explainability and model selection. I am not confident that causal discovery on 120 variables will provide useful insights; the field as a whole is not quite there yet, unfortunately. :/

If you do decide to use my package, I would be VERY interested in the results. I know your data and problem are probably confidential, but I would still be interested in whether it helps and where problems may lie.

Let me know what you think; I hope this helps.

Best, Jan Marco