Skewed Data Distributions and Homoscedasticity - Githubissues

cdt15 / lingam

Python package for causal discovery based on LiNGAM.

https://sites.google.com/view/sshimizu06/lingam

MIT License


Skewed Data Distributions and Homoscedasticity #81

Open vdemchenko3 opened 1 year ago

vdemchenko3 commented 1 year ago

Hi,

I'm wondering what's the best approach for data that is highly right-skewed. Is it best to take a log transform of it to make it more "normal" or does DirectLiNGAM deal with skewed data? The causal graphs are substantially different if I take the log and then normalise the data compared to only normalising the data and keeping the skewed distribution. I couldn't find the implementations of Hyvarinen & Smith 2013 for skewed data.

Also, my understanding is that LiNGAM is specifically made for non-Gaussian distributions, but I'm a bit confused about how this impacts the adjacency matrix computation using linear regression since from my understanding non-Gaussian distributions violate homoscedasticity.

Any clarity on these two topics would be greatly appreciated!

sshimizu2006 commented 1 year ago

You don't have to take a log transform to make variables more normal. Non-Gaussianity itself does not necessarily violate homoscedasticity (constant variance).
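A minimal sketch of this point, using only the Python standard library: the noise below is strongly right-skewed but drawn independently of x, so the residual variance should be about the same for small and large x. The helper names (`ols_residuals`, `skewness`, `variance_ratio`), the coefficient 2.0, the sample size and the seed are all assumptions made up for illustration; none of this is part of lingam.

```python
import random
import statistics


def ols_residuals(x, y):
    """Residuals of the least-squares fit y ~ a + b * x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def skewness(values):
    """Sample skewness: mean of the cubed standardized values."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


def variance_ratio(x, resid):
    """Residual variance for x at or below its median over that above it."""
    cut = statistics.median(x)
    low = [r for xi, r in zip(x, resid) if xi <= cut]
    high = [r for xi, r in zip(x, resid) if xi > cut]
    return statistics.pvariance(low) / statistics.pvariance(high)


rng = random.Random(0)  # arbitrary seed, for reproducibility only
n = 20_000

# Right-skewed cause and right-skewed, zero-mean noise that is independent of x.
x = [rng.expovariate(1.0) for _ in range(n)]
e = [rng.expovariate(1.0) - 1.0 for _ in range(n)]
y = [2.0 * xi + ei for xi, ei in zip(x, e)]

resid = ols_residuals(x, y)
resid_skew = skewness(resid)      # clearly positive: residuals are non-Gaussian
ratio = variance_ratio(x, resid)  # close to 1: variance does not depend on x

print(f"residual skewness: {resid_skew:.2f}")
print(f"low/high residual variance ratio: {ratio:.2f}")
```

Heteroscedasticity would show up as a variance ratio far from 1. Skewness by itself does not cause that, because it describes the shape of the noise rather than how the noise spread varies with x.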

vdemchenko3 commented 1 year ago

Hi,

Thank you for your reply!

What about scaling the data so that all variables lie in [0, 1]? I've run analyses both with and without scaling and found significantly different DAGs.

sshimizu2006 commented 1 year ago

If you transform your data, the data-generating process will change. That would be the reason you get different results.
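One way to see this, sketched with the Python standard library under made-up assumptions: the data below are generated from a linear model with additive, independent, skewed noise. In the raw variables the residual spread is the same for small and large x. After taking logs of both variables the relationship is no longer linear with constant-variance noise, so a linear model describes the transformed data less well. The helper `split_variance_ratio`, the coefficient 1.5, the distributions and the seed are illustrative choices, not anything taken from lingam.

```python
import math
import random
import statistics


def split_variance_ratio(x, y):
    """Fit y ~ a + b * x by least squares, then return the residual variance
    for x at or below its median divided by the residual variance above it."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    cut = statistics.median(x)
    low = [r for xi, r in zip(x, resid) if xi <= cut]
    high = [r for xi, r in zip(x, resid) if xi > cut]
    return statistics.pvariance(low) / statistics.pvariance(high)


rng = random.Random(1)  # arbitrary seed, for reproducibility only
n = 20_000

# Linear data-generating process in the raw variables:
# y = 1.5 * x + e, with positive, right-skewed x and e (so logs are defined).
x = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
y = [1.5 * xi + rng.expovariate(1.0) for xi in x]

raw_ratio = split_variance_ratio(x, y)

# Same data after a log transform of both variables.
log_x = [math.log(v) for v in x]
log_y = [math.log(v) for v in y]
log_ratio = split_variance_ratio(log_x, log_y)

print(f"raw data: low/high residual variance ratio = {raw_ratio:.2f}")
print(f"log data: low/high residual variance ratio = {log_ratio:.2f}")
```

The raw ratio stays near 1, while the log ratio moves well above 1: the additive noise matters much more, relative to the signal, when x is small. The reverse also holds. If the true process were multiplicative, the log-transformed data would be the ones that fit a linear model.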

vdemchenko3 commented 1 year ago

I see. So is the suggestion not to change the data at all (no minmax scaling, no log transforms) before running causal discovery?

sshimizu2006 commented 1 year ago

Well, my point is that it depends on the class of the data generation process you assume.

vdemchenko3 commented 1 year ago

Could you elaborate a bit on that? I'm mostly working with survey-type data where respondents answer various questions.

sshimizu2006 commented 1 year ago

Ok, well, my suggestion is that you can do log transforms if you find that previous works in your field do that, but it would be better not to do minmax scaling.

vdemchenko3 commented 1 year ago

Why is it better not to do minmax scaling?

sshimizu2006 commented 1 year ago

I don't have a strong reason. I just don't often see minmax scaling used in the context of causal discovery. The point is that if you apply some transformation and then run LiNGAM, for example, you are assuming a linear non-Gaussian model for the transformed data. You need to think about whether that assumption is valid.
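To make that assumption concrete for the log transform discussed earlier, here is a sketch in the usual LiNGAM notation. The symbols $b_{ij}$, $e_i$ and $z_i$ are introduced here for illustration and do not appear in the thread; the $e_i$ are taken to be independent and non-Gaussian, and $b_{ij} = 0$ whenever $x_j$ is not a direct cause of $x_i$.

```latex
\begin{align*}
  % LiNGAM on the raw variables assumes an additive linear model:
  x_i &= \sum_{j \neq i} b_{ij}\, x_j + e_i \\
  % LiNGAM on z_i = \log x_i assumes the same form for z_i:
  z_i &= \sum_{j \neq i} b_{ij}\, z_j + e_i \\
  % which, written in the original variables, is multiplicative:
  x_i &= \exp(e_i) \prod_{j \neq i} x_j^{\,b_{ij}}
\end{align*}
```

So running LiNGAM after a log transform amounts to assuming that causes combine multiplicatively, with multiplicative noise, in the original variables. Whether that or the additive form is more plausible is the validity question to answer for the data at hand.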