dmey / synthia

📈 🐍 Multidimensional synthetic data generation with Copula and fPCA models in Python
https://dmey.github.io/synthia
MIT License

Review: Copula distribution usage and examples #24

Closed mnarayan closed 3 years ago

mnarayan commented 3 years ago

Your package offers support for simulating vine copulas. However, I don't see examples demonstrating how to simulate data from a vine copula given desired conditional dependency requirements.

Is this possible with the current API? If not, how would I use the vine copula generator to achieve this?

Otherwise, can examples show the difference between simulating Gaussian and vine copulas? I only see examples for the Gaussian copula.

tnagler commented 3 years ago

Hey @mnarayan: I think having an example with vines in the docs is a good idea.

I'm not entirely sure what exactly you mean by "conditional dependency requirements". Most likely it's not possible in this library, because it's mainly intended as an off-the-shelf solution and intentionally automates away some of the nitty-gritty. If you need more fine-grained control, consider using pyvinecopulib directly (and don't hesitate to contact me or @tvatter if you need help).

mnarayan commented 3 years ago

I see, yeah an example for vines would be great.

One important application for multivariate simulations that comes to mind is vine copula graphical models. I was thinking this could be used analogously to the Gaussian case, where one can easily simulate multivariate data by enforcing a desired sparsity/non-zero pattern in the inverse covariance or its Cholesky decomposition. But if I understand you correctly, this would only be possible with pyvinecopulib, not with this function?
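For reference, the Gaussian case mentioned above can be sketched in plain NumPy. This is only a minimal illustration and uses neither Synthia nor pyvinecopulib; the helper name `sample_from_precision` and the tridiagonal example matrix are assumptions made up for this sketch.

```python
import numpy as np


def sample_from_precision(precision, n_samples, seed=0):
    """Draw zero-mean Gaussian samples whose inverse covariance is `precision`.

    Hypothetical helper for illustration; not part of any library.
    """
    rng = np.random.default_rng(seed)
    # Factor the precision matrix as L @ L.T with L lower triangular.
    chol = np.linalg.cholesky(precision)
    z = rng.standard_normal((precision.shape[0], n_samples))
    # Solving L.T @ x = z gives cov(x) = inv(L.T) @ inv(L) = inv(precision).
    x = np.linalg.solve(chol.T, z)
    return x.T  # shape (n_samples, n_variables)


# Tridiagonal precision: given all other variables, variable i depends
# only on its neighbours i-1 and i+1 (zeros encode conditional independence).
d = 4
precision = 2.0 * np.eye(d) - 0.8 * (np.eye(d, k=1) + np.eye(d, k=-1))

samples = sample_from_precision(precision, n_samples=50_000)

# The empirical inverse covariance should approximately recover the pattern.
estimated_precision = np.linalg.inv(np.cov(samples, rowvar=False))
```

The zeros in `precision` are what encode the desired conditional independencies here; the question is whether a comparable level of control is available for vine copulas.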

Is this toolbox then primarily designed to simulate data that mimics the distribution of a provided dataset, as opposed to being a toolbox for simulating multivariate copula distributions in general? If so, I think this could perhaps be made clearer in the statement of use.

dmey commented 3 years ago

@mnarayan I have updated the examples and moved the multivariate copula example to a different section (please see https://dmey.github.io/synthia/examples/multivariate-vine.html). Does this make things clearer? This is a very simple example, but I would be interested in adding more examples in the near future. Would you have an open dataset in mind that you were looking to explore with copulas? It'd be good to showcase different examples from different fields...

With regard to your other question: yes, the sole purpose of Synthia is to generate data that mimics the distribution of the observations. I have updated the paper, but please let me know if you still find this confusing in the paper or the repo and I will fix it.

mnarayan commented 3 years ago

Thanks, the vine example looks good. I don't have an open dataset off the top of my head, but feel free to reach out to me separately at manjari@alumni.rice.edu if that is something you would like to explore in the future.

> yes the sole purpose of Synthia is to generate data that mimics the distribution of the observations.

I couldn't glean this from the first paragraph of the JOSS paper. https://github.com/openjournals/joss-papers/blob/joss.02863/joss.02863/10.21105.joss.02863.pdf

In the fields I have worked in, "synthetic" data is often synonymous with simulated/artificial data of any kind, not data that specifically mimics the distributional properties of a given dataset. So even if it feels super obvious/redundant to you, that is worth spelling out. Say that copula distributions can mimic both first-order and complex second-order and higher-order dependencies between variables, and are thus very useful for preserving/mimicking properties of a real dataset. This will be useful for any field where the actual continuous dataset cannot be made available, or where one wants to conduct a realistic but hypothetical power analysis, and so forth.
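To make the "mimics a given dataset" idea concrete, here is a rough sketch of what a Gaussian copula generator does in principle. It is written in plain NumPy and is not Synthia's API or implementation; the function names `fit_gaussian_copula` and `generate`, and the toy data, are assumptions for illustration only.

```python
import numpy as np
from statistics import NormalDist

_norm = NormalDist()  # standard normal, used for its cdf and inv_cdf


def fit_gaussian_copula(data):
    """Estimate the copula correlation matrix from normal scores of the ranks."""
    n = data.shape[0]
    # Column-wise ranks 1..n, rescaled to pseudo-observations in (0, 1).
    ranks = data.argsort(axis=0).argsort(axis=0) + 1
    u = ranks / (n + 1)
    z = np.vectorize(_norm.inv_cdf)(u)
    return np.corrcoef(z, rowvar=False)


def generate(data, corr, n_samples, seed=0):
    """Sample from the fitted copula and map back to the empirical marginals."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n_samples)
    u = np.vectorize(_norm.cdf)(z)
    # Empirical quantiles restore each variable's original marginal shape.
    columns = [np.quantile(data[:, j], u[:, j]) for j in range(data.shape[1])]
    return np.column_stack(columns)


# Toy "observed" dataset with non-Gaussian marginals and known dependence.
rng = np.random.default_rng(42)
latent = rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], size=2000)
observed = np.column_stack([np.exp(latent[:, 0]), latent[:, 1] ** 3])

corr = fit_gaussian_copula(observed)
synthetic = generate(observed, corr, n_samples=2000)
```

The synthetic sample reuses the observed marginals and the fitted dependence structure, which is different from simulating from a copula whose parameters are specified by hand. A Gaussian copula only captures dependence that a correlation matrix can describe; vine copulas are more flexible in that respect.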

dmey commented 3 years ago

Thanks, I see. I have rephrased the first sentence and defined what we mean by synthetic data to avoid confusion (please see https://github.com/openjournals/joss-papers/blob/joss.02863/joss.02863/10.21105.joss.02863.pdf). Would that work for you? Also, are you happy to close #23?