DanielBok / copulae

Multivariate data modelling with Copulas in Python
https://copulae.readthedocs.io/en/latest/
MIT License

Requirement to Fit Input Data #38

Closed njalex22 closed 3 years ago

njalex22 commented 3 years ago

Hello, based on my understanding of copulas and experience using them in other languages, I believe the input data must be uniformly distributed before it can be fit to a copula. This is specified in definition 1, point 2 here. This appears to be consistent with your example, where you fit a Gaussian copula to data and state that the function will internally convert the data to pseudo observations here. I have a few questions:

  1. What exact method is being used to conduct the probability integral transform to convert the data to pseudo observations, and where can I find this in your code? I assume you are using the empirical CDF, but I wanted to be sure.
  2. If I choose another distribution for the probability integral transform and the input data is already uniformly distributed, will the function further modify the input data in any way?
  3. You appear to still be checking that the input data is uniformly distributed here. Does this check occur after the internal conversion to pseudo observations?

Thank you!

DanielBok commented 3 years ago

Hello,

  1. To the first question, we use ranking. The basic idea is to rank the data from 1 to N and then divide by N. See the code snippets.
  2. If I understand this question correctly, it wouldn't. Suppose you drew data from some distribution and applied the integral transform (converting it to values between 0 and 1). The pseudo-obs of that data will then be roughly the same as the transformed values.
    • You can try the following (pseudo-code here!):
      1. Generate some random variables: rvs = norm.random(n=1000)
      2. Get the CDF (integral transform) of them: cdf = norm.cdf(rvs)
      3. assert cdf ≈ pseudo_obs(rvs) (approximately equal, not exactly equal)
  3. Not all copula classes convert the inputs to pseudo-obs. I believe empirical copulas don't. See Issue 21. The other copulae should convert data to pseudo-obs when they receive it, because it generally doesn't hurt (from experience, most researchers forget to pass in the integral transform as input and just put in the raw data) and doesn't cost much computation (it is done only once).
    • The general principle is that I try my best to choose sane defaults and abstract away certain data transformations so researchers have fewer things to keep in their heads. It's possible to tweak any settings you'd like.
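
For readers who want to try the check from point 2, here is a minimal sketch using only the Python standard library. The `pseudo_obs` helper below is a hypothetical stand-in written for this illustration; it is not the copulae implementation and it does not average ties. It also assumes division by N + 1 rather than N, a common variant that keeps every value strictly inside (0, 1). With NumPy and SciPy, `scipy.stats.rankdata` and `scipy.stats.norm.cdf` could play the same roles.

```python
import random
from statistics import NormalDist


def pseudo_obs(values):
    """Rank-based pseudo-observations (illustrative helper, not copulae's code).

    Ranks the values from 1 to N and divides by N + 1 (an assumption of this
    sketch) so that every result lies strictly between 0 and 1.
    Ties are broken by position rather than averaged.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return [r / (n + 1) for r in ranks]


rng = random.Random(0)  # fixed seed so the run is repeatable
dist = NormalDist(mu=0.0, sigma=1.0)

# 1. Draw a sample from a known distribution.
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# 2. Probability integral transform using the true CDF.
true_u = [dist.cdf(x) for x in sample]

# 3. Rank-based pseudo-observations of the raw sample.
pobs_raw = pseudo_obs(sample)

# The two agree only approximately, so compare with a tolerance.
max_gap = max(abs(a - b) for a, b in zip(true_u, pobs_raw))
print("largest |cdf - pseudo_obs| difference:", max_gap)

# Ranking data that is already uniform: the CDF is monotone, so the ranks
# (and therefore the pseudo-observations) match those of the raw sample.
pobs_of_u = pseudo_obs(true_u)
print("same pseudo-obs from raw and transformed data:", pobs_of_u == pobs_raw)
```

The second comparison bears on question 2: because ranks depend only on ordering, applying any strictly increasing transform first (such as a parametric CDF) leaves the rank-based result unchanged, even though the rank-based values differ slightly from the parametric ones.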

Hope this addresses your questions.

njalex22 commented 3 years ago

Yes, that is helpful, thanks! I may have another question or two as I continue to test this package, but this leaves me in a good spot for now.