byoungman / evgam

evgam: An R package for generalised additive extreme value models
GNU General Public License v3.0

Fitting 'evgam' model with 'pp' #1

Open lbelzile opened 1 year ago

lbelzile commented 1 year ago

Formula 5 in the JSS paper shows that the inhomogeneous Poisson point process likelihood coded in evgam is (essentially) that of the r-largest order statistics, but the documentation suggests we could use it for pure threshold exceedances (taking a threshold $u$ rather than an order statistic, and keeping a single block of $r = n_u$ observations, where $n_u$ is the number of values that exceed the threshold).
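For reference, the stationary version of this likelihood (as in Coles, 2001, Ch. 7; my notation, which may differ slightly from the JSS paper's) is

$$L(\mu, \sigma, \xi) = \exp\left\{-n_y\left[1 + \xi\,\frac{u-\mu}{\sigma}\right]_+^{-1/\xi}\right\} \prod_{i\,:\,y_i > u} \frac{1}{\sigma}\left[1 + \xi\,\frac{y_i-\mu}{\sigma}\right]_+^{-1/\xi-1},$$

where $n_y$ is the number of periods (e.g. years); this is essentially the r-largest form once $u$ is taken to be the $(r+1)$-th largest value within each block.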

I am confused by the documentation.

Arguments for the point process model are given by pp.args. An integer r specifies the number of order statistics from which the model will be estimated. If r = -1, all data will be used. The character string id specifies the variable in data over which the point process isn't integrated; e.g. if a map of parameter estimates related to extremes over time is sought, integration isn't over locations. The scalar nper specifies the number of data per period of interest; scalar or integer vector ny specifies the number of periods; if length(ny) > 1 then names(ny) must be supplied and must match every unique id. The logical correctny specifies whether ny is corrected to adjust proportionally for data missingness.

In the code of the internal function .setup.pp.data, there is a subsetting step, ds <- split(data, data[, pp$id]), which fails if the list pp.args does not contain id (which makes no sense when there are no covariates) and can return nonsense if id is a numeric or integer vector (since it is then possible to subset columns that don't exist).

The code also returns errors with threshold=0 if r is not provided, stating that there is no data...

For what it's worth, I can mimic the inhomogeneous Poisson point process likelihood by creating a data set consisting only of exceedances, with r = -1 and a dummy id taking a single value.
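Concretely, the workaround I'm using looks something like this (a sketch only: the data frame df, covariate x, and the choices of threshold and ny are hypothetical, and I'm going by the pp.args documentation quoted above):

```r
library(evgam)

## df is a hypothetical data frame with response y and covariate x
u <- quantile(df$y, 0.95)            # threshold
exc <- df[df$y > u, , drop = FALSE]  # keep the n_u exceedances only
exc$id <- 1                          # dummy id taking a single value

## GAM formulae for location, log-scale and shape; ny = number of periods
fit <- evgam(list(y ~ s(x), ~ s(x), ~ 1), data = exc, family = "pp",
             pp.args = list(id = "id", r = -1, ny = 50))
```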

lbelzile commented 1 year ago

The function also returns a single observation in the data component of the returned list, even with r = -1 (though it does use all of the data for fitting...)

lbelzile commented 1 year ago

@byoungman Is there a chance we could get GAM modelling for the inhomogeneous Poisson point process in a future version of evgam? This is oftentimes a more stable parametrization than the generalized Pareto model.

byoungman commented 1 year ago

Hi. I tried to respond to this a few weeks ago by replying to an email. Did it arrive? If not, I'll respond here.

lbelzile commented 1 year ago

I'm happy to look into this. I found it tricky to work out how to robustly incorporate the integral in the inhomogeneous Poisson point process likelihood in a user-friendly way. Did you have anything in mind? I am also the author of the ppgam package, by the way, which fits inhomogeneous Poisson point processes; it is perhaps one approach to handling the integral, and could be transferred to evgam.

Hmm, it seems I had overlooked the question of the intensity measure in the nonstationary case (nobody ever mentions this challenge when recommending the inhomogeneous Poisson process formulation...). Casson and Coles (1999) circumvented this altogether using a conditional independence assumption, but that just means they ignored the problem...

What exactly is the domain we have to integrate over? Without covariates, the IPP intensity is simply a function of $y$, but now this must be GEV for $y \mid \boldsymbol{x}$, so we need to approximate the integral over both $\boldsymbol{x}$ and $y$.
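To spell out the integral: with GEV parameter functions $\mu(\boldsymbol{x})$, $\sigma(\boldsymbol{x})$, $\xi(\boldsymbol{x})$, threshold $u$ and covariate domain $\mathcal{X}$, the integrated intensity is

$$\Lambda = \int_{\mathcal{X}} \int_u^{\infty} \frac{1}{\sigma(\boldsymbol{x})}\left[1 + \xi(\boldsymbol{x})\,\frac{y - \mu(\boldsymbol{x})}{\sigma(\boldsymbol{x})}\right]_+^{-1/\xi(\boldsymbol{x})-1} \mathrm{d}y\,\mathrm{d}\boldsymbol{x} = \int_{\mathcal{X}} \left[1 + \xi(\boldsymbol{x})\,\frac{u - \mu(\boldsymbol{x})}{\sigma(\boldsymbol{x})}\right]_+^{-1/\xi(\boldsymbol{x})} \mathrm{d}\boldsymbol{x},$$

so the inner integral over $y$ is available in closed form and only the outer integral over $\boldsymbol{x}$ needs numerical treatment.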

The case I have in mind is mostly for models with covariates, although space-time extensions could also be considered.

From what I gather from Section 3.4.1 of your Environmetrics paper, ppgam uses quadrature methods. I wonder how this scales when $\boldsymbol{x}$ is high-dimensional but there are only a few exceedances. It would, however, be an option, with the advantage that it would probably be more straightforward to implement. One could always check the numerical accuracy in simple examples.

My understanding is that the integral needs to be over all the covariate dimensions, which is approximated by quadrature in ppgam. You're right, though, that this typically won't scale well to high dimensions. However, that's partly dependent on how smooth the estimates are expected to be, as this determines how many quadrature nodes are needed to give a good approximation.
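As a toy illustration of a midpoint-rule quadrature for that outer integral in one covariate dimension (all parameter functions and values below are made up, standing in for fitted smooths):

```r
## midpoint-rule approximation of
##   Lambda = int_0^1 [1 + xi(x) (u - mu(x)) / sigma(x)]_+^(-1/xi(x)) dx
mu    <- function(x) 10 + 2 * x          # hypothetical location smooth
sigma <- function(x) exp(0.5 + 0.1 * x)  # hypothetical scale smooth
xi    <- function(x) rep(0.1, length(x)) # hypothetical (constant) shape
u  <- 15                                 # threshold
xg <- seq(0.005, 0.995, by = 0.01)       # midpoints of 100 cells on [0, 1]
w  <- 0.01                               # cell width
Lambda <- sum(w * pmax(1 + xi(xg) * (u - mu(xg)) / sigma(xg), 0)^(-1 / xi(xg)))
```

More quadrature nodes (or sparse grids) would be needed as the dimension of $\boldsymbol{x}$ grows, which is where the scaling concern bites.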

Do you have a model in mind that you'd like to fit? Perhaps if we can find a way to get that working fairly nicely, then some useful generalisations might become apparent. (I could perhaps even borrow some code from ppgam to save some time.)

Cheers,

Ben.

Categorical variables are easiest to handle: we only need their frequencies, and we get the same intensity function conditional on them. How about starting with a simple model with one binary categorical and one continuous covariate? We could check the accuracy by Monte Carlo simulation.
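In that setup the integrated intensity decomposes over the factor levels, weighted by their frequencies, so only the continuous covariate needs quadrature. A made-up sketch (all functions and values hypothetical):

```r
## integrated intensity with one binary factor z and one continuous x:
##   Lambda = sum_k n_k * int [1 + xi (u - mu(x, z_k)) / sigma]_+^(-1/xi) dx
mu <- function(x, z) 10 + 2 * x + 1.5 * z  # hypothetical location surface
sigma <- 2; xi <- 0.1; u <- 15             # hypothetical scale/shape/threshold
nz <- c(60, 40)                            # frequencies of levels z = 0, 1
xg <- seq(0.005, 0.995, by = 0.01); w <- 0.01
Lambda <- sum(sapply(0:1, function(z)
  nz[z + 1] * sum(w * pmax(1 + xi * (u - mu(xg, z)) / sigma, 0)^(-1 / xi))))
```

Monte Carlo simulation from a known model could then check both the quadrature accuracy and the fitted smooths.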