GotelliLab / EcoSimR

Repository for EcoSimR, by Gotelli, N.J., Hart, E.M. and A.M. Ellison. 2014. EcoSimR 0.1.0
http://ecosimr.org

Degenerate matrix error handling #34

Closed emhart closed 9 years ago

emhart commented 9 years ago

Hi @ngotelli, you mentioned that we need to check whether matrices are singular. Do you mean that we need to make sure each simulated matrix is not singular, or that the user's input matrix isn't? I was thinking of using this: http://www.inside-r.org/packages/cran/matrixcalc/docs/is.singular.matrix. However, I do worry that checking every simulated matrix will have a computational cost. We'll have to see. Is there anything else we want to check?

ngotelli commented 9 years ago

Hi @emhart

The input and simulated matrices will almost never be square, so we can't use this function to test for singularity of a matrix. Certainly we should screen for input matrices that have empty rows and/or columns (= degenerate matrices), although I can imagine some kinds of simulations in which that might be desirable.
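
As a rough illustration of the screening described above, here is a minimal Python sketch (not EcoSimR's R code) that flags empty rows and columns in a 0/1 matrix of any shape. The function names `empty_rows_and_cols` and `is_degenerate` are hypothetical, and the matrix is assumed to be a plain list of equal-length rows.

```python
def empty_rows_and_cols(matrix):
    """Return (empty_row_indices, empty_col_indices) for a 0/1 matrix.

    The matrix is a list of equal-length rows; it does not need to be square.
    """
    empty_rows = [i for i, row in enumerate(matrix) if not any(row)]
    n_cols = len(matrix[0]) if matrix else 0
    empty_cols = [j for j in range(n_cols)
                  if not any(row[j] for row in matrix)]
    return empty_rows, empty_cols


def is_degenerate(matrix):
    """True if the matrix has at least one empty row or empty column."""
    rows, cols = empty_rows_and_cols(matrix)
    return bool(rows or cols)


example = [
    [1, 0, 1],
    [0, 0, 0],  # empty row
    [1, 0, 0],
]
print(empty_rows_and_cols(example))  # ([1], [1])
print(is_degenerate(example))        # True
```

Unlike a singularity test, this check makes no assumption about the matrix being square.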

For co-occurrence, all of the algorithms except sim9 can throw a random degenerate matrix. For sparse matrices, and for algorithms like sim10, which use vectors of weights for marginal probabilities, this could be the most common result. These degenerate matrices are a problem for the C-score because if a species never occurs, the calculated C-score for every pair that includes it is 0, which implies high overlap. So I toss those out before calculating a C-score on a matrix.
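
To make the zero concrete: a small Python sketch, assuming the usual pairwise checkerboard-unit formula (R_i - S)(R_j - S), where R_i and R_j are the two species' occurrence totals and S is the number of sites they share. The function name `checker_units` is made up for this example.

```python
def checker_units(row_i, row_j):
    """Checkerboard units for one species pair: (R_i - S) * (R_j - S)."""
    shared = sum(1 for a, b in zip(row_i, row_j) if a and b)
    return (sum(row_i) - shared) * (sum(row_j) - shared)


species_a = [1, 1, 0, 0]
species_b = [0, 0, 1, 1]
absent    = [0, 0, 0, 0]  # a species that never occurs

print(checker_units(species_a, species_b))  # 4: perfectly segregated pair
print(checker_units(species_a, absent))     # 0: the (R_j - S) factor is 0
```

Any pair involving the absent species scores 0 regardless of the other species' distribution, so leaving such rows in drags the matrix-wide average down.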

I am not sure what to do about this. On the one hand, degenerate matrices can cause problems for casual users who just want to run a null model. On the other hand, degenerate matrices may be needed by those who are doing more advanced simulation work and just want to use the EcoSimR algorithms to create their matrices. And, as you say, there will be some overhead in runtimes if we slow down to check every simulated matrix for empty rows and columns.

In the original EcoSim, the way we handled this was to provide an option for keeping or discarding simulated degenerate matrices. Discarding means that more matrices have to be created, and this can be very slow if non-degenerate matrices are improbable. Whichever option was chosen, the C-score was calculated only for the non-zero rows and columns of each matrix. If you look at my original code for the C-score, you will see there is a line at the start of the function that subsets the matrix in this way before moving into the calculation.
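
The keep-or-discard option could be sketched roughly as below, in Python and not in EcoSimR's R. Everything here is an assumption made for illustration: `random_fill` is a stand-in generator and not one of the sim algorithms, and `simulate_many`, `keep_degenerate`, and `max_draws` are invented names. The returned draw count shows the extra work that discarding costs.

```python
import random


def random_fill(n_rows, n_cols, fill, rng):
    """Stand-in generator: each cell is 1 with probability `fill`."""
    return [[1 if rng.random() < fill else 0 for _ in range(n_cols)]
            for _ in range(n_rows)]


def has_empty_row_or_col(matrix):
    return (any(not any(row) for row in matrix)
            or any(not any(col) for col in zip(*matrix)))


def simulate_many(n_wanted, n_rows, n_cols, fill, keep_degenerate,
                  seed=0, max_draws=100_000):
    """Collect n_wanted matrices; return them with the number of draws used."""
    rng = random.Random(seed)
    kept, draws = [], 0
    while len(kept) < n_wanted:
        if draws >= max_draws:
            raise RuntimeError("draw limit reached before enough matrices were accepted")
        draws += 1
        m = random_fill(n_rows, n_cols, fill, rng)
        # Discard mode rejects any matrix with an empty row or column.
        if keep_degenerate or not has_empty_row_or_col(m):
            kept.append(m)
    return kept, draws


_, draws_keep = simulate_many(20, 4, 4, 0.3, keep_degenerate=True)
_, draws_discard = simulate_many(20, 4, 4, 0.3, keep_degenerate=False)
print(draws_keep, draws_discard)  # discard mode needs at least as many draws
```

The `max_draws` cap is there because, for a very sparse matrix, discard mode might otherwise loop for a very long time.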

emhart commented 9 years ago

I can easily parameterize this. So I should check in all the sim family of fxns for co-occurrence, but what about the ra1-4 algos for niche overlap?

emhart commented 9 years ago

I realize now, after writing the tests for the ra fxns, that they are just reshuffling, so there's no need to worry about them, just the sim algos.

ngotelli commented 9 years ago

Hi @emhart

You are correct, this is not an issue with RA1-4. Those algorithms can generate empty columns, but that is not a problem for the niche overlap indices. I'd like to minimize the amount of checking and error-trapping. That will keep the program running fast, but it will also allow the null matrices to be used for a variety of other purposes that may not be affected by empty rows or columns. So, the simplest thing is to not alter the structure of any matrix, but to exclude empty rows just during the calculation of the following indices:

c_score, c_score_var, c_score_skew, v_ratio

The others will not be affected.
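
A minimal Python sketch of that approach (illustrative only, not the code in metrics.R): the index filters out empty rows internally and leaves the caller's matrix untouched. It assumes rows are species and that the C-score is the mean of the pairwise checkerboard units (R_i - S)(R_j - S) over all species pairs; the `drop_empty_rows` flag is invented here to show the difference.

```python
from itertools import combinations


def c_score(matrix, drop_empty_rows=True):
    """Mean checkerboard units over all row pairs of a 0/1 matrix.

    Empty rows are skipped inside the calculation only; the input
    matrix itself is never modified.
    """
    rows = [row for row in matrix if any(row)] if drop_empty_rows else list(matrix)
    pairs = list(combinations(rows, 2))
    if not pairs:
        return float("nan")  # fewer than two usable rows
    total = 0
    for r_i, r_j in pairs:
        shared = sum(1 for a, b in zip(r_i, r_j) if a and b)
        total += (sum(r_i) - shared) * (sum(r_j) - shared)
    return total / len(pairs)


m = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],  # empty row
]
print(c_score(m))                         # 4.0: one pair, 4 units
print(c_score(m, drop_empty_rows=False))  # 4/3: two extra zero-unit pairs
print(len(m))                             # 3: the input still has its empty row
```

Because the filtering happens only inside the index, the same simulated matrix can still be passed, empty rows and all, to anything else that needs it.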

I am working this week to try and finish off the documentation, and then I will assemble a list of features and issues that we may want to address before the initial release.

emhart commented 9 years ago

@ngotelli looks like you may have already taken care of this in your original code. Does this provide the fix? https://github.com/GotelliLab/EcoSimR/blob/master/R/metrics.R#L408

It looks like it excludes empty rows already.

ngotelli commented 9 years ago

Yes, this is resolved now for all the co-occurrence metrics.