BiologicalRecordsCentre / BRCindicators

An R package for creating indicators from trends data

rescale_species fails with zeros in the input data #34

Open drnickisaac opened 5 years ago

drnickisaac commented 5 years ago

@JackHHatfield91 spotted this.

The rescale_species function has min and max arguments to stop extreme values exerting undue influence on the geometric mean. We always assumed these would address problems with zeros in the input data. However, this is not true, because min and max are applied after the index values have been rescaled.

  # Get the multipliers needed to achieve the index value
  multipliers <- index / Data[1,2:ncol(Data)] 

  # Apply these multipliers
  indicator_scaled <- t(t(Data[,2:ncol(Data)]) * multipliers)

  # Make values over max == max, and < min == min
  indicator_scaled[indicator_scaled < min & !is.na(indicator_scaled)] <- min
  indicator_scaled[indicator_scaled > max & !is.na(indicator_scaled)] <- max

Cases where Data==0 in the first year result in the multiplier taking the value Inf, which means the zero itself becomes NaN in indicator_scaled (0 * Inf). NaN is not captured by either the min or the max statement, because is.na() is TRUE for NaN. Further, when Data==0 in the first year, all nonzero values in subsequent years are first rescaled to Inf and then capped at max.
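A small reproduction may make this concrete. It is only a sketch: the species names and values are made up, Data is assumed to be a numeric matrix with the year in the first column, and index = 100 and max = 10000 are assumed values (only min = 1 is stated as a default in this thread).

```r
# Toy data (hypothetical): column 1 is the year, the rest are species
Data <- cbind(year = 1:4,
              sp_a = c(2, 4, 1, 2),    # no zeros
              sp_b = c(0, 1, 2, 0),    # zero in the first year
              sp_c = c(5, 0, 10, 5))   # zero mid-series only

index <- 100    # assumed value
min   <- 1      # default mentioned in this issue
max   <- 10000  # assumed value

# Same steps as the excerpt quoted above
multipliers <- index / Data[1, 2:ncol(Data)]   # sp_b gets 100 / 0 = Inf
indicator_scaled <- t(t(Data[, 2:ncol(Data)]) * multipliers)

indicator_scaled[indicator_scaled < min & !is.na(indicator_scaled)] <- min
indicator_scaled[indicator_scaled > max & !is.na(indicator_scaled)] <- max

print(indicator_scaled)
# sp_b: NaN, 10000, 10000, NaN  (0 * Inf = NaN; Inf is capped at max)
# sp_c: 100, 1, 200, 100        (the mid-series zero is capped at min)
```

The mid-series zero in sp_c is handled by min as intended; only the first-year zero in sp_b produces NaN and the run of max values.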

We can't solve this using the min and max statements, since these values are relative to the index. Using them to test the raw data would result in unintended consequences (e.g. for occupancy data, all the input data would be below the default value of min=1). Instead, we need two separate fixes:

AugustT commented 5 years ago

Yes, I see that a 0 in the first year is going to cause some problems!

Really this should be generalised to say that any time series starting with a 0 is a problem (i.e. 0,1,2,3 is problematic as is NA,NA,0,1). I think if you take this more general approach and change all 'leading zeros' to NAs then the zeros in the middle of sequences are no longer an issue? I believe once all the leading zeros are removed all the mid-series zeros will be multiplied by the multiplication factor (therefore still zero), and maybe capped to the min. So I don't see a problem there, but I might have overlooked something.
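As a rough sketch of the 'leading zeros to NA' idea: leading_zeros_to_na below is a hypothetical helper, not a function in BRCindicators, and turning an all-zero series entirely into NA is my own assumption about what should happen in that case.

```r
# Hypothetical helper: replace every zero that precedes the first
# non-zero, non-NA value with NA. Mid-series zeros are left untouched.
leading_zeros_to_na <- function(x) {
  # Position of the first value that is neither NA nor zero
  first_pos <- which(!is.na(x) & x != 0)[1]

  if (is.na(first_pos)) {
    # Assumption: a series with no non-zero value becomes all NA
    x[] <- NA
    return(x)
  }

  if (first_pos > 1) {
    x[seq_len(first_pos - 1)] <- NA
  }
  x
}

leading_zeros_to_na(c(0, 1, 2, 3))        # NA 1 2 3
leading_zeros_to_na(c(NA, NA, 0, 1))      # NA NA NA 1
leading_zeros_to_na(c(0, 1, 2, 0, 5))     # NA 1 2 0 5
```

On a matrix of species columns this could be applied with something like apply(Data[, -1], 2, leading_zeros_to_na), assuming the year is in the first column.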

larspett commented 5 years ago

If you have a leading zero in e.g. a butterfly monitoring dataset for TRIM analysis, then there must be prior information around; otherwise it would be NA until the first observation has been made. (That is, the 0 means there was information available saying that the species was there but wasn't found. I am not sure what the model assumptions say about situations like that.) My understanding is that leading zeros are not possible before something has been observed, so 0,1,2,0,5 and NA,0,1,2,0,5 would become NA,1,2,0,5 and NA,NA,1,2,0,5 respectively. The zero inside the series is meaningful, though.

drnickisaac commented 5 years ago

Thanks both. I agree with your logic and suggestion for a fix, @AugustT. @larspett: I agree with you that leading zeros are typically informative. In the application I'm working on, leading zeros arise because I have set the baseline year for the multispecies indicator to be later than the start date for some of the contributing species. Specifically, I'm combining data from multiple taxonomic groups, and I want the indicator to start at the first year in which all groups are present (thus discarding all data prior to this point). It so happens that some species have a true zero in the baseline year I've chosen. I'm taking the path of least resistance by switching those zeros to NA, but I accept that potentially smarter solutions could be developed.
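For illustration only, the workaround could look something like the sketch below. The data, the baseline_year value and the variable names are all hypothetical, and this is a guess at the general shape rather than the actual script used.

```r
# Toy data (hypothetical): column 1 is the year, the rest are species
Data <- cbind(year = 1:5,
              sp_a = c(1, 2, 3, 4, 5),
              sp_b = c(2, 1, 0, 3, 4))   # true zero in year 3

baseline_year <- 3  # assumed: first year in which all groups are present

# Discard all data prior to the baseline year
Data_trunc <- Data[Data[, "year"] >= baseline_year, , drop = FALSE]

# Switch true zeros in the baseline year (now row 1) to NA
species_cols <- 2:ncol(Data_trunc)
zero_at_baseline <- !is.na(Data_trunc[1, species_cols]) &
  Data_trunc[1, species_cols] == 0
Data_trunc[1, species_cols[zero_at_baseline]] <- NA

print(Data_trunc)
```

This only touches the baseline row, so zeros later in the series are kept as they are.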