Closed drnickisaac closed 3 years ago
Also, the final elements of the region variables are all NA. I suspect this is part of the same problem.
I think the problem emerges at line 344 (in my version) of occDetFunc.r
. Here I created an index for the sites. I did this to simplify the old system, which was very confusing and looked prone to error.
site_match <- unique(data.frame(name = occDetdata$site, id = as.numeric(as.factor(occDetdata$site))))
from which we get:
occDetdata <- merge(occDetdata, site_match, by.x='site', by.y="name")
zst <- acast(occDetdata, id ~ factor(TP), value.var = 'focal', max, fill = 0)
And later:
bugs_data <- with(occDetdata, list(y = as.numeric(focal), Year = TP, Site = id, nyear = nTP, nsite = nrow(zst), nvisit = nrow(occDetdata)))
Somehow this allows a situation in which bugs_data$nsite > max(bugs_data$Site)
The code above looks foolproof. There are no bits of code that shorten occDetdata
before bugs_data
is created.
I have explored the behaviour in detail on the test dataset in occDetFunc()
, even removing two sites from occDetdata
in the process. Possibly the key line would be more elegant/foolproof as:
uniqueSites <- unique(occDetdata$site)'
site_match <-data.frame(name = uniqueSites, id = 1:length(uniqueSites))`
But otherwise it looks pretty watertight to me.
It's watertight because all the filtering to remove dodgy sites (e.g. that appear in multiple regions) from occDetdata
happens before the creation of site_match
, and there are no changes to either occDetdata
or zst
after this point in the code.
Ah - I think I see it. I recall @mlogie mentioning a bug caused by the fact that sparta assumes that there is a column in regional_codes
called site
, but that there is no check of this.
In fact, I can see that occDetFunc()
first assumes that the first column contains the site
identity:
https://github.com/BiologicalRecordsCentre/sparta/blob/8451e0140bc51d775c958c4e59d90e0977e2ce1f/R/occDetFunc.r#L280
but then refers to site
explicitly:
https://github.com/BiologicalRecordsCentre/sparta/blob/8451e0140bc51d775c958c4e59d90e0977e2ce1f/R/occDetFunc.r#L290
I have tracked back through the commit history. I can see that this little bomb has been sitting there, waiting to go off, since at least March 2019. I have implemented the following fix:
if(!"site" %in% names(regional_codes)) {
warning(paste0("renaming ", names(regional_codes)[1], " as 'site'"))
names(regional_codes)[1] <- "site"
}
So this should be closed after merging my pull request and testing with the problematic dataset.
Thanks Nick
@AugustT and @mlogie have asked me to look into this problem. In the reproducible example Tom provided I was able to swiftly diagnose the source of the problem:
bugs_data$nsite
=8955
buttail(bugs_data$Site)
returns a set of integers >8955