BiologicalRecordsCentre / sparta

Species Presence/Absence R Trends Analyses
http://biologicalrecordscentre.github.io/sparta/index.html
MIT License
21 stars 24 forks source link

Indexing problem #217

Closed drnickisaac closed 3 years ago

drnickisaac commented 3 years ago

@AugustT and @mlogie have asked me to look into this problem. In the reproducible example Tom provided I was able to swiftly diagnose the source of the problem: bugs_data$nsite = 8955 but tail(bugs_data$Site) returns a set of integers >8955

drnickisaac commented 3 years ago

Also, the final elements of the region variables are all NA. I suspect this is part of the same problem.

I think the problem emerges at line 344 (in my version) of occDetFunc.r. Here I created an index for the sites. I did this to simplify the old system, which was very confusing and looked prone to error. site_match <- unique(data.frame(name = occDetdata$site, id = as.numeric(as.factor(occDetdata$site)))) from which we get: occDetdata <- merge(occDetdata, site_match, by.x='site', by.y="name") zst <- acast(occDetdata, id ~ factor(TP), value.var = 'focal', max, fill = 0) And later: bugs_data <- with(occDetdata, list(y = as.numeric(focal), Year = TP, Site = id, nyear = nTP, nsite = nrow(zst), nvisit = nrow(occDetdata))) Somehow this allows a situation in which bugs_data$nsite > max(bugs_data$Site)

drnickisaac commented 3 years ago

The code above looks foolproof. There are no bits of code that shorten occDetdata before bugs_data is created. I have explored the behaviour in detail on the test dataset in occDetFunc(), even removing two sites from occDetdata in the process. Possibly the key line would be more elegant/foolproof as: uniqueSites <- unique(occDetdata$site)' site_match <-data.frame(name = uniqueSites, id = 1:length(uniqueSites))` But otherwise it looks pretty watertight to me.

drnickisaac commented 3 years ago

It's watertight because all the filtering to remove dodgy sites (e.g. that appear in multiple regions) from occDetdata happens before the creation of site_match, and there are no changes to either occDetdata or zst after this point in the code.

drnickisaac commented 3 years ago

Ah - I think I see it. I recall @mlogie mentioning a bug caused by the fact that sparta assumes that there is a column in regional_codes called site, but that there is no check of this. In fact, I can see that occDetFunc() first assumes that the first column contains the site identity: https://github.com/BiologicalRecordsCentre/sparta/blob/8451e0140bc51d775c958c4e59d90e0977e2ce1f/R/occDetFunc.r#L280 but then refers to site explicitly: https://github.com/BiologicalRecordsCentre/sparta/blob/8451e0140bc51d775c958c4e59d90e0977e2ce1f/R/occDetFunc.r#L290

drnickisaac commented 3 years ago

I have tracked back through the commit history. I can see that this little bomb has been sitting there, waiting to go off, since at least March 2019. I have implemented the following fix: if(!"site" %in% names(regional_codes)) { warning(paste0("renaming ", names(regional_codes)[1], " as 'site'")) names(regional_codes)[1] <- "site" }

drnickisaac commented 3 years ago

So this should be closed after merging my pull request and testing with the problematic dataset.

AugustT commented 3 years ago

Thanks Nick