ebird / ebird-best-practices

Best Practices for Using eBird Data
https://ebird.github.io/ebird-best-practices/
GNU General Public License v3.0

Compatibility between v1 and v2 #3

Closed jacvig closed 5 months ago

jacvig commented 5 months ago

This is a fantastic resource, and I have been using it to teach myself R with my own dataset, starting with v1 of the code. I worked up to ch. 3 Environmental Variables on v1 before you updated to v2. To continue, I had to go back and create a new data frame containing a type variable with test/train as values, which got me to ch. 4. Now I am stuck at 4.6.2 Model Estimates, specifically at this line: pred_er <- predict(er_model, data = pred_grid_eff, type = "response")

The error indicates that a variable is missing, and I think it is the "type" variable, since it is absent from the data frame pred_grid_eff. I have re-run the prediction grid code, but this variable does not carry over into the new data frame.

Is this an issue with the code or a compatibility issue between v1 and v2? If the latter, the only solution I can come up with is to import a new dataset and start from the very beginning so that it follows this code exactly.

Any ideas would be much appreciated. And thanks again for making this available.

mstrimas commented 5 months ago

If you look at the call to ranger() and identify the training dataset (data = checklists_train), all the variables in that data frame will need to be in the dataset you pass to predict() with the exception of the response variable species_observed. You can check which variables are missing with:

setdiff(names(checklists_train), names(pred_grid_eff))

Let me know what you find and I can try troubleshooting from there.
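
A minimal sketch of the check above, using toy data frames in place of the real ones. The column names mirror those mentioned in this thread, but the values are made up, and the reverse comparison (extra_vars) is an addition for illustration, not part of the suggestion above:

```r
# Toy stand-ins for the real data frames (values are invented).
checklists_train <- data.frame(
  species_observed = c(TRUE, FALSE, FALSE),
  duration_minutes = c(60, 30, 90),
  effort_distance_km = c(2, 1, 3),
  number_observers = c(1, 2, 1)
)
pred_grid_eff <- data.frame(
  effort_hours = c(1, 1),
  effort_distance_km = c(2, 2),
  number_observers = c(1, 1)
)

# Columns the model was trained on that the prediction grid lacks.
missing_vars <- setdiff(names(checklists_train), names(pred_grid_eff))

# The response is expected to be absent from the grid, so drop it.
missing_predictors <- setdiff(missing_vars, "species_observed")

# The reverse comparison can hint at a renamed column.
extra_vars <- setdiff(names(pred_grid_eff), names(checklists_train))

print(missing_predictors)
print(extra_vars)
```

A predictor that appears in missing_predictors alongside a similarly themed name in extra_vars suggests a rename between versions rather than a truly absent variable.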

jacvig commented 5 months ago

Thanks, it looks as if there are two variables that are missing: species_observed and duration_minutes. I renamed the covariate effort_hours to duration_minutes:

pred_grid_eff <- pred_grid |>
  mutate(observation_date = ymd("####-##-##"),
         year = year(observation_date),
         day_of_year = yday(observation_date),
         hours_of_day = t_peak,
         effort_distance_km = 2,
         duration_minutes = 60,
         effort_speed_kmph = 2,
         number_observers = 1)

It runs now! Thanks, I had caught another mis-named variable but not duration_minutes. I am using an eBird dataset exported in 2022 so the variable names have changed with the upgrade.
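
If a prediction grid has already been built with the newer column name, one option is to convert it rather than rebuild it. This is a sketch in base R on a toy grid. It assumes effort_hours is measured in hours and duration_minutes in minutes, which the names suggest but the thread does not state outright:

```r
# Toy prediction grid using the newer column name (values are invented).
pred_grid_eff <- data.frame(
  effort_hours = c(1, 1),
  effort_distance_km = c(2, 2)
)

# Assumption: hours -> minutes is a factor of 60.
pred_grid_eff$duration_minutes <- pred_grid_eff$effort_hours * 60

# Drop the column the older model does not know about.
pred_grid_eff$effort_hours <- NULL

print(names(pred_grid_eff))
```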

I do have another question, which might be better suited to a new thread or be out of scope entirely. My distribution map appears inconsistent with the encounter rate map: it shows only a tiny range compared with the encounter rate predictions. Could this be due to a poor prediction model, or to the fact that my data cover a very scarce species? I also wonder whether the habitat covariate classification system is inappropriate for my geographic range or species. In any case, thanks for the troubleshooting.

mstrimas commented 5 months ago

Glad you got the variable name issue sorted out!

Regarding your other question, can you provide some information about the species and region you're working with?

jacvig commented 5 months ago

I'm modelling Willow Tit in the UK. [Images: DistributionMap, EncounterRateMap]

mstrimas commented 5 months ago

This is strange. It's hard to say exactly what's going on without stepping through the code myself, but it seems like there must be an issue with how the thresholding is done. For example, is it possible you applied the threshold to the calibrated prediction rather than the raw prediction? You might try turning off the calibration step entirely.
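
To illustrate why this could shrink the mapped range, here is a toy sketch. The numbers are invented, and the "calibration" is a hypothetical multiplicative shrink standing in for whatever calibration model is actually used. The point is only that a threshold chosen on one scale can exclude almost everything when applied to predictions on another scale:

```r
# Invented raw model predictions for five grid cells.
raw_pred <- c(0.05, 0.20, 0.35, 0.50, 0.65)

# Hypothetical stand-in for calibration: shrinks predictions toward zero,
# as might happen for a scarce species. Not the guide's calibration method.
calibrated_pred <- raw_pred * 0.3

# Threshold assumed to have been chosen on the raw prediction scale.
threshold <- 0.30

# Binary range from each set of predictions.
in_range_raw <- raw_pred >= threshold
in_range_cal <- calibrated_pred >= threshold

print(sum(in_range_raw))
print(sum(in_range_cal))
```

Comparing the number of cells classified as in range under each option is a quick way to see whether a scale mismatch explains a very small distribution map.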

jacvig commented 5 months ago

Thanks for this. I will need to dig around a bit more and see if I can identify the issue. Thanks again.