Biohackaton 2021 - Team investigating recorder behaviour on iNaturalist
The goal is the create metrics for a list of 'mean' observers. We created a gbif export for research grade inaturalist observations, in spain, since 2016. Then filtered on observers with atleast 100 observations since then.
The export is here: GBIF.org (09 November 2021) GBIF Occurrence Download https://doi.org/10.15468/dl.svwnrp
This results in 461 observers meeting this criteria.
Make a directory called data_inat
in the main repo directory and use aws command line tool (https://aws.amazon.com/cli/) to download data:
mkdir data_inat
cd data_inat
aws s3 cp s3://inaturalist-open-data/observations.csv.gz observations.csv.gz --no-sign-request
aws s3 cp s3://inaturalist-open-data/taxa.csv.gz taxa.csv.gz --no-sign-request
The data_inat
directory is git ignored so will not be committed/pushed. This data is currently not used in the workflow.
This part of the process is run in different code chunks in https://github.com/AugustT/BioHack_iNat/tree/main/2_Generating_iNat_RM_Baseline
From the 461 observers from GBIF - we don't actually have their usernames, we have a recordedBy
column which is a mix of names and usernames. Therefore we need to resolve these to iNaturalist usernames. We can do this by using the endpoint: https://api.inaturalist.org/v1/docs/#!/Users/get_users_autocomplete
If multiple users are found by this method we currently just drop those users. This leaves us with 319 users.
Next we make a query to the iNaturalist API https://api.inaturalist.org/v1/docs/#!/Observations/get_observations but it is wrapped in the get_inat_obs_user()
function in R package rinat
https://cran.r-project.org/web/packages/rinat/index.html. However in order to filter by location and species I have edited the function to access extra bit of the query, which for plants in Spain is "&taxon_id=47126&place_id=6774"
.
This query is run for all users with a maximum number of records as 5000 - some observers do have over 5000 records so this is just an arbitrary limit whilst we're hacking. This results is then saved as a dataframe as a .rds
object in https://github.com/AugustT/BioHack_iNat/tree/main/2_Generating_iNat_RM_Baseline/data/user_obvs
These are then all loaded in and combined into one dataframe using
all_observations_raw <- paste0("2_Generating_iNat_RM_Baseline/data/user_obvs/",list.files(path = "2_Generating_iNat_RM_Baseline/data/user_obvs", pattern = ".rds")) %>%
map(readRDS) %>%
bind_rows()
Appropriate columns are selected:
id
- the observation idrecorder
- the recorder usernamespecies
- the species scientific namedate
- the date in YYYY-MM-DD format and converted to date using as.Date()
long
- decimal longitudelat
- decimal latitudequality_grade
- not essential for recorder metrics but possible usefulWe also derive these columns
species_level
- is the record identified to species level? (determined in a hacky was by checking if there's a space in the species
columncountry
- country determined by using function coords2country()
function defined in coords2country.R
, then used to remove any records that are not from the target countrykm_sq
- the km square (stored as character) derived by converting points to spanish national grid and rounding to nearest 1000 metresThis file is then saved to 2_Generating_iNat_RM_Baseline/data/observations/baseline_observations_spain_plants.rds
and is ready for plugging into recorderMetrics package.