CornellLabofOrnithology / ebird-best-practices

Best Practices for Using eBird Data
https://CornellLabOfOrnithology.github.io/ebird-best-practices/

error with format_unmarked_occu #8

Closed lime-n closed 4 years ago

lime-n commented 4 years ago

I am trying to prepare a data frame of species observations, with habitat covariates, for abundance and occurrence modelling.

When using this code:

library(auk)
occ_wide <- format_unmarked_occu(occ, 
                                 site_id = "site", 
                                 response = "species_observed",
                                 site_covs = c("n_observations", 
                                               "latitude", "longitude", 
                                               "pland_00_water",
                                               "pland_11_wetland",
                                               "pland_12_cropland",
                                               "pland_13_urban"),

                                 obs_covs = c("time_observations_started", 
                                              "duration_minutes", 
                                              "effort_distance_km", 
                                              "number_observers", 
                                              "protocol_type",
                                             "pland_11_wetland"))

I get this error:

Error in format_unmarked_occu(pract, site_id = "site", response = "species_observed", : Site-level covariates must be constant across sites

What does this mean and how can I overcome this error?

A reproducible example:

structure(list(site = c("L10018668_obs439702_2020", "L10018668_obs439702_2020", 
"L10018668_obs439702_2020", "L10018668_obs439702_2020", "L10024459_obs1462591_2020", 
"L10024459_obs1462591_2020"), closure_id = c("2020", "2020", 
"2020", "2020", "2020", "2020"), n_observations = c(4L, 4L, 4L, 
4L, 6L, 6L), checklist_id = c("S62823384", "S62823384", "S62823384", 
"S62823384", "S62830871", "S62830871"), observer_id = c("obs439702", 
"obs439702", "obs439702", "obs439702", "obs1462591", "obs1462591"
), sampling_event_identifier = c("S62823384", "S62823384", "S62823384", 
"S62823384", "S62830871", "S62830871"), scientific_name = c("Calidris canutus", 
"Calidris canutus", "Calidris canutus", "Calidris canutus", "Calidris canutus", 
"Calidris canutus"), observation_count = c(0, 0, 0, 0, 0, 0), 
    species_observed = c(0L, 0L, 0L, 0L, 0L, 0L), state_code = c("AU-VIC", 
    "AU-VIC", "AU-VIC", "AU-VIC", "AU-NSW", "AU-NSW"), locality_id = c("L10018668", 
    "L10018668", "L10018668", "L10018668", "L10024459", "L10024459"
    ), latitude = c(-37.0209359, -37.0209359, -37.0209359, -37.0209359, 
    -34.785917, -34.785917), longitude = c(145.1458832, 145.1458832, 
    145.1458832, 145.1458832, 150.750221, 150.750221), protocol_type = c("Stationary", 
    "Stationary", "Stationary", "Stationary", "Traveling", "Traveling"
    ), all_species_reported = c(TRUE, TRUE, TRUE, TRUE, TRUE, 
    TRUE), observation_date = structure(c(18262, 18262, 18262, 
    18262, 18262, 18262), class = "Date"), year = c(2020, 2020, 
    2020, 2020, 2020, 2020), day_of_year = c(1, 1, 1, 1, 1, 1
    ), time_observations_started = c(8.71666666666667, 8.71666666666667, 
    8.71666666666667, 8.71666666666667, 14.8166666666667, 14.8166666666667
    ), duration_minutes = c(30, 30, 30, 30, 59, 59), effort_distance_km = c(0, 
    0, 0, 0, 0.805, 0.805), number_observers = c(1, 1, 1, 1, 
    1, 1), pland_00_water = c(NA, NA, NA, NA, 0.032258064516129, 
    NA), pland_01_evergreen_needleleaf = c(NA, NA, NA, NA, NA, 
    NA), pland_02_evergreen_broadleaf = c(NA, NA, NA, NA, NA, 
    0.129032258064516), pland_03_deciduous_needleleaf = c(NA, 
    NA, NA, NA, NA, NA), pland_04_deciduous_broadleaf = c(NA, 
    NA, NA, NA, NA, NA), pland_05_mixed_forest = c(NA, NA, NA, 
    NA, NA, NA), pland_06_closed_shrubland = c(NA, NA, NA, NA, 
    NA, NA), pland_07_open_shrubland = c(NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_), pland_08_woody_savanna = c(NA, 
    NA, NA, NA, NA, NA), pland_09_savanna = c(NA, NA, NA, NA, 
    NA, NA), pland_10_grassland = c(NA, NA, NA, NA, NA, NA), 
    pland_11_wetland = c(NA, NA, NA, NA, NA, NA), pland_12_cropland = c(NA, 
    NA, NA, NA, NA, NA), pland_13_urban = c(NA, NA, NA, 0.125, 
    NA, NA), pland_14_mosiac = c(NA, NA, NA, NA, NA, NA), pland_15_barren = c(NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), id = c(57262, 
    57262, 57262, 57262, 85293, 85293), elevation_median = c(150.127420697893, 
    150.127420697893, 150.127420697893, 150.127420697893, 17.1925210271563, 
    17.1925210271563), elevation_sd = c(5.25428441050561, 5.25428441050561, 
    5.25428441050561, 5.25428441050561, 7.86367063800502, 7.86367063800502
    )), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))

mstrimas commented 4 years ago

In this case the problem is exactly what the error message says: every variable you have listed in site_covs needs to be constant within each site, i.e. all rows sharing a site ID must have the same value. That's not happening in your case, e.g.

select(occ, site, pland_00_water)
# site                      pland_00_water
# <chr>                              <dbl>
# 1 L10018668_obs439702_2020         NA     
# 2 L10018668_obs439702_2020         NA     
# 3 L10018668_obs439702_2020         NA     
# 4 L10018668_obs439702_2020         NA     
# 5 L10024459_obs1462591_2020         0.0323
# 6 L10024459_obs1462591_2020        NA  

lime-n commented 4 years ago

Any suggestions on how I can overcome this?

mstrimas commented 4 years ago

Since these rows are from exactly the same location in the same year, they should have the same covariates. The fact that they don't suggests something went wrong during covariate assignment, but I have no idea what could have happened here.

mstrimas commented 4 years ago

It could be worth checking if it's always happening with NA values or if you have some cases where two sites have different non-NA values.
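
The check suggested above can be sketched roughly as follows. This is a minimal illustration on toy data, not the actual dataset from the issue; the `site_covs` vector and the toy values are assumptions made for the example. It counts distinct values per site twice, once treating NA as a value and once ignoring it, so NA-only mismatches can be told apart from genuinely conflicting values.

```r
library(dplyr)

# Toy data mirroring the structure in the issue (values are illustrative only)
occ <- tibble(
  site = c("A", "A", "A", "B", "B"),
  pland_00_water = c(NA, NA, NA, 0.0323, NA),
  pland_13_urban = c(0.125, 0.125, 0.125, 0.2, 0.3)
)

# Hypothetical list of site-level covariates to check
site_covs <- c("pland_00_water", "pland_13_urban")

# Distinct values per site, counting NA as a value
n_all <- occ %>%
  group_by(site) %>%
  summarise(across(all_of(site_covs), ~ n_distinct(.x)), .groups = "drop")

# Distinct values per site, ignoring NA
n_non_na <- occ %>%
  group_by(site) %>%
  summarise(across(all_of(site_covs), ~ n_distinct(.x, na.rm = TRUE)),
            .groups = "drop")

print(n_all)
print(n_non_na)
```

A count above 1 in `n_all` but not in `n_non_na` points to an NA mismatch only, while a count above 1 in `n_non_na` means a site carries two different real values.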

lime-n commented 4 years ago

I believe it may happen when I read the CSV into R, because I get this warning:

Warning: 317279 parsing failures.
  row                           col           expected              actual                                file
 2523 pland_08_woody_savanna        1/0/T/F/TRUE/FALSE 0.1935483870967742  'data/pland-elev_location-year.csv'
 2524 pland_08_woody_savanna        1/0/T/F/TRUE/FALSE 0.26666666666666666 'data/pland-elev_location-year.csv'
44813 pland_01_evergreen_needleleaf 1/0/T/F/TRUE/FALSE 0.4666666666666667  'data/pland-elev_location-year.csv'
44814 pland_01_evergreen_needleleaf 1/0/T/F/TRUE/FALSE 0.09375             'data/pland-elev_location-year.csv'
44815 pland_01_evergreen_needleleaf 1/0/T/F/TRUE/FALSE 0.53125             'data/pland-elev_location-year.csv'
..... ............................. .................. ................... ...................................
See problems(...) for more details.

Some of the covariates have a mixture of values: some contain only TRUE, others contain numeric values, and some contain only NAs.

I didn't get any errors while running the code, so I find this confusing.
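
The "1/0/T/F/TRUE/FALSE" in the expected column means readr treated those columns as logical, which can happen when the rows it uses for type guessing contain only NA. One possible way around this, sketched here on a temporary toy file rather than the real 'data/pland-elev_location-year.csv', is to declare the column types explicitly (raising `guess_max` is an alternative). The file contents and row counts below are assumptions made for illustration.

```r
library(readr)

# Toy file: the covariate is NA for the first 1,500 rows, so a parser that
# guesses types from the top of the file sees no numbers at all
tmp <- tempfile(fileext = ".csv")
toy <- data.frame(
  site = paste0("L", seq_len(1501)),
  pland_08_woody_savanna = c(rep(NA_real_, 1500), 0.1935)
)
write_csv(toy, tmp)

# Declare the types explicitly instead of relying on guessing
pland <- read_csv(
  tmp,
  col_types = cols(
    site = col_character(),
    pland_08_woody_savanna = col_double()
  )
)

# Any remaining parsing failures would be listed here
problems(pland)
```

With the type declared, the late numeric value is read as a double instead of being reported as a parsing failure.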

lime-n commented 4 years ago

It may have been because I changed this code:

pland <- pland %>% 
  pivot_wider(names_from = lc_name, 
              values_from = pland, 
              values_fill = list(pland = 0))

as it was not working, it returned this error:

Error: Can't convert to .
Run rlang::last_error() to see where the error occurred.
In addition: Warning message:
Values are not uniquely identified; output will contain list-cols.

Use values_fn = list to suppress this warning.
Use values_fn = length to identify where the duplicates arise
Use values_fn = {summary_fun} to summarise duplicates

to this:

pland <- pland %>%
  group_by(lc_name) %>%
  mutate(row = row_number()) %>%
  tidyr::pivot_wider(names_from = lc_name, values_from = pland) %>%
  select(-row)

right before writing it to a .csv file.

It could be that I forgot to carry values_fill = list(pland = 0) over into the replacement code.

lime-n commented 4 years ago

After a long session of uploading all the code, the change I mentioned has removed the parsing warnings, but I still get the same error about the same covariates. Thankfully, there are no more NA values; these have been filled with 0.

I believe the problem lies within the pland code and that there are duplicate entries. I have confirmed this with values_fn = list(pland = length), with some columns returning 2 as opposed to 1 or 0.

Is there a way to summarise the duplicates so they only return 1?
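
One possible approach, sketched on toy data, is to first list which key combinations are duplicated, then drop exact duplicate rows and collapse any remaining duplicates with a summary function in `values_fn`. The key columns `locality_id` and `year_lc` and all values below are assumptions for illustration, and whether `mean` is an appropriate summary depends on why the duplicates exist in the first place; fixing the cause upstream is usually preferable.

```r
library(dplyr)
library(tidyr)

# Toy long-format land cover data with one exact duplicate row
pland <- tibble(
  locality_id = c("L1", "L1", "L1", "L2"),
  year_lc     = c("2019", "2019", "2019", "2019"),
  lc_name     = c("pland_00_water", "pland_00_water",
                  "pland_13_urban", "pland_13_urban"),
  pland       = c(0.10, 0.10, 0.25, 0.50)
)

# Which location/year/class combinations appear more than once?
dupes <- pland %>%
  count(locality_id, year_lc, lc_name) %>%
  filter(n > 1)

# Drop exact duplicates, summarise any that remain, fill missing classes with 0
pland_wide <- pland %>%
  distinct() %>%
  pivot_wider(names_from = lc_name,
              values_from = pland,
              values_fn = list(pland = mean),
              values_fill = list(pland = 0))

print(dupes)
print(pland_wide)
```

This keeps one row per location and year, whereas adding a row-number column before pivoting may leave each duplicate on its own row with NA in the other columns.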

lime-n commented 4 years ago

When I look for the frequency of duplicates, it shows this:

> head(data.frame(table(occ$site)))
                       Var1 Freq
1 L10000468_obs1252332_2019    4
2  L10000750_obs132896_2019   10
3 L10001060_obs1162224_2019    8
4  L10001830_obs476367_2019    3
5  L10002157_obs163161_2019   10
6  L10002592_obs500379_2019   10

Would deleting the sites that occur more than once work, or would I be losing valuable data?

mstrimas commented 4 years ago

I'm not sure what's going on here, there's clearly some issue with the data processing, but it's hard to say what it is without a concise reproducible example. I am fairly certain there isn't any issue with the format_unmarked_occu() function though, and at the moment I don't have time to look into this further. I will come back to this if I do manage to have some spare time.

lime-n commented 4 years ago

I have uploaded all the files necessary to reproduce the error on my Github here: https://github.com/lime-n/ebird_data

It may be best to read occ.csv only.

I am working with 9 different species, and this error occurs for 7 of the 9, even though I followed the covariate code 'as is' and it ran without error.

I have tried looking on Stack Exchange and cannot figure out the problem; maybe you will have better luck.

I have found it to work with:

library( data.table )
# keep only the last row for each site (the argument is fromLast, with a capital L)
setDT(mydata)[!duplicated(site, fromLast = TRUE), ]

However, it removes up to 10,000 rows of data, which may skew my analysis, given that my other two datasets work as is.

An extra problem with this approach is that it produces negative values for the occ_model, so occ_gof cannot run.

I found that the problem is with the longitude coordinates. I'm not sure how it got there, as the exact same code works perfectly well for some other species. Could it be an eBird error?

lime-n commented 4 years ago

It seems that writing the data frame to a CSV and then loading it back into the R environment solves the issue. How strange!

mstrimas commented 4 years ago

That is strange! I don't foresee myself having time to look into this further in the near future, but it seems like you may have figured out a solution.

lime-n commented 4 years ago

When writing the .csv, I noticed that it introduces another column holding each row's number. It seems this was the magic column that fixed the problem.

mstrimas commented 4 years ago

Hmm, that seems very odd to me; this shouldn't fix the issue. Are you sure it's not just preventing the error while causing an incorrect result? In general, I suggest always using write_csv() from the readr package, which doesn't add this extra column. You almost never want to save the row number.
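
To illustrate the difference, here is a small sketch comparing the header lines produced by base R's write.csv() and readr's write_csv(). The data frame and temporary files are made up for the example.

```r
library(readr)

df <- data.frame(site = c("A", "B"), pland_00_water = c(0.1, 0.2))

base_file  <- tempfile(fileext = ".csv")
readr_file <- tempfile(fileext = ".csv")

# Base R writes row names as an unnamed first column by default
write.csv(df, base_file)
# readr never writes row names
write_csv(df, readr_file)

base_header  <- readLines(base_file, n = 1)
readr_header <- readLines(readr_file, n = 1)

print(base_header)
print(readr_header)
```

If base R is preferred, write.csv(df, base_file, row.names = FALSE) also omits the extra column.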