species_id is NA in Project Feeder Watch Collections

steffilazerte commented 4 years ago

A user reported a problem with downloading PFW data via naturecounts. While dealing with this problem, @steffilazerte noticed that the species_id values were missing from the PFW data:

> nc_data_dl(collection = "PFW", years = 2017, doy = 1, 
             fields_set = "extended", info = "testing bug", 
             username = "steffilazerte") %>%
    dplyr::select(species_id, SpeciesCode, ObservationDate) %>% 
    head()

  species_id SpeciesCode     ObservationDate
1         NA      moudov Jan  1 2017 12:00AM
2         NA      norcar Jan  1 2017 12:00AM
3         NA      whbnut Jan  1 2017 12:00AM
4         NA      haiwoo Jan  1 2017 12:00AM
5         NA      dowwoo Jan  1 2017 12:00AM
6         NA      haiwoo Jan  1 2017 12:00AM

In contrast, other collections do return species_id, for example:

> nc_data_dl(collection = "RCBIOTABASE", species = 14280, years = 2010,
             fields_set = "extended", username = "sample", info = "nc_example") %>%   
    dplyr::select(species_id, SpeciesCode) %>%
    head()

  species_id SpeciesCode
1      14280       14280
2      14280       14280
3      14280       14280
4      14280       14280
5      14280       14280
6      14280       14280

@cjardine-bsc asked @pmorrill for a link between the API and the database tables and used that to track it down to the API: "That suggests that the problem ... is rooted in the API not the DB (if it's pulling from bscakn.bmde_data directly)."

species_id should be non-NA unless there is no observation, as it's an important field.
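
A quick way to quantify the gap is to compute the share of records with a missing species_id per collection. The following is only a sketch: the `obs` data frame is a toy stand-in for whatever `nc_data_dl()` returns, and its values are illustrative, not real download results.

```r
# Toy stand-in for a data frame returned by nc_data_dl(); values are illustrative only
obs <- data.frame(
  collection  = c("PFW", "PFW", "PFW", "RCBIOTABASE"),
  species_id  = c(NA, NA, NA, 14280),
  SpeciesCode = c("moudov", "norcar", "whbnut", "14280"),
  stringsAsFactors = FALSE
)

# Proportion of records with a missing species_id, per collection
na_rate <- tapply(is.na(obs$species_id), obs$collection, mean)
print(na_rate)

# Species codes that came back without a species_id
codes_without_id <- unique(obs$SpeciesCode[is.na(obs$species_id)])
print(codes_without_id)
```

A rate of 1 for a collection would mean every record in it lacks a species_id.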

cjardine-bsc commented 4 years ago

Thanks for the extra details.

It IS a database problem. There is a subset of PFW records missing species_id in bmde_data. I'll look into it.

pmorrill commented 4 years ago

Ha. That's good to know as I was going to report confusion....if it is an API problem. (WT Heck?)

cjardine-bsc commented 4 years ago

yup. looks like PFW changed the species codes they were using which broke the join to the lk_species table during imports to BMDE.

Denis, the PFW codes are closest to the EBIRD1.05 authority, but not exact, as there are two codes missing: rustow and amegol.

I'm undecided if we should create a new PFW authority or just use EBIRD1.05. For now I've elected to use EBIRD1.05 and add the two missing codes. You can let me know if you prefer the other option.

I'm re-importing the PFW data with species_ids now. It will take a little while, but you should see it all there through the R package by tomorrow.
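
For anyone following along, the failure mode described above (changed codes breaking the join to the lookup table) can be sketched roughly as below. This is not the actual import code: the column names `species_code` and `record_id` and all species_id values are invented for illustration, and only the two codes rustow and amegol come from this thread.

```r
# Hypothetical lookup: column names and species_id values are invented for illustration
lk <- data.frame(
  species_code = c("moudov", "norcar", "whbnut"),
  species_id   = c(101, 102, 103),
  stringsAsFactors = FALSE
)

# Incoming records; "rustow" and "amegol" stand in for codes absent from the lookup
recs <- data.frame(
  record_id    = 1:5,
  species_code = c("moudov", "rustow", "norcar", "amegol", "whbnut"),
  stringsAsFactors = FALSE
)

# Left join: every record is kept, unmatched codes end up with species_id = NA
joined <- merge(recs, lk, by = "species_code", all.x = TRUE)
joined <- joined[order(joined$record_id), ]  # merge() sorts by key, so restore record order
print(joined)

# Codes that would need to be added to the lookup before re-importing
missing_codes <- setdiff(recs$species_code, lk$species_code)
print(missing_codes)
```

Listing the unmatched codes before an import is a cheap way to catch this kind of change in the source codes early.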

steffilazerte commented 4 years ago

Should this be ready now? I'm actually getting no matches at all...

nc_data_dl(collection = "PFW", info = "testing bug",
           username = "steffilazerte") 

Using filters: collections (PFW); fields_set (BMDE2.00-min)
Collecting available records...
 Error: These collections have no data that match these filters

In fact, replicating the request I made at the start of this issue, with the same request id, gave the confusing result of claiming to download 1299 records but returning none:

> nc_data_dl(request_id = 156239, fields_set = "extended", username = "steffilazerte") %>%
             dplyr::select(species_id, SpeciesCode, ObservationDate) %>%
             head()
Using filters: collections (PFW); fields_set (BMDE2.00-ext)
Collecting available records...
  collection nrecords
1        PFW     1299
Total records: 1,299

Downloading records for each collection:
  PFW
    Records 1 to 1299 / 1299

[1] species_id      SpeciesCode     ObservationDate
<0 rows> (or 0-length row.names)

cjardine-bsc commented 4 years ago

Not yet, it's been taking longer than I thought.

steffilazerte commented 4 years ago

Now I'm getting data, but unfortunately the species_ids are still missing:

nc_data_dl(request_id = 156239, fields_set = "extended", username = "steffilazerte") %>%
  dplyr::select(species_id, SpeciesCode, ObservationDate) %>%
  head()

# Using filters: collections (PFW); fields_set (BMDE2.00-ext)
# Collecting available records...
#   collection nrecords
# 1        PFW     1299
# Total records: 1,299

# Downloading records for each collection:
#  PFW
#      Records 1 to 1299 / 1299
#    species_id SpeciesCode     ObservationDate
#  1         NA      rebnut Jan  1 2017 12:00AM
#  2         NA      whbnut Jan  1 2017 12:00AM
#  3         NA      dowwoo Jan  1 2017 12:00AM
#  4         NA      haiwoo Jan  1 2017 12:00AM
#  5         NA      whbnut Jan  1 2017 12:00AM
#  6         NA      bkcchi Jan  1 2017 12:00AM

denislepage commented 4 years ago

Just cleaning up some messages, but the latest eBird codes are here in this table: lk_species_ebird_taxon

I strongly suspect the old PFW data will need to be matched against the older codes, and the newer data with the new codes, unless we ask Cornell for a complete set again.

Unless this remains an unresolved issue, I would probably wait until next summer when they will be expected to send us the last year of data.

It might be worth checking whether the codes you used match species_status = 1 in the lookup table (full species). I suspect some IDs have been deprecated and we should avoid using those (species_id = 0 in particular, but you may also have IDs that refer to split or lumped species).
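
A rough sketch of that check, assuming the lookup can be pulled into a data frame with `species_id` and `species_status` columns. The rows below are invented, and the status value 2 is only a placeholder for "not a full species".

```r
# Hypothetical extract of the lookup table; all values are invented for illustration
lk <- data.frame(
  species_id     = c(0, 101, 102, 103),
  species_code   = c("unknown", "moudov", "norcar", "oldcode"),
  species_status = c(0, 1, 1, 2),  # 1 = full species (per the thread); other values assumed
  stringsAsFactors = FALSE
)

# IDs assigned during the import (illustrative)
ids_used <- c(101, 103, 0, 102)

# Flag any used ID that is not a full species, or that is the placeholder ID 0
used    <- lk[lk$species_id %in% ids_used, ]
suspect <- used[used$species_status != 1 | used$species_id == 0, ]
print(suspect)
```

Anything listed in `suspect` would be worth reviewing before relying on the assigned IDs.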

BirdsCanada / NatureCountsAPI

species_id is NA in Project Feeder Watch Collections #33