USFWS / fwspp

Query species occurrence observations on USFWS properties
Creative Commons Zero v1.0 Universal

datecollected vs eventDate #56

Open mgaynor1 opened 7 months ago

mgaynor1 commented 7 months ago

Currently, ridigbio returns datecollected by default, which we do not recommend using in scientific research. When a data provider does not supply a full date in the Darwin Core eventDate field, either the whole value or its missing parts (i.e., month and/or day) are randomly generated and thus may lack any real meaning. The generated dates are difficult to detect because they are randomly distributed. We are currently working to modify our ingestion pipeline to avoid randomly generating dates. However, dates remain an issue across biodiversity aggregators and the solution is not clear (see GBIF for example).

Why does this matter for fwspp? I found that this repository uses datecollected as if it were a real value. This may lead to artificial dates being used to make management decisions!

How to use other fields: We plan to update the ridigbio package to instead return "data.dwc:eventDate", "data.dwc:year", "data.dwc:month", and "data.dwc:day", which are all text fields rather than dates. These fields are not randomly generated. Their values come directly from data providers, so they may carry real meaning in biological research. See current issue and pull request.
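As a rough illustration of why the provider fields matter, the sketch below flags records whose provider-supplied text contains a complete date. Records that fail this check are the ones where, per the description above, datecollected may be partly or wholly generated. The helper name has_full_provider_date is hypothetical, and the check assumes eventDate uses an ISO-style YYYY-MM-DD prefix, which providers do not always follow.

```r
# Hypothetical helper (not part of ridigbio or fwspp): TRUE when the
# provider-supplied text fields appear to contain a complete date.
has_full_provider_date <- function(event_date, year, month, day) {
  # Assumption: a full eventDate starts with YYYY-MM-DD.
  iso_full <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}",
                    trimws(as.character(event_date)))
  # Alternatively, all three of year, month, and day are present as integers.
  parts_full <- !is.na(suppressWarnings(as.integer(year))) &
    !is.na(suppressWarnings(as.integer(month))) &
    !is.na(suppressWarnings(as.integer(day)))
  iso_full | parts_full
}

# Made-up example values, for illustration only.
flag <- has_full_provider_date(
  event_date = c("2001-05-17", "1998-06", NA),
  year       = c("2001", "1998", NA),
  month      = c("5", "6", NA),
  day        = c("17", NA, NA)
)
```

Records where the flag is FALSE could be reviewed or excluded before any date-based summary, depending on how much date precision a given analysis needs.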

Since this package currently downloads "all" fields, I hoped the fix might involve only your clean_iDigBio function and not your get_iDigBio function. Sadly, not all fields are returned when "all" fields are specified. Instead, you will need to specify which fields to download. From your code, I believe you want scientificname, lat/lon, coordinate uncertainty, catalognumber, UUID, and date. To obtain these fields, this is how you would modify the download:

# Fields to request explicitly from iDigBio
fields2get <- c("data.dwc:scientificName",
                "data.dwc:decimalLatitude",
                "data.dwc:decimalLongitude",
                "data.dwc:coordinateUncertaintyInMeters",
                "catalognumber",
                "uuid",
                "data.dwc:eventDate",
                "data.dwc:year",
                "data.dwc:month",
                "data.dwc:day")
idb_recs <- try_idb(type = "records", mq = FALSE, rq = rq, fields = fields2get,
                    max_items = 100000, limit = 0, offset = 0, sort = FALSE,
                    httr::config(timeout = timeout))

Additional modifications to clean_iDigBio will also be needed, since the dates downloaded here will not be in date format. Instead, all dates will be text strings. There are many ways to convert these to dates; for example, see gatoRs remove_duplicate function or ridigbio proposed solution here.
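One possible conversion, as a minimal base R sketch rather than the gatoRs or ridigbio approach: use eventDate when it starts with a full ISO date, otherwise fall back to year, month, and day when all three are present, and leave everything else as NA. The function name parse_event_date is hypothetical, the example records are made up, and the column names are assumed to match the requested field names.

```r
# Hypothetical helper: convert provider text fields to Date without inventing
# missing parts. Incomplete or unparseable dates become NA.
parse_event_date <- function(event_date, year = NA, month = NA, day = NA) {
  txt <- trimws(as.character(event_date))
  # Use the first 10 characters when eventDate starts with YYYY-MM-DD
  # (this also covers date-times such as "2010-07-04T09:30:00").
  iso <- ifelse(grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}", txt),
                substr(txt, 1, 10), NA_character_)
  # Fallback: build a date string only when year, month, and day all exist.
  y <- suppressWarnings(as.integer(year))
  m <- suppressWarnings(as.integer(month))
  d <- suppressWarnings(as.integer(day))
  parts <- ifelse(is.na(y) | is.na(m) | is.na(d), NA_character_,
                  sprintf("%04d-%02d-%02d", y, m, d))
  out <- ifelse(is.na(iso), parts, iso)
  # Impossible dates (e.g. February 30) come back as NA.
  as.Date(out, format = "%Y-%m-%d")
}

# Made-up records; column names are assumed to match the requested fields.
recs <- data.frame(
  "data.dwc:eventDate" = c("2001-05-17", "1998", NA, "2010-07-04T09:30:00"),
  "data.dwc:year"      = c("2001", "1998", "1975", NA),
  "data.dwc:month"     = c("5", NA, "3", NA),
  "data.dwc:day"       = c("17", NA, "9", NA),
  check.names = FALSE, stringsAsFactors = FALSE
)
recs$date <- parse_event_date(recs[["data.dwc:eventDate"]],
                              recs[["data.dwc:year"]],
                              recs[["data.dwc:month"]],
                              recs[["data.dwc:day"]])
```

Here the year-only record stays NA instead of receiving a made-up month and day. Whether to drop such records or keep them at year precision is a choice for clean_iDigBio.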

Hope this helps and please let me know if you have any questions or want more specific code suggestions.