Closed by dcarver1 7 months ago
After an initial review, parseGBIF seems to have some value. We would want to write some duplicate-checking functionality of our own in any case, as best we can.
It requires a manual download link to start its correction pipeline. That would have to change, since we are currently trying to avoid requiring users to create a GBIF account to download data, but we could fork parseGBIF and alter the data import path to start with a GBIF taxonID instead.
Additionally, there are other steps we would want to alter. For example, we trust our users to know exactly what taxa they are looking for, so automatic validation checks against the World Checklist of Vascular Plants are not helpful.
I'll get started on figuring out if we can alter the data import path without forking the entire library.
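For reference, a minimal sketch of what an account-free import path keyed on a taxon could look like. It is written in Python against GBIF's public occurrence search endpoint (the same one rgbif's `occ_search()` wraps) rather than in R, and it is not parseGBIF code. The function names, the `hasCoordinate` filter, and the `max_records` cap are assumptions made for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GBIF_SEARCH_URL = "https://api.gbif.org/v1/occurrence/search"
PAGE_SIZE = 300  # the search endpoint returns at most 300 records per request


def search_url(taxon_key, offset=0, limit=PAGE_SIZE):
    """Build one page request for occurrences of a taxon (no account needed)."""
    params = {
        "taxonKey": taxon_key,
        "hasCoordinate": "true",  # assumption: only georeferenced records are wanted
        "limit": limit,
        "offset": offset,
    }
    return f"{GBIF_SEARCH_URL}?{urlencode(params)}"


def fetch_json(url):
    """Default fetcher: plain HTTP GET, parsed as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def fetch_all(taxon_key, fetch=fetch_json, max_records=5000):
    """Page through the search results until GBIF reports the end.

    `fetch` is injectable so the paging logic can be exercised offline.
    `max_records` is a hypothetical safety cap, not a GBIF parameter.
    """
    records = []
    offset = 0
    while len(records) < max_records:
        page = fetch(search_url(taxon_key, offset))
        results = page.get("results", [])
        records.extend(results)
        if page.get("endOfRecords", True) or not results:
            break
        offset += len(results)
    return records[:max_records]
```

The search endpoint is intended for modest result sets, so very large taxa would likely still need a different route.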
Upon further investigation, although parseGBIF will not easily integrate into this project as a simple library import, it can still serve as inspiration for how to clean GBIF data. I am currently writing some simple rgbif API calls that incorporate some of the low-level cleaning that parseGBIF performs.
@dcarver1, we should talk with the partners about which "issues" are always unacceptable, so that we can filter them out of the GBIF occurrence data before the user starts manually spot-checking. These issues are listed here with some explanations/examples, but I need to find a specific code:value list, because the issue codes are not the easy-to-understand strings found in the link below.
Found the list of issues and codes. I think we should present these to the development partners (most specifically Collin) and determine which issues would NEVER be acceptable for a GAP analysis.
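Once the partners agree on a list, the pre-filter itself could be as small as the sketch below. It assumes records shaped like the raw GBIF API JSON, where `issues` is a list of long issue names (rgbif shows abbreviated codes instead, which its `gbif_issues()` table maps back to these names). The `ALWAYS_REJECT` set is a placeholder for discussion, not a decision.

```python
# Placeholder list: the real "never acceptable" set is for the partners to decide.
ALWAYS_REJECT = frozenset({
    "ZERO_COORDINATE",
    "COORDINATE_INVALID",
    "COORDINATE_OUT_OF_RANGE",
    "COUNTRY_COORDINATE_MISMATCH",
})


def split_by_issues(records, blocked=ALWAYS_REJECT):
    """Separate records into (kept, rejected) before any manual spot-checking.

    Each rejected entry carries the blocking issues that caused it to be
    dropped, so the filter stays auditable.
    """
    kept, rejected = [], []
    for record in records:
        hits = sorted(blocked.intersection(record.get("issues", [])))
        if hits:
            rejected.append((record, hits))
        else:
            kept.append(record)
    return kept, rejected
```

Keeping the rejected records and their reasons, rather than silently discarding them, would let us report to users how many occurrences were removed and why.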
I think I've absorbed all I can from parseGBIF. It won't directly integrate into this application, but if I use any of its validation I'll make sure to properly cite the authors as the inspiration. I may use the core concepts behind some of their duplicate-checking code in the future if we want that automated, but I believe that is a 'later feature'.
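If duplicate checking does become a feature, one simple starting point is to group records that share a collector and collection number. The sketch below is a generic version of that idea and should not be read as parseGBIF's actual algorithm. It assumes records shaped like the raw GBIF API JSON (`key`, `recordedBy`, `recordNumber`), and the normalization rule is a deliberately crude assumption.

```python
import re
from collections import defaultdict


def _normalize(value):
    """Lowercase and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", (value or "").lower())


def duplicate_key(record):
    """Return a (collector, number) key, or None if either part is missing."""
    collector = _normalize(record.get("recordedBy"))
    number = _normalize(record.get("recordNumber"))
    if not collector or not number:
        return None  # too little information to call anything a duplicate
    return (collector, number)


def group_duplicates(records):
    """Map each shared key to the GBIF occurrence keys that carry it."""
    groups = defaultdict(list)
    for record in records:
        key = duplicate_key(record)
        if key is not None:
            groups[key].append(record["key"])
    # Only keys seen more than once are candidate duplicate sets.
    return {key: keys for key, keys in groups.items() if len(keys) > 1}
```

This only flags candidates. Collector names are written inconsistently across herbaria, so the groups would still need review before anything is merged or dropped.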
Evaluate the parseGBIF R library
https://github.com/pablopains/parseGBIF?tab=readme-ov-file https://www.nature.com/articles/s41598-024-56158-3
What value does this add over the standard rgbif library?
Several other data-cleaning libraries are referenced in the publication. Do any of them seem to add value to the data generation workflow?
Any other notes and ideas