EDIorg / ecocomDP

A dataset design pattern and R package for ecological community data.
https://ediorg.github.io/ecocomDP/

script for long-to-wide for primary tables #51

Closed: mobb closed this issue 3 years ago

mobb commented 5 years ago

To help make datasets in this model easy to use, we should put the three primary tables together as a single wide dataset. Details TBD. It will need to include the IDs so that ancillary data can be joined on later (by the user, ad hoc).
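
A minimal sketch of the kind of join this implies, assuming the three primary tables are already in data frames named `observation`, `location`, and `taxon` with the ecocomDP key columns (`location_id`, `taxon_id`); column names may differ by dataset:

```r
library(dplyr)

# Join the three primary tables into one wide data frame,
# keeping the ids so ancillary tables can be joined on later.
wide <- observation %>%
  left_join(location, by = "location_id") %>%
  left_join(taxon, by = "taxon_id")

# Ancillary data can then be added ad hoc by the user, e.g.:
# wide <- left_join(wide, observation_ancillary, by = "observation_id")
```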

mobb commented 5 years ago

suggested by @vanderbi

clnsmth commented 4 years ago

Should we archive this wide dataset within an ecocomDP data package, @mobb and @vanderbi? Or how about providing this support as a function in the aggregate/reuse ecocomDP function set?

vanderbi commented 4 years ago

I looked at a couple of ecocomDP datasets yesterday to see how easy it would be for me to use these data for the ILTER biodiversity project. I believe I would find the datasets baffling if I didn't know how they were supposed to fit together. I suppose a baffled scientist could then download the data and fish around in the functions to figure out how to make something more understandable, but I doubt many would go that far.

mobb commented 4 years ago

> I looked at a couple of ecocomDP datasets yesterday to see how easy it would be for me to use these data for the ILTER biodiversity project. I believe I would find the datasets baffling if I didn't know how they were supposed to fit together. I suppose a baffled scientist could then download the data and fish around in the functions to figure out how to make something more understandable, but I doubt many would go that far.

Kristin's comment is related to #52.

mobb commented 4 years ago

The DwC-Archive format used by GBIF is one of the wide candidates. The simplest format is occurrence core (one table, denormalized), but a better fit for most of our data is event core, which is semi-normalized.
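
For reference, a toy illustration of the structural difference; the Darwin Core term names are real, but the rows are made up:

```r
# Occurrence core: one fully denormalized table.
occurrence_core <- data.frame(
  occurrenceID     = c("o1", "o2"),
  eventDate        = c("2014-06-01", "2014-06-01"),
  decimalLatitude  = c(45.1, 45.1),
  decimalLongitude = c(-89.7, -89.7),
  scientificName   = c("Daphnia pulex", "Bosmina longirostris"),
  organismQuantity = c(12, 3)
)

# Event core: an event table plus an occurrence extension keyed by eventID,
# so the sampling event is described once rather than on every row.
event <- data.frame(
  eventID          = "e1",
  eventDate        = "2014-06-01",
  decimalLatitude  = 45.1,
  decimalLongitude = -89.7
)
occurrence_extension <- data.frame(
  eventID          = c("e1", "e1"),
  occurrenceID     = c("o1", "o2"),
  scientificName   = c("Daphnia pulex", "Bosmina longirostris"),
  organismQuantity = c(12, 3)
)
```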

mobb commented 4 years ago

These are the things we expect scientists to need

mobb commented 3 years ago

Copying from duplicate issue #95: this mostly affects the functions in manipulate_tables.R. These join the required tables (observation, location, taxon), with the location table un-nested, and add latitude/longitude (the most detailed location available) to each row, so that each row includes, at minimum:

datetime, taxon, site name, latitude, longitude, and any variables.

Actions still to decide:

  1. Leave in the L1 identifiers? There are four: observation_id, event_id, site_id, taxon_id. We added the observation and event IDs, but taxon_id (and possibly site_id) come from the L0 file, and we add them only if none were present.
  2. Pivot values? Scientists generally seem more comfortable with wide tables than with long ones (a minimal pivot sketch follows this list). We may not need the wide table for GBIF (still TBD what we send them).
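
A minimal sketch of the pivot in item 2, assuming a joined long table `wide` (as in the sketch further up) that still carries the ecocomDP variable_name/value columns:

```r
library(tidyr)

# Pivot the long variable_name / value pair into one column per variable.
# Note that observation_id is dropped here: it is unique to each long row,
# so keeping it would prevent rows from collapsing (the tradeoff in item 1).
pivoted <- pivot_wider(
  wide,
  id_cols     = c(event_id, location_id, taxon_id, datetime,
                  taxon_name, latitude, longitude),
  names_from  = variable_name,
  values_from = value
)
```
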
mobb commented 3 years ago

Another idea, per @vanderbi: a wide table could be an L2 dataset that we make along with the L1. If so, it would need to contain as much of the L0 as possible. Still to decide, however, is how much pivoting we do and how much we leave to users; e.g., with all pivots applied, we might end up back at the L0.

clnsmth commented 3 years ago

flatten_data() does this.
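
A hedged usage sketch for anyone landing here later; the dataset identifier is a placeholder, and the exact shape of read_data()'s return value has varied across package versions:

```r
library(ecocomDP)

# Read an ecocomDP (L1) dataset from EDI, then flatten its long tables
# into a single wide table. "edi.xxx.x" is a placeholder identifier.
# Depending on the package version, the tables may sit at dataset$tables
# or dataset[[1]]$tables, and flatten_data() may accept the dataset object
# directly.
dataset <- read_data(id = "edi.xxx.x")
wide <- flatten_data(dataset$tables)
```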