Acanthiza / envClean

Clean biological data from large unstructured dataset(s)
https://acanthiza.github.io/envClean/

Conceptual rework of envClean #17

Open Acanthiza opened 7 months ago

Acanthiza commented 7 months ago
  1. bio_all start (x rows, n columns (cols defined in data_map))
  2. add_bins: taxonomic (rather than using filter_taxa, left_join to taxa$lutaxa), geographic (add_raster_cell) and temporal (year) (x rows, m (> n) columns)
    • save this as clean_start (or possibly it is made in envOcc)
    • there is no reason a few different bins couldn't be added here (e.g. 30 m, 90 m, 1 km, month, season, taxonomic hierarchy)
    • although it would get unwieldy having such a wide data set. Perhaps use bin lookups? (taxa$lutaxa already does that.)
    • This is starting to sound like some form of database... possibly via arrow?
  3. non-dependent filters (absences, state/city centroids, geographic range, date range) (y (< x) rows, m columns)*
  4. unique bins (z (< y) rows, cols = context cols (bin cols))
  5. reduce each required non-context attribute (via make_attribute) within each bin (each result with z rows, bin cols + 1)
    • I think this could be done independently for each required attribute, rather than sequentially, and then brought together at the end (as the bins define unique contexts, left_joining them all together should result in z rows)
    • So you'd have, per bin, the best guess at each attribute
    • reduce_geo_rel is an exception here but follows the same concept (an exception because it isn't run via make_attribute, due to the overrides; consider generalising reduce_geo_rel as a template for all make_attribute, e.g. include the over_ride concept in make_attribute?)
  6. reduced data set (z rows, o columns = bins plus attributes)
  7. dependent filters (e.g. fbd, taxa richness by environmental setting, NAs) (a (< z) rows, o columns)*
  8. final check (e.g. singletons, dplyr::distinct) (a or b (< a) rows, o columns)
    • clean_end
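
The steps above could be sketched roughly as below. This is only an illustration of the proposed flow: add_raster_cell(), make_attribute() and all column names (original_name, lat, long, taxa, cell, occ, etc.) are assumptions here, not confirmed envClean signatures; the dplyr/purrr/lubridate calls themselves are standard.

```r
library(dplyr)
library(purrr)

# 1-2. add bins: taxonomic (left_join to taxa$lutaxa), geographic and temporal.
#      add_raster_cell() and the column names are illustrative assumptions.
clean_start <- bio_all |>                            # x rows, n cols
  left_join(taxa$lutaxa, by = "original_name") |>    # taxonomic bin
  mutate(cell = add_raster_cell(lat, long),          # geographic bin
         year = lubridate::year(date))               # temporal bin

# 3. non-dependent filters: decided per record from its own columns
filtered <- clean_start |>
  filter(occ > 0,                 # drop absences
         !centroid_flag,          # state/city centroids
         year >= min_year)        # date range

# 4. unique bins (z rows, bin cols only)
bins <- distinct(filtered, taxa, cell, year)

# 5. best guess at each required attribute per bin, each table z rows
attrs <- map(c("lifeform", "cover"),
             \(a) make_attribute(filtered, bins, attribute = a))

# 6. reduced data set: bins plus attributes, still z rows
clean_reduced <- reduce(attrs, left_join, by = c("taxa", "cell", "year"))
```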

* I'm struggling to conceptually isolate these filters. It is something to do with the information in the 'non-dependent' columns not being carried through to the 'dependent' filter stage. That is, the non-dependent filter step needs to deal with filters that are the same for any record (say, it was collected before a minimum year, or the record sits at the centre of Adelaide) and/or that use information we want to lose before reducing (say, we want to exclude a particular survey). In contrast, the dependent filters can only be applied once the records are reduced to their bins (particularly fbd, as it relies on standardised taxonomy; or, say, bins that are exceptionally rich in taxa compared to their environmental peers) and/or they involve columns that are required in downstream analysis (and therefore can't be removed, say, rel_metres_adj in envPIA).
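
One way to frame the split: a non-dependent filter is a row-wise predicate a record can satisfy (or not) on its own, while a dependent filter is group-wise over the reduced bins. A minimal sketch, where every column name and threshold is an assumed placeholder:

```r
library(dplyr)

# Non-dependent: each record kept or dropped from its own columns,
# before any reduction (column names are illustrative).
non_dep <- records |>
  filter(year >= min_year,                 # collected before a minimum year
         !centroid_flag,                   # e.g. centre of Adelaide
         survey != "survey_to_exclude")    # information lost before reducing

# Dependent: only meaningful once records are reduced to unique bins,
# e.g. dropping cells exceptionally rich in taxa relative to their peers.
dep <- reduced |>
  add_count(cell, name = "richness") |>    # taxa records per cell
  filter(richness <= richness_threshold)
```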