CIFOR Data Ingestion - GitHub Issues

jaxinewolfe commented 1 year ago

Issue for addressing general or shared questions about the ingestion of the CIFOR database

jaxinewolfe commented 1 year ago

I have created a CIFOR folder in primary_studies – we can use this to stash all the original data we've downloaded from the database. Hook scripts will likely be separate, unless we find some necessary overlap (ex. similar structure in data types or shared sampling location).

I have also added a function to curation_functions called readExcelWorkbook(path). If you provide this function with a path to an xlsx file with multiple sheets, it will read everything into a list object. Each list element is one sheet of that workbook, which you can access using the sheet name in double brackets, ex. result_list[["Name of Sheet"]], or with a numeric index, ex. result_list[[1]].
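For anyone curious what a helper like this might look like, here is a minimal sketch, assuming the readxl package. The name read_workbook_sketch is hypothetical, and this is not necessarily how readExcelWorkbook() is actually written. The second half builds a small stand-in list (made-up sheet names and values) just to show the two ways of pulling out a sheet.

```r
# Hypothetical stand-in for readExcelWorkbook(); assumes readxl is installed.
read_workbook_sketch <- function(path) {
  sheet_names <- readxl::excel_sheets(path)  # names of all sheets in the workbook
  sheets <- lapply(sheet_names, function(s) readxl::read_excel(path, sheet = s))
  names(sheets) <- sheet_names               # name each list element after its sheet
  sheets
}

# Stand-in for a returned list (made-up sheets, for illustration only)
result_list <- list(
  "Name of Sheet" = data.frame(plot_id = c("A", "B")),
  "Another Sheet" = data.frame(depth_cm = c(10, 20, 30))
)

by_name  <- result_list[["Name of Sheet"]]  # access by sheet name
by_index <- result_list[[1]]                # access by position
identical(by_name, by_index)                # both point at the same sheet
```

Accessing by name is probably the safer habit, since sheet order can differ between workbooks.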

jaxinewolfe commented 1 year ago

Ok y'all, I changed my mind 🤠 The synthesis functions are written! After some more consideration, I think the data is standardized enough to pull it off, and it will save us the tedium of writing many individual hook scripts. That being said, thanks for all the work you've put in so far on this :) It's been a huge effort just figuring out how to slice and dice this database ingestion. I expect that any curation you've done for specific datasets should be transferable to the aggregated dataset, either specifically or in a generalized fashion. So!

I wrote two similar functions that work with the two different kinds of formats we've encountered. You can source them from _scripts/1_data_formatting/cifor_utilityfunctions.R. The only argument to specify is what data type you are looking to synthesize: soil, vegetation, or necromass. Starting with soil for now since we're still hashing out the CCN biomass structure.

The first is synthSWAMP(), which reads all of the xlsx workbooks that are formatted according to the SWAMP data structure and merges them into one list. Once this function is run, you can disassemble the resulting list to curate the individual data tables. @BettsH since you have been developing the function to recode the reference variables in these tables, would you mind taking on the curation for data in this format?
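To illustrate the merge-then-disassemble idea in general terms, here is a rough base R sketch. It is not the synthSWAMP() implementation; the sheet names, column names, and values are made up, and it assumes every workbook has the same sheets with identical columns.

```r
# Two made-up "workbooks", each a named list of sheets (data frames).
# Sheet and column names are hypothetical, not the real SWAMP structure.
wb1 <- list(Site = data.frame(site_id = "S1", country = "X"),
            Soil = data.frame(site_id = "S1", depth_cm = c(10, 20)))
wb2 <- list(Site = data.frame(site_id = "S2", country = "Y"),
            Soil = data.frame(site_id = "S2", depth_cm = c(10, 20, 30)))
workbooks <- list(wb1, wb2)

# Stack same-named sheets across workbooks into one list of tables
# (assumes each sheet has identical columns in every workbook)
sheet_names <- names(workbooks[[1]])
merged <- lapply(sheet_names, function(s) {
  do.call(rbind, lapply(workbooks, function(wb) wb[[s]]))
})
names(merged) <- sheet_names

# "Disassemble" the list to curate the tables individually
sites <- merged[["Site"]]
soil  <- merged[["Soil"]]
```

If the columns differ between workbooks, plain rbind() will error, so a more forgiving row-binding approach would be needed in that case.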

The next is synthCIFOR(), which reads all of the xlsx workbooks that are formatted according to...whatever the other data structure is (I just call it "alternative" lol). @cheneyr you've been working with these data quite a bit and finding connections between the veg and soil tables. Could you work on the curation of data in this format? Note: the coordinates will definitely need some resolution; we can talk about that Mon.

Try 'em out and see how they work for you! I tried to include documentation, but I'm also happy to walk through them together if you'd like to dig into the structure and function a bit more. In the CIFOR_docs folder, I've included an _alt_datastructure.csv and a _swamp_datastructure.csv. For the data structure you're working with, if you manually fill in an additional column with the corresponding CCN attributes, you can write a function that reads it in and performs the column renaming accordingly. (Or you can rename everything in the hook script - up to you!)
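A lookup-based rename could go roughly like the sketch below. Everything here is an assumption for illustration: the lookup column names (original_name, ccn_attribute), the example attribute values, and the function name rename_from_lookup are all hypothetical, and the real data structure csv files may be laid out differently.

```r
# Stand-in for a filled-in data structure table. The column names
# original_name and ccn_attribute are hypothetical, as are the values.
# In practice this might come from read.csv() on the filled-in csv.
lookup <- data.frame(
  original_name = c("Site ID", "Depth (cm)", "Bulk density"),
  ccn_attribute = c("site_id", "depth_max", NA),
  stringsAsFactors = FALSE
)

rename_from_lookup <- function(df, lookup) {
  idx <- match(names(df), lookup$original_name)  # position of each column in the lookup
  new_names <- lookup$ccn_attribute[idx]
  keep_old <- is.na(new_names)                   # no mapping filled in: keep original name
  new_names[keep_old] <- names(df)[keep_old]
  names(df) <- new_names
  df
}

# Made-up raw table with the original column names
raw <- data.frame(`Site ID` = "S1", `Depth (cm)` = 10, `Bulk density` = 1.2,
                  check.names = FALSE)
curated <- rename_from_lookup(raw, lookup)
names(curated)
```

Columns without a filled-in attribute keep their original names here, which makes it easy to spot what still needs mapping.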

Take your time exploring the datasets - testing, plotting and mapping (we'll definitely want data viz reports for these). And lastly, we'll need to keep track of all the associated publications (which will end up helping us figure out which datasets are related). Let me know if all of that sounds good!

jaxinewolfe commented 1 year ago

Hey @cheneyr and @BettsH! I am doing practice runs for the synthesis update today so we can see what we're working with and make any necessary tweaks to hook scripts. Could you output your curated CIFOR data into separate derivative folders within the CIFOR folder (ex. CIFOR/derivative_SWAMP and CIFOR/derivative_ALT)? I know the bibliographies and biomass data might not be perfect, but that's totally ok.

Smithsonian / CCN-Data-Library

CIFOR Data Ingestion #84