PecanProject / pecan

The Predictive Ecosystem Analyzer (PEcAn) is an integrated ecological bioinformatics toolbox.
www.pecanproject.org

Enable reading of Ameriflux/Fluxnet BADM sheets #350

Open mdietze opened 9 years ago

mdietze commented 9 years ago

Migration of issue from old Redmine site (RM#2145)

Ankur's original description:

Ameriflux has been slowly publishing the revised Biological, Ancillary, Disturbance, and Metadata (BADM) sheets, which are hopefully more machine readable but are still distributed as Excel files. The website is http://ameriflux.lbl.gov/AmeriFluxSites/Pages/BADM-Site.aspx

In terms of tasks, I think this involves:

1. Getting all the useful variables defined in BETY (there are hundreds, including some like "manufacturer of your sonic anemometer" which we probably don't need).
2. Describing the format in the database - I propose that we simply convert the site-specific BADM XLS files to CSV.
3. Extending BETY or PEcAn to read these BADM sheets to update priors and parameters. Do we store the entire CSV file somehow in BETY, or write a script to harvest values whenever we get a new CSV to be inserted into BETY? (A sketch of reading such a CSV follows below.)
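As a starting point for task 2/3, here is a minimal sketch of parsing a site-specific BADM sheet once it has been exported to CSV. The column names (`GROUP_ID`, `VARIABLE`, `DATAVALUE`) and the example variable codes are assumptions and may need adjusting to match the actual BADM export:

```r
# Sketch: convert a site-specific BADM sheet (exported to CSV) into a long
# table of variable/value pairs, keeping only the variables of interest.
# Column names and variable codes here are assumptions about the BADM format.
read_badm_csv <- function(file, keep = c("LAI_TOT", "AG_BIOMASS_TREE")) {
  badm <- read.csv(file, stringsAsFactors = FALSE)
  badm <- badm[badm$VARIABLE %in% keep, c("GROUP_ID", "VARIABLE", "DATAVALUE")]
  # DATAVALUE mixes numbers, dates, and free text; keep a numeric copy where possible
  badm$VALUE_NUM <- suppressWarnings(as.numeric(badm$DATAVALUE))
  badm
}

# Hypothetical usage:
# lai <- read_badm_csv("US-Ha1_BADM.csv", keep = "LAI_TOT")
```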

Please correct me on what the best approach is here, and excuse my poor use of terminology for the various systems. Another issue is that many of the values are "summary" values (like site LAI or total biomass in the flux footprint) rather than specific to a species, so how does that fit into the schema (site or PFT)? I also suspect that these sheets will be updated by each site at some unpredictable frequency and interval - there doesn't seem to be a designated place to download them yet, but eventually it would be good to have a script that checks for updated BADM before running at a site.

Why this matters: the trait and site observations in the BADM are a gold mine for constraining the models - a few observations of a variable like LAI or total organic soil C are worth thousands of flux-tower NEE observations. The challenges are the non-standard frequency, non-standard measurement protocols, and lack of sub-site-level data. But some high-frequency data, like chamber-flux soil respiration, will be here too.

Anyway, this seems like the time to start this discussion, as Fluxnet is finally close to finalizing this.

Mike's Response:

For tasks, yes those three sound right, with the third being the bulk of the work.

The biggest question here seems to be whether to extract all possible info from these files and insert it into the traits table, or to access information from these files "on the fly" as needed.

The second question, what to do about file updates, follows naturally from the first. If we access these files on the fly, then updating versions is simple. If we insert the data into the database, then with each upload we need to store some sort of log that would allow all lines that were added to be deleted when the file is updated; from the updated file we would then insert the new values. A sketch of what that bookkeeping might look like is below.
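To make the comparison concrete, here is a rough sketch of the bookkeeping the database route would require, assuming a DBI (e.g. RPostgres) connection; the `badm_uploads` table and the simplified `traits` columns used here are hypothetical:

```r
library(DBI)

# Sketch: make a BADM database upload reversible by logging which trait rows
# came from which file.  Table and column names here are assumptions.
replace_badm_upload <- function(con, badm_file, new_rows) {
  fname <- basename(badm_file)
  # drop everything that came from the previous version of this file
  DBI::dbExecute(con,
    "DELETE FROM traits WHERE id IN
       (SELECT trait_id FROM badm_uploads WHERE filename = $1)",
    params = list(fname))
  DBI::dbExecute(con, "DELETE FROM badm_uploads WHERE filename = $1",
    params = list(fname))
  # insert the new values, logging each new id against the file name
  for (i in seq_len(nrow(new_rows))) {
    id <- DBI::dbGetQuery(con,
      "INSERT INTO traits (site_id, variable_id, mean) VALUES ($1, $2, $3) RETURNING id",
      params = unname(as.list(new_rows[i, c("site_id", "variable_id", "mean")])))$id
    DBI::dbExecute(con, "INSERT INTO badm_uploads (filename, trait_id) VALUES ($1, $2)",
      params = list(fname, id))
  }
}
```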

Based on the pain of doing the second case, I lean towards reading files on the fly. However, this will require either an update to read.trait.data to do this, or the addition of a module after read.trait.data that would update the data structure before the meta-analysis (sketched below). The other reasons I lean towards this approach are that:

1. we may add new variables at a later date, which is easier to deal with if we don't have to dump/reload all BADM;
2. there may be data that are closer to raw observations than summary stats, which should be treated differently in the meta-analysis than the literature data;
3. not all data may make sense as a 'trait' -- I could see a good chunk of this data being used in site-specific calibration & validation.

The argument FOR doing the extraction up front is that you would have to read and parse ALL of the BADM files for a biome on the fly for things like the meta-analysis; however, now that we're no longer re-running the meta-analysis every time the model is run, this seems like much less overhead than it used to be.
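A rough sketch of that post-read.trait.data module, assuming trait.data is a named list of data frames (one per trait) with site_id/mean/n columns, and using the BADM columns from the read_badm_csv sketch above; the variable-to-trait mapping is an assumption:

```r
# Sketch: append site-level BADM observations to the trait data returned by
# read.trait.data, so the meta-analysis sees them as additional data points.
# Assumes each trait.data entry is a data frame with site_id, mean, n columns.
append_badm_traits <- function(trait.data, badm, site_id,
                               var_map = c(LAI_TOT = "LAI")) {
  for (badm_var in names(var_map)) {
    trait <- var_map[[badm_var]]
    obs <- badm[badm$VARIABLE == badm_var & !is.na(badm$VALUE_NUM), ]
    if (nrow(obs) == 0) next
    new_rows <- data.frame(site_id = site_id, mean = obs$VALUE_NUM, n = 1)
    # append as additional observations for this trait (or start a new entry)
    trait.data[[trait]] <- rbind(trait.data[[trait]], new_rows)
  }
  trait.data
}
```

The same function could be called with either cached CSVs or freshly downloaded BADM files, which keeps the on-the-fly option compatible with whatever update-checking script we add later.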

Now, for the sake of getting started, the first two tasks, and the need to write code to parse these files for the third, will be identical regardless of whether we parse the files every time or just on updates.

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 365 days with no activity.