SoilBGC-Datashare / sidb

Soil Incubation Database sidb
https://soilbgc-datashare.github.io/sidb/
MIT License
16 stars 10 forks source link

Possible template update #2

Closed jb388 closed 5 years ago

jb388 commented 5 years ago

The current version of our template has multiple structural and syntactic inconsistencies, which makes both data analysis and entry challenging.

As an example of structural inconsistency: the variables "temperature" and "moisture" can be found at three different possible levels within the database, with varying degrees of duplication.

As an example of syntactic inconsistency, the variable "landCover" can also be entered as an experimental treatment if it varies across sites, and when entered at that level of the database it may have a unique name that would prevents querying the database for this variable effectively.

I propose the following changes: 1) Adopt a rule for the data in the incubationInfo section that data is ONLY entered here if it is consistent across the entire study. Variables that vary (e.g. laboratory treatments) are entered at the variable level of the database (i.e. for each individual timeseries) already, so it is unnecessary and confusing to enter the data here. 2) Unnest any constant variables that are currently nested in the sublist "laboratoryTreatment" within the incubationInfo section. Defining laboratoryTreatment as a separate list is unnecessarily complicated and leads to structural inconsistency. 3) Reform the siteInfo section so all data has to have the same shape as the site array, i.e. move depth, permafrost, and experimental treatment info to the incubationInfo section. 4) Simplify experimentalTreatment to a single categorical response. Timeseries from studies with multiple treatment levels specify the treatment level explicitly in the variables section, and with only a simple categorical response it will be easier to group similar studies, e.g. "Harvard Forest experimental warming" and "fieldWarming" could both be called "fieldWarming". 5) If the above changes are adopted, consider renaming incubationInfo experimentInfo 6) Consider either reformatting the siteInfo section or changing it to a tabular form for ease of data entry.

I spent a couple days trying to do some simple reports and queries, and found it required a great deal of data manipulation---not at all user friendly. Hence the proposed changes.

Updating the templates isn't difficult, and I have already done so with an offline set of templates as a proof of concept, and written some query/report scripts.

However, if we do decide to adopt these changes it will affect the code that has already been written, as well as require some minor changes to the description of the template in the manuscript, so I realize it's not an easy decision.

Perhaps it's just me, so If you haven't done so, I would encourage you to try doing some data analysis across all studies with the current version of the database and see what you think.

CaitlinPries commented 5 years ago

Jeff,

I see some issues with what you are proposing.

1,3: First, if you want to make incubationInfo only for data that is consistent across the entire study, then why would you move depth and experimental treatment to the incubationInfo section as those are variables that are not always consistent across the entire study?

Second, I don’t see what the issue is with redundancy between the variables section and incubation info section. I was thinking by filling it out, people would better know which subheadings they need to include with each variable. Maybe if we had a separate section that was called treatmentInfo that gave arrays for all variables being tested in the study that are not site related? Or see experimentalTreatment thought below.

2: I agree that constant variables should be unnested.

experimentalTreatment may still have to have multiple responses if multiple factors are being tested. So it may still have to be an array, such as fieldWarming and depth, with the levels for each factor specified in variables.

We want to keep the yaml file, so turning siteInfo to a table does not seem like a good option to me. We already have two table (excel) files. I don’t think we want more.

So overall, if we need to make a change, I think a new category of all the treatment factors would be useful. But I also don not see an issue with then including the levels of each of those as a sub array so that a quick look of the data shows all the treatments. Of course, I say this without having played with the data as much as you have, Jeff.

Best, Caitlin

Caitlin Hicks Pries Assistant Professor of Biological Sciences Dartmouth College Life Sciences Center, room 349 78 College St. Hanover, NH 03755

603-646-2052 http://sites.dartmouth.edu/hicksprieslab/ @Carbon_Cait on Twitter

From: Jeff B notifications@github.com Reply-To: SoilBGC-Datashare/sidb reply@reply.github.com Date: Tuesday, June 25, 2019 at 1:04 PM To: SoilBGC-Datashare/sidb sidb@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [SoilBGC-Datashare/sidb] Possible template update (#2)

The current version of our template has multiple structural and syntactic inconsistencies, which makes both data analysis and entry challenging.

As an example of structural inconsistency: the variables "temperature" and "moisture" can be found at three different possible levels within the database, with varying degrees of duplication.

As an example of syntactic inconsistency, the variable "landCover" can also be entered as an experimental treatment if it varies across sites, and when entered at that level of the database it may have a unique name that would prevents querying the database for this variable effectively.

I propose the following changes:

  1. Adopt a rule for the data in the incubationInfo section that data is ONLY entered here if it is consistent across the entire study. Variables that vary (e.g. laboratory treatments) are entered at the variable level of the database (i.e. for each individual timeseries) already, so it is unnecessary and confusing to enter the data here.
  2. Unnest any constant variables that are currently nested in the sublist "laboratoryTreatment" within the incubationInfo section. Defining laboratoryTreatment as a separate list is unnecessarily complicated and leads to structural inconsistency.
  3. Reform the siteInfo section so all data has to have the same shape as the site array, i.e. move depth, permafrost, and experimental treatment info to the incubationInfo section.
  4. Simplify experimentalTreatment to a single categorical response. Timeseries from studies with multiple treatment levels specify the treatment level explicitly in the variables section, and with only a simple categorical response it will be easier to group similar studies, e.g. "Harvard Forest experimental warming" and "fieldWarming" could both be called "fieldWarming".
  5. If the above changes are adopted, consider renaming incubationInfo experimentInfo
  6. Consider either reformatting the siteInfo section or changing it to a tabular form for ease of data entry.

I spent a couple days trying to do some simple reports and queries, and found it required a great deal of data manipulation---not at all user friendly. Hence the proposed changes.

Updating the templates isn't difficult, and I have already done so with an offline set of templates as a proof of concept, and written some query/report scripts.

However, if we do decide to adopt these changes it will affect the code that has already been written, as well as require some minor changes to the description of the template in the manuscript, so I realize it's not an easy decision.

Perhaps it's just me, so If you haven't done so, I would encourage you to try doing some data analysis across all studies with the current version of the database and see what you think.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FSoilBGC-Datashare%2Fsidb%2Fissues%2F2%3Femail_source%3Dnotifications%26email_token%3DABXVALV5L6HKNCEQIHKO4RLP4JF2RA5CNFSM4H3KM74KYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4G3TITZA&data=02%7C01%7Ccaitlin.pries%40dartmouth.edu%7Cb6e05ed4af1a40490cff08d6f98f36b7%7C995b093648d640e5a31ebf689ec9446f%7C0%7C0%7C636970790828905745&sdata=W9UXBvNh2vlxNeqQoM5uGC3TukK8qpjOrx7Z1la%2BNEQ%3D&reserved=0, or mute the threadhttps://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FABXVALXHQIGFSQ7U7IZHMFLP4JF2RANCNFSM4H3KM74A&data=02%7C01%7Ccaitlin.pries%40dartmouth.edu%7Cb6e05ed4af1a40490cff08d6f98f36b7%7C995b093648d640e5a31ebf689ec9446f%7C0%7C0%7C636970790828905745&sdata=Mgjm45V6RzHB4vLoSUtvXFYP4E0%2BnaTXqBtBJjant8s%3D&reserved=0.

jb388 commented 5 years ago

@CaitlinPries Thanks for chiming in.

Just to clarify, moving depth to the incubationInfo section doesn't imply that it is always constant. I worded that poorly. However, if depth does vary (i.e. multiple depth increments incubated), than depth will be reported at the variables level, and it's redundant to explicitly list higher up in the hierarchy. Personally I think it could be potentially confusing to list the depth increments at the site level, as the array is not the same size as the other siteInfo fields, and more importantly it's redundant since the data are repeated at the variable level.

I agree that the proposed change to experimentalTreatment is not great---good point that what proposed wouldn't allow for multiple experimentalTreaments, e.g. a factorial design. So rescind that!

I also agree with keeping the .yaml file---that's why I proposed updating the structure so it is a bit easier to work with. So fine with me to keep the siteInfo in .yaml rather than .csv format.

aahoyt commented 5 years ago

Just to clarify - Re: "if you want to make incubationInfo only for data that is consistent across the entire study, then why would you move depth and experimental treatment to the incubationInfo section as those are variables that are not always consistent across the entire study? "

I think what Jeff meant is that incubationInfo includes variables that could either be constant or vary. If they vary, the value of the field would be filled in under the variable section, rather than under the incubationInfo section. If they don't vary, the value is entered under the incubationInfo section. So, for example, "depth: " would be left blank under incubationInfo if there were multiple depths, and then filled in under the variables.

Do you still think that would be problematic?

jb388 commented 5 years ago

One more note about including data in sublevels: the main issue is that data can be at either the variable level OR at the incubationInfo level OR at the sublevel of incubationInfo. This adds extra steps when coding any queries, and additionally, can lead to errors if there are inconsistencies between what was entered at different levels.

CaitlinPries commented 5 years ago

That totally makes sense and I agree about keeping the site level arrays all the same length and thus removing depth to incubationInfo where it, like temp or moisture, would be at the study level unless it was a tested factor.

What do you think about a Treatment array, maybe under incubationInfo, that includes all the factors that change in a study that are NOT site? This could be temp, depth, moisture, field treatment, etc? Then the factors listed under that arra

I know it is redundant to have the treatments listed under variables and in the information section, but I think that is a benefit and will help people clarify the fields that need to be included for each variable. I also think that may make it more query-able. Like to be able to quickly see what factors were tested under the Treatment?

=================================== Caitlin Hicks Pries Assistant Professor of Biological Sciences Dartmouth College Life Sciences Center, room 349 78 College St. Hanover, NH 03755

603-646-2052 http://sites.dartmouth.edu/hicksprieslab/ @Carbon_Cait on Twitter

From: Jeff B notifications@github.com Reply-To: SoilBGC-Datashare/sidb reply@reply.github.com Date: Tuesday, June 25, 2019 at 4:43 PM To: SoilBGC-Datashare/sidb sidb@noreply.github.com Cc: Caitlin Pries caitlin.pries@dartmouth.edu, Mention mention@noreply.github.com Subject: Re: [SoilBGC-Datashare/sidb] Possible template update (#2)

One more note about including data in sublevels: the main issue is that data can be at either the variable level OR at the incubationInfo level OR at the sublevel of incubationInfo. This adds extra steps when coding any queries, and additionally, can lead to errors if there are inconsistencies between what was entered at different levels.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FSoilBGC-Datashare%2Fsidb%2Fissues%2F2%3Femail_source%3Dnotifications%26email_token%3DABXVALRSJBPCEIZEIBOS2F3P4J7O3A5CNFSM4H3KM74KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYRREQQ%23issuecomment-505614914&data=02%7C01%7Ccaitlin.pries%40dartmouth.edu%7C73b8391161dc43f5f54408d6f9adc5c4%7C995b093648d640e5a31ebf689ec9446f%7C0%7C0%7C636970922076221942&sdata=IGN09rfUEi623o62Q4ZHr3%2BPQZIbiHkyfmEqWVDW4Qs%3D&reserved=0, or mute the threadhttps://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FABXVALXGOBL4S4CSJFDOACDP4J7O3ANCNFSM4H3KM74A&data=02%7C01%7Ccaitlin.pries%40dartmouth.edu%7C73b8391161dc43f5f54408d6f9adc5c4%7C995b093648d640e5a31ebf689ec9446f%7C0%7C0%7C636970922076231950&sdata=MPCkWS223I5rbXUBR7p%2FmRdRwDT9N3OCE3cZdfjRV5U%3D&reserved=0.

aahoyt commented 5 years ago

Also, another idea for the data that is listed with a bunch of tick marks in the site info section - Isn't the initConditions file already organized by site? If so (in future templates at least), could we just put that info as a few more columns in the initConditions file? I think that would be less error-prone than all the tick-marks and isn't a big change.

CaitlinPries commented 5 years ago

Hmm . . . not exactly as initConditions can also change by experimental (field treatment) and by depth. So by more than just site. Which may be one reason that depth and experimental treatment were originally in siteInfo to begin with.

=================================== Caitlin Hicks Pries Assistant Professor of Biological Sciences Dartmouth College Life Sciences Center, room 349 78 College St. Hanover, NH 03755

603-646-2052 http://sites.dartmouth.edu/hicksprieslab/ @Carbon_Cait on Twitter

From: Alison Hoyt notifications@github.com Reply-To: SoilBGC-Datashare/sidb reply@reply.github.com Date: Tuesday, June 25, 2019 at 4:47 PM To: SoilBGC-Datashare/sidb sidb@noreply.github.com Cc: Caitlin Pries caitlin.pries@dartmouth.edu, Mention mention@noreply.github.com Subject: Re: [SoilBGC-Datashare/sidb] Possible template update (#2)

Also, another idea for the data that is listed with a bunch of tick marks in the site info section - Isn't the initConditions file already organized by site? If so (in future templates at least), could we just put that info as a few more columns in the initConditions file? I think that would be less error-prone than all the tick-marks and isn't a big change.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FSoilBGC-Datashare%2Fsidb%2Fissues%2F2%3Femail_source%3Dnotifications%26email_token%3DABXVALRX7T7OPKCHTWHVHELP4J76PA5CNFSM4H3KM74KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYRRPQA%23issuecomment-505616320&data=02%7C01%7Ccaitlin.pries%40dartmouth.edu%7Cb8caa2ff8e044ef4359108d6f9ae5b2f%7C995b093648d640e5a31ebf689ec9446f%7C0%7C0%7C636970924585714681&sdata=8nF7eJ4US6%2B5T8qNM1T4fnfao66SKfwHC9iUc83YbM4%3D&reserved=0, or mute the threadhttps://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FABXVALW2FVWDS3IV5UPI2T3P4J76PANCNFSM4H3KM74A&data=02%7C01%7Ccaitlin.pries%40dartmouth.edu%7Cb8caa2ff8e044ef4359108d6f9ae5b2f%7C995b093648d640e5a31ebf689ec9446f%7C0%7C0%7C636970924585724694&sdata=GW2pqUSfLsMfHEYfGfGxndB8AKBIv2Sy42%2FaImDmFos%3D&reserved=0.

aahoyt commented 5 years ago

hmm... in that case are we missing a critical link since we don't have depth as a column in initConditions.csv? So, if you have two rows with the same site and different depths, how to do you link them to the appropriate time series?

CaitlinPries commented 5 years ago

Should we move this to slack to save people’s inboxes?

But, if depth changes, it is a column in initConditions. See Salome.

=================================== Caitlin Hicks Pries Assistant Professor of Biological Sciences Dartmouth College Life Sciences Center, room 349 78 College St. Hanover, NH 03755

603-646-2052 http://sites.dartmouth.edu/hicksprieslab/ @Carbon_Cait on Twitter

From: Alison Hoyt notifications@github.com Reply-To: SoilBGC-Datashare/sidb reply@reply.github.com Date: Tuesday, June 25, 2019 at 4:54 PM To: SoilBGC-Datashare/sidb sidb@noreply.github.com Cc: Caitlin Pries caitlin.pries@dartmouth.edu, Mention mention@noreply.github.com Subject: Re: [SoilBGC-Datashare/sidb] Possible template update (#2)

hmm... in that case are we missing a critical link since we don't have depth as a column in initConditions.csv? So, if you have two rows with the same site and different depths, how to do you link them to the appropriate time series?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FSoilBGC-Datashare%2Fsidb%2Fissues%2F2%3Femail_source%3Dnotifications%26email_token%3DABXVALWL6L2HDRBO6TAXOB3P4KAZ3A5CNFSM4H3KM74KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYRSEAA%23issuecomment-505618944&data=02%7C01%7Ccaitlin.pries%40dartmouth.edu%7C95ab933245a04af18a8108d6f9af6019%7C995b093648d640e5a31ebf689ec9446f%7C0%7C0%7C636970928963556632&sdata=lPY473cRbc1G%2Bky37EbA2Kv34JTvl8jsMZl8qNooIkc%3D&reserved=0, or mute the threadhttps://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FABXVALT6UIM7U4EF7FXUXGTP4KAZ3ANCNFSM4H3KM74A&data=02%7C01%7Ccaitlin.pries%40dartmouth.edu%7C95ab933245a04af18a8108d6f9af6019%7C995b093648d640e5a31ebf689ec9446f%7C0%7C0%7C636970928963556632&sdata=ywDK%2Bqi6aYwC028rLmOJ4Yohzq43AWDx6jdACkCZiJQ%3D&reserved=0.

jb388 commented 5 years ago

@CaitlinPries I see what you mean about the benefits of a "treatment" array. I think that would be fine as long as any data that are reported there are also reported at the variable level, and there isn't a third place to report it!

@aahoyt Good point about the link with depth. That's important. Also reminds me of what we were talking about earlier about the lack of a link between timeseries of mean values and sd/se---i.e. no clean way to associate a mean value with its associated sd/se.

aahoyt commented 5 years ago

also, probably not urgent, but another link that is currently missing is that there is no way to link a timeseries of the CO2 values to the corresponding time series of d13C or SD values from the same incubation jar, besides any author-specific naming convention.

jb388 commented 5 years ago

Looks like "midDepth" is reported in several of the initCond files, not just Salome.

aahoyt commented 5 years ago

Thanks for pointing that out! I thought the initConditions file was fixed. Are there also examples where people had fieldTreatment columns there?

jb388 commented 5 years ago

Just one more note, I've rethought it a bit, and I think permafrost info should stay in the siteInfo section. Certainly possible for some sites in one study to have permafrost and some not to (or at least different active layer depths), so why shouldn't that array also be the same dimension as the site array?

Seems like these potential template changes might be a good topic for a conference call.

CaitlinPries commented 5 years ago

Good point. I agree that this is worth a conference call. Jeff, can you maybe send out a yaml with your proposed changes? I could then incorporate my idea or you could add the treatmentInfo section. It should look like this below, but be broadened to also what is currently under experimental treatment.

laboratoryTreatment: # List of laboratory treatments applied to the incubations (e.g., temperature, moisture, amendments, etc.) treatment: # Replace with name of treatment value: # Numerical or factor

ALSO, the value of the above is that it includes units for the treatments themselves, which are not included under variables!!!

=================================== Caitlin Hicks Pries Assistant Professor of Biological Sciences Dartmouth College Life Sciences Center, room 349 78 College St. Hanover, NH 03755

603-646-2052 http://sites.dartmouth.edu/hicksprieslab/ @Carbon_Cait on Twitter

From: Jeff B notifications@github.com Reply-To: SoilBGC-Datashare/sidb reply@reply.github.com Date: Tuesday, June 25, 2019 at 5:06 PM To: SoilBGC-Datashare/sidb sidb@noreply.github.com Cc: Caitlin Pries caitlin.pries@dartmouth.edu, Mention mention@noreply.github.com Subject: Re: [SoilBGC-Datashare/sidb] Possible template update (#2)

Just one more note, I've rethought it a bit, and I think permafrost info should stay in the siteInfo section. Certainly possible for some sites in one study to have permafrost and some not to (or at least different active layer depths), so why shouldn't that array also be the same dimension as the site array?

Seems like these potential template changes might be a good topic for a conference call.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FSoilBGC-Datashare%2Fsidb%2Fissues%2F2%3Femail_source%3Dnotifications%26email_token%3DABXVALQY55YIGQQ3WRMKHGDP4KCGLA5CNFSM4H3KM74KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYRTC4Q%23issuecomment-505622898&data=02%7C01%7Ccaitlin.pries%40dartmouth.edu%7Cf888bd7ce48f4e9533a508d6f9b10829%7C995b093648d640e5a31ebf689ec9446f%7C0%7C0%7C636970936074675604&sdata=BTYj5hn3NcLnjFRTFkU61UZ2Zuzl6mZboyTw%2FKCSi9Q%3D&reserved=0, or mute the threadhttps://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FABXVALTX7J6S6UI6NZSVTITP4KCGLANCNFSM4H3KM74A&data=02%7C01%7Ccaitlin.pries%40dartmouth.edu%7Cf888bd7ce48f4e9533a508d6f9b10829%7C995b093648d640e5a31ebf689ec9446f%7C0%7C0%7C636970936074685617&sdata=JKJPddUpqP5Lmwr8%2Bo4zWghLX7JXuMtSF7lIdvKJEOw%3D&reserved=0.

jb388 commented 5 years ago

@CaitlinPries @aahoyt @crlsierra @ShaneStoner @susanecrow @christinaschaedel @mazizirad See sidb/data/template_metadata_new.yaml for an example of a possible new template.

@CaitlinPries My thought about how to deal with treatments is to just put everything into the incubationInfo section. I don't see a reason to distinguish between field and laboratory treatments. I adopted your idea about the structure: experimentalTreatment: value: units: I also added a "built-in" experimentalTreatment for ammendment, as this is quite common: ammendment: value: units:

But feel free to modify the file that I uploaded.

ShaneStoner commented 5 years ago

My two cents until the conference call, specifically about listing depth in the initConditions file (since I personally did that): Unless I'm wrong, there's currently no way to include initial soil data for multiple depths at the same site.

A couple more cents, I guess: I like @CaitlinPries 's idea about the laboratoryTreatment section as a way to assist data entry and query experimental treatments available across the database. I think we could pretty easily make sure a QA/QC prevents redundancies from causing problems in data querying.

jb388 commented 5 years ago

I've implemented the proposed changes on a new branch of the repository ("dev").

If you are interested in seeing what the new templates look like, you can install the R package using: devtools::install_github("SoilBGC-Datashare/sidb/Rpkg", ref="dev")

If you want to look at the script I wrote to manipulate the data, you can download the script "flatter_reports_sibdNew.R" from the directory sidb/scripts (on dev branch). The goal with this script was to generate an easily query-able dataframe, which is linked by a unique identifier to the relevant timeseries.

The script currently generates two lists (hence the name "flat"-er): 1) ts.l

aahoyt commented 5 years ago

Great work - this is a major advance!

jb388 commented 5 years ago

Based on yesterday's conversation in which we agreed to use the new format, I am closing this issue.