SoilBGC-Datashare / sidb

Soil Incubation Database sidb
https://soilbgc-datashare.github.io/sidb/
MIT License

Basic QA/QC tool? #1

Closed jb388 closed 5 years ago

jb388 commented 5 years ago

@aahoyt @CaitlinPries @christinaschaedel @crlsierra @mazizirad @ShaneStoner @susanecrow

I've been working on an R report tool to simplify and improve querying in sidb, and I've run into several QA/QC issues that have required manual edits of templates. I think it could be helpful for the future (assuming we enter more studies) to build at least some sort of simple QA/QC tool. Additionally, formalizing the template entry procedure in some sort of document would also be incredibly helpful, both to serve as a QA/QC reference, and for working on future entries.

Specific issues:

Others? Volunteers to work on either of these projects...? : )

CaitlinPries commented 5 years ago

Jeff et al.,

Thanks for looking into that. I thought we already had a QA/QC tool in R. I also thought that Carlos was hesitant to make it too strict in terms of the units. That being said, it definitely makes sense that the variables have to match the timeSeries headings.

During our week together in March, I started working on an entry procedure document above. I am happy to keep working on it and welcome feedback. I will add the part about the variable names matching. I volunteer to continue working on that document, which I would love to be included as a supplement to the paper.

Best, Caitlin

jb388 commented 5 years ago

Great! I thought that we had started a "user guide" document, but I couldn't remember what happened to it. I think that would be a nice supplement to the paper, or at the least, super helpful for the website.

There's a QA/QC tool that checks which fields are missing from a given metadata file by comparing it to the template, but that's it as far as I can tell.

As far as units, I thought that we had actually agreed on a list---not enforcing particular ones, per se, but enforcing how they are reported at least.

christinaschaedel commented 5 years ago

I think that sounds great. It would definitely be useful as a supplementary piece.

ShaneStoner commented 5 years ago

I know we agreed on a list of units, but I'm honestly not sure if it was ever written down on paper or electronically. Might be good to circulate a list and confirm them again?

Good work to Jeff and Caitlin!

crlsierra commented 5 years ago

Yes, a QA/QC function that checks correspondence between variables in the metadata file and the csv files is extremely important. Ideally, we would write separate functions that check different aspects of the functionality of the database. These functions can be stored either in the test folder that already exists, or in a separate folder within the R package. The idea behind having multiple functions with different tests is that we can run all the tests automatically and easily detect what went wrong. The Travis CI framework that Jeff mentioned recently in an email can run all tests automatically. The important thing is that we write the tests in the first place.

As for the units, they are currently in the template_metadata_workingfile.yaml file.
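For illustration only, the correspondence check described above could look something like the sketch below. It is written in Python for brevity (sidb's own tests would live in R); the function name `check_variable_names`, the column names, and the idea of passing the CSV as text are all assumptions, not part of sidb.

```python
import csv
import io


def check_variable_names(metadata_variables, csv_text):
    """Compare variable names declared in metadata with a CSV header row.

    Returns two sorted lists: names declared in the metadata but absent
    from the CSV header, and CSV columns not declared in the metadata.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty header if the file has no rows
    declared = set(metadata_variables)
    found = {name.strip() for name in header}
    return sorted(declared - found), sorted(found - declared)


# Hypothetical entry: the metadata declares "temp" but the CSV says "Temp".
csv_text = "time,CO2_flux,Temp\n0,0.12,20\n1,0.10,20\n"
missing_in_csv, missing_in_metadata = check_variable_names(
    ["time", "CO2_flux", "temp"], csv_text
)
print("declared but not in CSV:", missing_in_csv)
print("in CSV but not declared:", missing_in_metadata)
```

Returning both directions of the mismatch, rather than a single pass/fail flag, makes it easier to see what went wrong when many entries are checked at once.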

CaitlinPries commented 5 years ago

Here is the list of unit recommendations:

| Parameter | Unit |
| --- | --- |
| latitude and longitude | Decimal degrees |
| MAT | Celsius |
| MAP | mm |
| depth | cm, m |
| temperature | Celsius |
| moisture | percentGWC, percentFieldCapacity, percentWaterFilledPoreSpace |
| carbon | percent, mg/gSoil, g/gSoil, microg/gSoil |
| nitrogen | percent, mg/gSoil, g/gSoil, microg/gSoil |
| bulkDensity | g/cm3 |
| redox | mV |
| fluxes | gC-CO2/gSoil, gC/gC, microgC/gC/h, molC/gSoil/d, or combinations of the above |
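As a rough sketch, a list like this could be turned into an automated check along the following lines. This is Python for illustration, not sidb code; `ALLOWED_UNITS` and `check_units` are hypothetical names. Fluxes are left out on purpose, because "combinations of the above" cannot be captured by a simple lookup.

```python
# Recommended units copied from the list above; fluxes are omitted because
# combined flux units would need a parser rather than a lookup table.
ALLOWED_UNITS = {
    "MAT": {"Celsius"},
    "MAP": {"mm"},
    "depth": {"cm", "m"},
    "temperature": {"Celsius"},
    "moisture": {
        "percentGWC",
        "percentFieldCapacity",
        "percentWaterFilledPoreSpace",
    },
    "carbon": {"percent", "mg/gSoil", "g/gSoil", "microg/gSoil"},
    "nitrogen": {"percent", "mg/gSoil", "g/gSoil", "microg/gSoil"},
    "bulkDensity": {"g/cm3"},
    "redox": {"mV"},
}


def check_units(reported):
    """Return {parameter: unit} for each unit outside the recommended list.

    Parameters that are not in ALLOWED_UNITS are skipped, not flagged.
    """
    return {
        parameter: unit
        for parameter, unit in reported.items()
        if parameter in ALLOWED_UNITS and unit not in ALLOWED_UNITS[parameter]
    }


# Hypothetical entry with one unit spelled differently from the list.
problems = check_units({"depth": "cm", "carbon": "mg/g", "redox": "mV"})
print(problems)
```

This only checks how units are written, in keeping with the point made earlier in the thread that the aim is consistent reporting, not enforcing particular units.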

crlsierra commented 5 years ago

I just finished implementing the test infrastructure for sidb. It consists of two parts: 1) specific tests about the data, and 2) tests for the entire R package. The specific tests are located in the folder Rpkg/tests/testthat/ and use the testthat package. For example, at the moment there is one test that checks that all entries can be read in R. Another test checks that the names of the fields in the metadata files agree with the names in the metadata template. We can add more tests here as needed. The general R tests are run by the commands R CMD build and R CMD check. A script in the test folder runs these commands automatically. I recommend that everybody run these tests before pushing anything to the repository.

I'm also working on a continuous integration workflow in Travis CI, which will run all tests automatically on a remote server and notify us by email if something goes wrong. The tests are failing at the moment because of a problem with directory paths on the remote machine. Once I figure it out, the test infrastructure will be ready and I will close this issue. Let me know if you have any comments or suggestions.
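To make the field-name test concrete, here is a minimal sketch of the idea in Python. It is not the testthat code in Rpkg/tests/testthat/, which is written in R. It assumes the metadata and template have already been parsed into nested dictionaries, and the field names in the example are made up.

```python
def field_paths(node, prefix=""):
    """Yield a dotted path for every key in a nested dictionary."""
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        yield path
        if isinstance(value, dict):
            yield from field_paths(value, path)


def compare_to_template(entry, template):
    """Return (missing, unexpected) field paths relative to the template."""
    entry_fields = set(field_paths(entry))
    template_fields = set(field_paths(template))
    missing = sorted(template_fields - entry_fields)
    unexpected = sorted(entry_fields - template_fields)
    return missing, unexpected


# Hypothetical template and entry; the entry misspells one nested field.
template = {"citationKey": None, "siteInfo": {"site": None, "MAT": None}}
entry = {"citationKey": "Example2019", "siteInfo": {"site": "A", "mat": 5}}

missing, unexpected = compare_to_template(entry, template)
print("missing from entry:", missing)
print("not in template:", unexpected)
```

Comparing full dotted paths instead of bare key names means a field placed under the wrong parent is reported too.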

jb388 commented 5 years ago

Nice going, Carlos! I also started writing some additional tests for the test framework. The additional test script "test_dataStructure" is now on the dev branch.

Currently there are two new tests: one checks that the site names in the initConditions.csv file match those in the siteInfo file, and one checks that each table in the variables list of the metadata file has the same number of fields.
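Roughly, the logic of these two checks can be sketched as below. This is Python for illustration and is not the contents of test_dataStructure, which is an R script. The `site` column name, the function names, and the reading of "same number of fields" as "every entry in the variables list has the same number of keys" are assumptions.

```python
def unmatched_sites(init_rows, site_rows, key="site"):
    """Return site names used in the initial-conditions rows that do not
    appear in the site-info rows."""
    known = {row[key] for row in site_rows}
    used = {row[key] for row in init_rows}
    return sorted(used - known)


def field_counts(variables):
    """Return ({name: number of fields}, consistent) for a variables list,
    where consistent is True when every entry has the same field count."""
    counts = {name: len(fields) for name, fields in variables.items()}
    consistent = len(set(counts.values())) <= 1
    return counts, consistent


# Hypothetical data: site "B" is missing from the site table, and variable
# "V2" carries one more field than "V1".
init_rows = [{"site": "A"}, {"site": "B"}]
site_rows = [{"site": "A"}]
variables = {
    "V1": {"name": "time", "units": "d"},
    "V2": {"name": "CO2", "units": "gC/gC", "description": "respired C"},
}

print(unmatched_sites(init_rows, site_rows))
print(field_counts(variables))
```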

jb388 commented 5 years ago

@CaitlinPries I didn't see the allowable unit list for time variables in what you shared above. If I remember correctly, we had "d" for days, and "h" for hours. I think that was it?

CaitlinPries commented 5 years ago

Yes, it is days or hours, going by the flux units, but we do need to add a time unit to that doc.

jb388 commented 5 years ago

I'm closing this issue since QA/QC is now accomplished via tests. More tests need to be written, but in the future we can create specific GitHub issues for specific tests.