bpbond / cpcrw_incubation

PNNL TES incubation of CPCRW soil cores #openexperiment
http://bpbond.github.io/cpcrw_incubation
MIT License

Additional Replication figure in "Missing/Problematic Data" section #18

Closed apeyton closed 7 years ago

apeyton commented 8 years ago

There are a few issues showing up in our missing/problematic data (especially with regard to replication) that I am concerned about. The latter leads me to wonder whether some of the data we agreed should be removed is still included in the analysis. Here is a list of my concerns:

  1. Correct days/times. Since the Picarro is on Zulu time (GMT), our measurements may cross over to a new day. For example, our measurements on Nov 11th occurred at the end of Nov 11th and the beginning of Nov 12th (GMT). This makes it look like we measured cores on different dates. I will see what I can do on the Picarro end (i.e. trying to reset the time zone), but it may be easier to code in a conversion so we don't see random days with minimal replications show up on our "missing/problematic data" figures. The first figure shows that Cores 2, 6, 28, 27, 24, 19 (20C controlled drought cores) were run on the 12th.
  2. November 11th shows that cores 9, 21, 14, 18, 23, 37 (20C, drought cores) were only measured once. But they were run in duplicate... some of the measurements occurred on Nov 12th (GMT). 11Nov2015.pdf. Here is a graph in frac_hrs_since_jan1 that shows each treatment got at least 2 replicate readings.
  3. October 27th. We agreed to remove many of the duplicated runs (because the Picarro was not measuring from several ports that day, etc.), so we should not have so many replicates. It also looks like the replicates are mislabeled! It shows that cores 23, 18, 15, 21, 9, 37 have 6 replicate readings when they actually only have 2! I ran cores 25, 36, 8, 7, 29, 20 four times (repeated a duplicate run) and ran cores 12, 26, 14, 1, 16, 13 six times (repeated the duplicate run twice with no luck registering above ambient, then ran a final duplicate reading using different valve ports, for a total of 6). So it looks like we may have to review the Oct 27 data and make sure that it is accurate. 27 Oct Picarro readings.pdf
  4. Let's review Sept. 22nd as well. It has 4 replicates for 38, 39, 10, 5, 22, 34 (field moist 20C cores) and it would be good to make sure that the correct data is being used in the analysis downstream (such as CO2 AND variability/outliers).
bpbond commented 8 years ago

Hey, sorry for the slow response @apeyton - tough week. Thanks. Yes, I think we'll have a list of potential problems to look at, these and others. At this point, I'd say let's finish up the 100 days and then get to work examining these types of issues; I will beef up the R pipeline (see #11) in preparation for this process.

About your no. 1 above - please don't change the Picarro time zone! Ugh. Much easier to deal with this in other ways.
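
One way this could be handled in code, rather than on the instrument, is to keep the timestamps in UTC and derive a local calendar date only for grouping and plotting. The following is a minimal base-R sketch, not the pipeline's actual code: the two timestamps are made up for illustration, and the `America/Los_Angeles` zone is an assumption about where the measurements were taken.

```r
# HYPOTHETICAL timestamps straddling midnight UTC (not taken from the real data)
utc_time <- as.POSIXct(c("2015-11-11 23:50:00", "2015-11-12 00:10:00"), tz = "UTC")

# ASSUMPTION: the lab's local time zone; change as appropriate
local_tz <- "America/Los_Angeles"

# Calendar date as seen in UTC: the two readings fall on different days
utc_date <- as.Date(utc_time, tz = "UTC")

# Calendar date as seen locally: both readings fall on the same day
local_date <- as.Date(utc_time, tz = local_tz)

print(data.frame(utc_time, utc_date, local_date))
```

Grouping the diagnostic plot by `local_date` instead of `utc_date` would keep a single measurement session on one day without touching the instrument clock.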

bpbond commented 8 years ago

So, re 1 and 2 above - purely cosmetic in that one diagnostic plot. We can fix it, but also fine to ignore. Re 4 - see my comment in issue #13. Re 3 - I'm unclear. Here's your plot (below), but I don't see any problem in any of the diagnostics for 27 October. What, if anything, should be done here? I'm concerned if cores are potentially mislabeled, of course.

27 oct picarro readings

apeyton commented 8 years ago

I think the issue isn't shown in this individual plot - I was comparing this plot with the replication plot. There were not 6 replications for the cores 23, 18, 15, 21, 9, 37. So, we need to figure out which part of the script mislabeled the ports/time/core, or remove the wonky readings from the data altogether. Does that make sense?

replication 27 oct 2015

bpbond commented 8 years ago

Ah, got it. OK, let me look at those. Hmm:

       Core       Date n
1     AL 15 2015-10-27 6
2     AL 18 2015-10-27 6
3     AL 21 2015-10-27 6
4     AL 23 2015-10-27 6
5     AL 37 2015-10-27 6
6      AL 9 2015-10-27 6
7 Ambient22 2015-10-27 6

Looking just at AL 15 from that date:

  samplenum            DATETIME   N MPVPosition h2o_reported valvemaprow  min_CO2
1      1075 2015-10-27 18:06:39 115          12    2.5263508         552 499.5770
2      1081 2015-10-27 18:18:36 120          12    2.5524665         552 502.8194
3      1101 2015-10-27 20:27:27  86          12    0.8356557         552 460.7885
4      1107 2015-10-27 20:39:28  81          12    0.8054356         552 464.4229
5      1127 2015-10-27 21:55:17  85          12    0.8209873         552 457.9037
6      1133 2015-10-27 22:07:15  83          12    0.8240117         552 456.8838
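
A quick way to see how these six samples cluster in time is to look at the gaps between consecutive timestamps. This is a minimal base-R sketch using only the DATETIME values printed above; the 30-minute gap threshold is an assumption chosen for illustration.

```r
# DATETIME values for AL 15 on 27 Oct 2015, copied from the table above (UTC)
times <- as.POSIXct(c("2015-10-27 18:06:39", "2015-10-27 18:18:36",
                      "2015-10-27 20:27:27", "2015-10-27 20:39:28",
                      "2015-10-27 21:55:17", "2015-10-27 22:07:15"),
                    tz = "UTC")

# Minutes between consecutive samples
gap_mins <- as.numeric(diff(times), units = "mins")

# Start a new run whenever the gap exceeds the threshold
# ASSUMPTION: 30 minutes is a reasonable cutoff between separate runs
run_id <- cumsum(c(TRUE, gap_mins > 30))

print(round(gap_mins, 1))
print(table(run_id))
```

The samples fall into three runs of two readings each, roughly 12 minutes apart within a run, which looks like three separate duplicate runs on valve 12 rather than six replicates from one run.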

Then when I look at those sample numbers in the Picarro data, they're all from valve 12 which is in fact AL 15 on that date. They appear in five different files (I have opened up the raw files and verified this):

[1] "CFADS2283-20151027-171013Z-DataLog_User.dat.gz"
[2] "CFADS2283-20151027-181018Z-DataLog_User.dat.gz"
[3] "CFADS2283-20151027-200015Z-DataLog_User.dat.gz"
[4] "CFADS2283-20151027-210022Z-DataLog_User.dat.gz"
[5] "CFADS2283-20151027-220030Z-DataLog_User.dat.gz"

It does look like there were six different samples taken from AL15 across five hours. Thoughts @apeyton ?

bpbond commented 8 years ago

Comment from @apeyton on Slack--until further notice ball is in her court on this one.

from those cores...or from those ports? Raw Picarro data only takes the port/valve into consideration. There were def. 6 different samples taken, but not from those cores...from those ports/valves. I think I need to recheck the core weights/times sheet with the raw Picarro data to make sure we accurately aligned our cores with the ports and raw data.

apeyton commented 8 years ago

I checked the times recorded for 27Oct2015 and I think I figured out why it assumed that there were 6 replications for the cores 23, 18, 15, 21, 9, 37 based on the times I recorded: It just assumes that everything past the time recorded is assigned to the core, but that is not correct. Perhaps we need to be in an 'end time' in the script. Basically, only assign valves/ports to cores based on only the 30 minutes following the time recorded. This would ensure that we are only collecting the measurements that we select (that are accurate!) and not the repeated mess up measurements made.

Thoughts?

apeyton commented 8 years ago

typo - "add in an 'end time'"

bpbond commented 8 years ago

Hi @apeyton I'd like to tie this up.

Yes, the script assumes that any measurement made at or later than "Time_set_start_UTC" (in the valvemap.csv file), made on the same day, and with a matching valve number should be assigned to that particular core. Relevant code is in 2-summarize.R:101-103:

  rowmatches <- which(DATETIME >= valvemap$StartDateTime & 
                        yday(DATETIME) == yday(valvemap$StartDateTime) &
                        MPVPosition == valvemap$MPVPosition)

You wrote: "but that is not correct"

So what should be done? You suggest that the match should be made only for 30 minutes after the start time. OK, but then what should be done with measurements outside this window?

Thanks, B

apeyton commented 8 years ago

I think the measurements outside this window should be tossed.

bpbond commented 8 years ago

Why? I don't understand. What's special about 30 minutes?

apeyton commented 8 years ago

Update on issue: 30Mar2016 phone call

  1. Cores for each treatment were measured in duplicate, totaling 24 mins (per treatment).
  2. If the Picarro is recording core data past 30 mins from the time of first measurement, then there is an error that we need to identify.
  3. Updating CoreData.csv to list Picarro data to remove from analysis (i.e. removing runs where there were Picarro reading errors) to help solve many of the problems identified in this issue (list from Nov 15th).
  4. Measurements made after 30 minutes may correspond to valve 10 - the ambient valve that measures continually between treatment measurements.
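
The 30-minute rule discussed in this thread could be sketched as follows. This is a standalone base-R illustration, not the code in 2-summarize.R: `match_rows` and `window_mins` are hypothetical names, the start time is made up, and the same-day check uses `as.Date()` in place of lubridate's `yday()`. The sample times are the six AL 15 readings on valve 12 listed earlier in the thread.

```r
# Six AL 15 sample times from 27 Oct 2015 (UTC), all on valve 12
datetime <- as.POSIXct(c("2015-10-27 18:06:39", "2015-10-27 18:18:36",
                         "2015-10-27 20:27:27", "2015-10-27 20:39:28",
                         "2015-10-27 21:55:17", "2015-10-27 22:07:15"),
                       tz = "UTC")
mpv <- rep(12, length(datetime))

# HYPOTHETICAL start time; the real value would come from valvemap.csv
start <- as.POSIXct("2015-10-27 18:00:00", tz = "UTC")

# Hypothetical helper: rows at or after `start`, on the same UTC day, on the
# given valve, and no more than `window_mins` minutes after `start`.
# window_mins = Inf reproduces the open-ended rule described above.
match_rows <- function(datetime, mpv, start, valve, window_mins = Inf) {
  elapsed <- as.numeric(difftime(datetime, start, units = "mins"))
  which(elapsed >= 0 &
          elapsed <= window_mins &
          as.Date(datetime, tz = "UTC") == as.Date(start, tz = "UTC") &
          mpv == valve)
}

open_ended <- match_rows(datetime, mpv, start, valve = 12)
windowed   <- match_rows(datetime, mpv, start, valve = 12, window_mins = 30)

# Rows the window would exclude; flag these for review rather than silently dropping
outside <- setdiff(open_ended, windowed)

print(list(open_ended = open_ended, windowed = windowed, outside = outside))
```

With this made-up start time, the open-ended rule assigns all six samples to the core, while the 30-minute window keeps only the first two. Returning the excluded rows separately keeps them available for checking against CoreData.csv instead of discarding them outright.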