Closed kaijagahm closed 3 years ago
I made a mistake here: the dates in the sampleID's were correct. What I was getting confused about was some problematic dates in the metadataID's. I think those don't matter as much, so I'm going to leave them as is.
For the record, the problems were
[1] "unknown" "DIsedcores.Samples.2010115" "dvm.survey.05122013"
[4] "crazy.Chiro2015.2016"
"unknown" obviously stays that way. "DIsedcores.Samples.2010115" should probably be "DIsedcores.Samples.20100115", but maybe it should be "DIsedcores.Samples.20101015"--maaaaybe we want to change this? But it doesn't seem worth it. "dvm.survey.05122013" should ideally be "dvm.survey.20130512", but again, maybe not worth the change.
Checked FISH_SAMPLES, and the dates are all correct. MetadataID dates are also all correct, except for another "unknown" and for one metadataID that had a trailing space. I checked that metadataID in METADATA--it does not have a trailing space in that table. So I fixed the trailing space in FISH_SAMPLES for those samples--three of them had the trailing space. There were some with that metadataID that did not have the trailing space.
Before I finish updating this, Stuart suggests decomposing sampleID's (or, I suppose, concatenating columns together to re-create the sampleID's) and checking them against the existing ones to see if there are any inconsistencies.
@joneslabND you were very smart to have me check all the sampleID's--unfortunately, I found a whole bunch of errors, both in SAMPLES and FISH_SAMPLES. To give you a sense, there are 1781 problematic sampleID's in SAMPLES, and 5015 in FISH_SAMPLES. Most of the problems fall into one of a few categories. Namely:
sampleID's don't include depthBottom: where that part of the sampleID should be e.g. _0_
, it is instead two underscores: __
. In most if not all of these cases, the depthBottom field is not empty, so it's unclear why those values would have been omitted. I looked at the values and I couldn't find a pattern--it's a mix of integers and decimal numbers.
the formatting of the depthClass part of the sampleID doesn't agree with the formatting of the value in the depthClass column. For example, we might have _horiz.tow_
in the sampleID, but then the value in depthClass is horizTow
. I think so far I've only found that to be the case for horiz.tow
and horiz tow
, not for any other depthClass values, but there may be others. Note that resolving this is probably going to involve also solving #53 at the same time, in order to avoid re-creating this error by applying corrections to depthClass later on and pulling it out of sync with the sampleID's.
the metadataID in the table doesn't agree with the metadataID portion of the sampleID. This error accounts for the vast majority of problematic sampleIDs.
times that don't match up. In all cases that I've found yet, the sampleID is exactly 1 hour later than the time recorded in dateTimeSample. Could this be a time zone thing or something?
A couple random one-off errors: errant spaces in the sampleID's, n/a
instead of 0
for the depthBottom portion, leading zero missing from the times (e.g. 925
instead of 0925
).
@joneslabND @ctsolomon as far as I can tell, the way to address these will be to 1) fix the problems (maybe one category at a time?), and then 2) search for any instances of the old sampleID's in other tables and correct them there as well. But because this seems to be a really large issue that has the potential to touch most of the database tables, I wanted to definitely check with you before proceeding.
Yikes. Thanks both for catching this.
I think probably the thing to do is to fix these errors, as you suggest. One hesitation is that any code that we have that calls on sampleIDs (e.g. probably some mark recapture analyses, probably lots of other things) will become difficult to port into the future. Such code should still run on the version of the database it was written for, but trying to run an updated version of the code on a new version of the database could be a nightmare. I think that’s probably a price we just pay, but maybe worth a little thought (and conversation with the group?) before we jump in.
I agree that this will be a mess--thanks for reminding me that this will cause issues with code, not just with the database tables themselves.
This won't solve the issue, but one thing that I should be able to do in order to mitigate the situation is to create a lookup table, where there's one column for the original (problematic) sampleID's, and another column for the corrected ones (once we make the changes). Then, I can write a wrapper function that takes input of an original sampleID, looks it up in the table, and returns the corresponding corrected sampleID. That way, if people did want to go back and change their old code, they could do so just by wrapping sampleID's in that function. Still a pain, I know, but it would be better than having to change them all manually.
So for example, I'm thinking that if I write a function called, say, correct
, code like this:
obj <- "old_sample_ID.20200630"
could be changed to
obj <- correct("old_sample_ID.20200630")
Anyway, that's just a thought--we can address this down the line. For now, I'll plan to save discussion of this for Tuesday's meeting, unless @ctsolomon @joneslabND you think that we should call a larger meeting, maybe including some of the grad students, to talk about the implications of this change.
Yes, the lookup table is a good idea.
Let's start by just talking about this at Tuesday meeting and decide from there whether we need to bother with larger group.
I'm on board discussing this Tuesday and in general with the fixes proposed by Kaija. The lookup table could be useful as a "just in case" reference. I'm not sure how many analyses will rely on the sampleID. Maybe more for fish analyses that I'm not aware of?
@joneslabND @ctsolomon Ran into an interesting issue that we didn't discuss at the meeting. There are 15 rows with siteID's, associated with metadataID == "Limno.Sample.20130416", that have spaces in them. The siteID's are:
"BA_dark 12" "BA_dark 18" "BA_dark 3" "BA_dark 6" "BA_dark control" "BA_light 12" "BA_light 18" "BA_light 3" "BA_light 6" "BA_light control"
It looks like whoever originally made the sampleID's removed the spaces from those siteID's, so when I recreate the sampleID's by simply pasting together the various columns, I retain the spaces and therefore the sampleID doesn't match.
It seems to me that for the sake of consistency, it would be better if siteID's never had spaces in them, in which case I'd go in and remove the spaces from these siteID's where they occur. The alternative is to leave things as is, and just allow for some manual fixing when sampleID's are created from siteID's like this, which have spaces. I guess I just worry that at some point someone will forget to do that and we'll end up with sampleID's that have errant spaces, which could maybe pose a problem for analysis later. What do you think?
Seems reasonable to me to remove the spaces from the siteIDs. Perhaps also we need a note in the database rules file indicating that that is the convention.
Those siteIDs look to me like they must have come from some kind of experiment. Do we know enough about them to confirm that the "Limno.Sample..." metadataID is an accurate description of how those samples were collected?
@ctsolomon I had the same thought--it seemed like an experiment, not like routine limno sampling. When I get the chance later today, I'll see what data are attached to these sampleID's (I keep meaning to write a function to do that).
My first guess is that they could be from Patrick's failed experiment, "Limno Corrals Bay", since I think BA is Bay Lake. But the projects table puts that project in 2014, not 2013.
I've almost finished correcting the sampleID inconsistencies in SAMPLES, before moving on to fixing FISH_SAMPLES. Here are a few remaining problems that I'm not sure how to deal with. @ctsolomon @joneslabND @Randinotte any ideas?
There's one sampleID, "BO_DeepHole_20120727_1530_PML_2_MethSurv.Sample.20120522", that has Limno.Sample.20120501 in its metadataID column. I think we went over this at the meeting, but just to make sure: MethSurv.Sample.20120522 is a valid metadataID for the SAMPLES table, and I should change metadataID back to MethSurv.Sample.20120522 in this case, correct?
Several samples (from 2019) have "Staff.Gauge.Sample.20140319" in the metadataID column, but the metadataID portion of the sampleID is "Limno.Sample.20160505" instead. Which is correct?
Similarly, several samples (from 2019) have "DOC.20110601" in the metadataID column, but the metadata portion of the sampleID is "Limno.Sample.20160505". Which is correct?
There's one sampleID, FE_DeepHole_20190724_1054_tow_6_Limno.Sample.20160505, which has "6" as the depthBottom portion of the sampleID, but the value in the depthBottom column is "8". Should we go back to the original data sheet to figure out which of these is correct?
As long as a number of other samples have the MethSurv.Sample metadataID in the metdataID column then that is a good change.
It looks to me like the sampleID is wrong. There is no information about Staff.Gauge (reading the ruler in the lake that tells us if the depth of the lake has changed) sampling in Limno metadata. I guess we should change the sample ID on these. It makes me wonder what all the other staff gauge sampleIDs have for the metadata portion and what they have in the metadataID column of the sample table. I guess we could create a new issue to double check that all staff gauge sample IDs have the correct metadataID portion and that those samples have the correct metadataID in that column of the samples table.
I would guess that these need to have the metadataID column entry changed to Limno.Sample
Yes, we should look at the sample sheet. I'll put this on a list for me to check when I get into the lab
On the BA sites with spaces. I agree these look like the Kelly experiment. Alternatively, it could be the nutrient limitation experiment. What project are they associated with?
On the BA sites with spaces. I agree these look like the Kelly experiment. Alternatively, it could be the nutrient limitation experiment. What project are they associated with?
They're associated with project 20. So that is indeed Patrick's limno corrals. @joneslabND you said that was a failed experiment. It looks like there's at least some data, though not much. Does this mean we need to change the start year of the limno corrals project to 2013 in the PROJECTS table (it's currently 2014)?
Those siteIDs look to me like they must have come from some kind of experiment. Do we know enough about them to confirm that the "Limno.Sample..." metadataID is an accurate description of how those samples were collected?
@ctsolomon I don't know much about Patrick's project, and I haven't been able to find any associated metadata files (Patrick didn't email me back, and I stopped bugging him because he seemed very busy). Stuart says this project never really got off the ground, so it's possible metadata files don't exist? In any case, unless @joneslabND knows better, it seems pretty unlikely that this project would be associated with routine limno sampling, but I'm not sure what to substitute in.
Yes, please - change the year. I guess it was 2013 and not 2014. It only ran in one summer.
I think he did some limno sampling from the mesocosms, so I think it is ok to leave that as the metatdata ID.
Okay, I've addressed all of these as decided above, and created new issues for changing the year on project 20 (and any other future fixes to PROJECTS that we may come across) and checking all the staff gauge samples.
Still to address in SAMPLES:
Now on to FISH_SAMPLES!
@joneslabND also for the list of data sheets to check: "KT_FN3_20170628_1015_FN_GonadSurvey.20170915" (in FISH_SAMPLES) says 1015 in the sampleID, but based on the dateTimeSample column, it looks like it should be 1035. Not entirely sure which is correct.
"MK_FNun_20160601_1445_FN_GonadSurvey.20160623" (in FISH_SAMPLES) has the date 20180601 in the sampleID, but the date in dateTimeSample and dateSample is 20180531. Randi had thought that this might be a case where the person entering the data got confused and accidentally used the date from dateTimeSet instead of dateTimeSample, but the date in dateTimeSet is, as expected, 20180530, not 0531 or 0601. I think it would be good to check the original data sheet to make sure this isn't anything beyond just a simple typo/error.
The time issue continues to be very cryptic. I can't recreate it. It can't be a time zone issue because the discrepancy is in the wrong direction. Going to go the route of just changing the times back to what they are in the sampleID's, and then verifying a random sample of sampleID's against the original data sheets to make sure it's correct. If/when I have time to go back and work on the workflow as a whole, I'll work on making sure the times line up correctly. Possibly converting to UTC, or keeping everything as a text string all the time (suboptimal if you need to do time calculations...)
Also changed Chelsea's metadata dragdown errors: 0627, 0628, 0629, and 0630 --> 0626. [EDIT: see also #93].
Apart from the above-mentioned checks, and a check that needs to be done on randomly-selected rows, I've completed this. Next step will be combing through these to figure out which other tables will be affected by the changes, and systematically making those changes. This will involve updating the sampleID's using the reference tables, but also probably updating some siteID's, datetimes, etc.
EDIT:
@joneslabND, when you get the chance, could you check the original data sheets for these two text files? 10 rows in each; one is from SAMPLES and the other is from FISH_SAMPLES. The rows are sorted by dateTimeSample. NOTE: when you're looking at the original data sheets, DON'T look by sampleID! The sampleID's won't necessarily match up, because I've made some other fixes. The oldID
column in these text files is the original sampleID, before any of my changes/fixes. That's what you should rely on to link to the data sheets.
For each oldID in these files, I need to know the original dateSample and dateTimeSample.
Let me know if that's confusing! I'd be happy to clarify. And thank you. FISH_SAMPLES_testrows.txt SAMPLES_testrows.txt
Systematic updating of the other database tables that are linked to SAMPLES and FISH_SAMPLES:
Probably best to go over these changes, and think about the best ways to approach making them, at Tuesday's meeting before I implement them.
The SAMPLES_testrows.txt dateSample and DateTimeSample fields all matched the hard-copy data sheets.
The FISH_SAMPLES_testrows.txt dateSample and DateTimeSample fields: row 1 - matches the hard copy row 2 - I don't see a sheet with FF as crew; there are two sheets with AT & AS as crew and one sheet with MS & AK as crew. These have the dateSample right, but the dateTimeSamples are 10:00, 10:58, and 8:55 row 3 - matches the hard copy row 4 - the dateTimeSet is changed to 2019-06-21 10:00 and dateTimeSample is changed to "2019-06-23 3:30"; I assume this is 3:30 pm and so should be 15:30. I think this is probably an entry error and not an issue with the way you deconstructed sampleIDs to get dates and times; I recall that Amaryllis did some checking and revising of these data. We may want to circle back and double check with her on some of this. row 5 - matches the hard copy row 6 - dateTimeSample is good. dateTimeSample is 2019-07-11 02:25 and not 2019-07-11 01:25, but the effort column is correct row 7 - matches the hard copy row 8 - empty on the data sheet... could use the effort? --> follow up with Amaryllis row 9 - empty on the data sheet... could use the effort? --> follow up with Amaryllis row 10 - missing dateTimeSet on data sheet and no time on data sheet --> follow up with Amarylliss
For FISH_SAMLES it seems like deconstructing sampleIDs is generally ok, but Minnow Traps (MT) in 2019 had some problems
Thanks for checking these, Stuart!
I'm a little uncertain how to move forward here. These 10 rows aren't anything special--they are 10 randomly sampled rows from FISH_SAMPLES. I could go through each one individually and fix them according to the comments above, but I don't think that's productive because then I would have to check all other rows by hand too.
It seems like at least one of the problems above has to do with the minnow trap deploy times issue, which is being addressed separately in #60. So maybe I'll hold off and try re-running this once those samples are fixed. But as for the other ones, I'm not sure. Bring up at the meeting.
Note from 9/25/2020: In the course of adding the crazy chiro samples and creating the BENTHIC_INVERT_SAMPLES table (issue #37), I noticed some sampleID's that had __ in them (double underscore), where the depthBottom measurement was missing. There were 9 sampleID's, and they were:
[1] "WL_N_20110518_0000_sediment__zoobenthos.Isotope.Sample.20110714"
[2] "EL_Littoral_20110524_0000_sediment__zoobenthos.Sample.20110714"
[3] "WL_Littoral_20110524_0000_sediment__zoobenthos.Sample.20110714"
[4] "EL_Littoral_20110620_0000_sediment__zoobenthos.Sample.20110714"
[5] "WL_Littoral_20110620_0000_sediment__zoobenthos.Sample.20110714"
[6] "EL_Littoral_20110721_0000_sediment__zoobenthos.Sample.20110714"
[7] "WL_Littoral_20110721_0000_sediment__zoobenthos.Sample.20110714"
[8] "EL_Littoral_20110809_0000_sediment__zoobenthos.Sample.20110714"
[9] "WL_Littoral_20110809_0000_sediment__zoobenthos.Sample.20110714"
I corrected these, replacing _ with [depthBottom]_, in the script BENTHIC_INVERT_SAMPLES_CREATE_gh37.R. Will need to make sure this change gets recorded when we make the sampleID changes. I have already gone into the other tables that contained these sampleID's (ISOTOPE_SAMPLES_BENTHIC_INVERTS and LIPID_SAMPLES) to make the changes there. But we'll need to include these, and their fixed equivalents, in our reference table in case anyone wants to trace what we did here, or use the changes in their own scripts.
Ok, we took a long hiatus from this to work on other issues. I have now returned to it, doing a sweep through all of the *_SAMPLES tables to make sure their sampleID's are reconstructable.
The new script is called sampleIDsCheck_gh51.R. I'm not going to finish the old script, though I'll keep it for records.
FISH_SAMPLES: Most problems were taken care of by the various updates to the fish tables. Fixed the remaining problems. Ready to write this table out.
SAMPLES: Fixed all remaining problems, relying heavily on previous documentation in this issue, in my slack chat with Randi, and in the earlier version of the R script.
BENTHIC_INVERT_SAMPLES: Fixed all remaining problems.
SED_TRAP_SAMPLES: No fixes required.
CREEL_SAMPLES: Still working with @Randinotte to fix the time problems.
CREEL_TRAILER_SAMPLES: Added columns that make up the components of the sampleID.
CREEL_BOAT_SAMPLES: Added columns that make up the components of the sampleID.
FLIGHTS_SAMPLES: Added columns that make up the components of the sampleID.
Still to do:
Saved lookup tables for BENTHIC_INVERT_SAMPLES and SAMPLES, which are the only ones where sampleID's actually changed (for FISH_SAMPLES, I resolved the one problem by changing a value in the metadataID column, not in the sampleID itself).
Still have to fix the CREEL table in response to the spreadsheet that Randi sent, and then attend to the other to-do items above.
These are fixed with v 4.5.0.
Found some sampleID's where the date format is e.g. 05122013 instead of 20130512. This is a problem for extracting dates etc. from sampleID's.
So far, the problematic samples are in NUTRIENTS and are from dvm.survey. But I think it would be a good idea to go through all the tables and fix this.