Brainstorming: general checks to do on the database

kaijagahm commented 3 years ago

Keep adding to this as we think of more

No duplicate invertID's in BENTHIC_INVERTS
No duplicate fishID's in FISH_INFO
No duplicate sampleID's in SAMPLES, BENTHIC_INVERT_SAMPLES, SED_TRAP_SAMPLES, or FISH_SAMPLES
No duplicate trailerID's in CREEL_TRAILERS
Are all metadataID's in the metadataID columns represented in METADATA?
Are all metadataID's in METADATA represented somewhere else in the database?
Are all updateID's in all the tables included in UPDATE_METADATA? [note: NOT the other way around]
Is dateTimeSample/dateSample always later than dateTimeSet/dateSet? EDIT: addressed by timeTravel() in utilities
Can the sampleID's be reconstructed from their component parts in the *SAMPLES tables?
Are all sampleID's from BENTHIC_INVERTS represented in BENTHIC_INVERT_SAMPLES?
Are all sampleID's from the various data tables represented in SAMPLES?
Are all sampleID's from FISH_INFO represented in FISH_SAMPLES?
Make sure that invertID's and fishID's are properly composed of sampleID and invertNum/fishNum. I came across a few rows where the invertID was just the invertNum (i.e. an integer) and didn't include the sampleID at all.

Randinotte commented 3 years ago

I have been coincidentally writing check functions during this last db update in order to be more meticulous. So, these may not be exactly in the form that we would want them for a broad check tool, but they're at least a little fleshed out. In my QCfuns.R script I have:

checkMet(): checks that metadata in a table are in METADATA
recreateSampleIDs(): checks that sampleIDs can be recreated with component parts
parseSampleID(): helper for recreateSampleIDs that breaks down a sampleID string
checkINFO(): checks that all sampleIDs from an INFO table are in the corresponding SAMPLES table
checkDuplicates(): checks for/identifies duplicates in a column (really simple, but frequently used)
findID(): finds all tables/column combinations where a given string is present (eg. "find all places this sampleID exists in the db")
changeSomething(): relatively specific use, but finds all instances of a string in a table and corrects it and assigns a fix updateID

So these would take care of some:

checkDuplicates()

No duplicate invertID's in BENTHIC_INVERTS
No duplicate fishID's in FISH_INFO
No duplicate sampleID's in SAMPLES, BENTHIC_INVERT_SAMPLES, SED_TRAP_SAMPLES, or FISH_SAMPLES

checkMetadata()

Are all metadataID's in the metadataID columns represented in METADATA?

replicateSampleIDs

Can the sampleID's be reconstructed from their component parts in the *SAMPLES tables?
Make sure that invertID's and fishID's are properly composed of sampleID and invertNum/fishNum. I came across a few rows where the invertID was just the invertNum (i.e. an integer) and didn't include the sampleID at all.

checkINFO()

Are all sampleID's from BENTHIC_INVERTS represented in BENTHIC_INVERT_SAMPLES?
Are all sampleID's from the various data tables represented in SAMPLES?
Are all sampleID's from FISH_INFO represented in FISH_SAMPLES?

not covered

Are all metadataID's in METADATA represented somewhere else in the database?
Are all updateID's in all the tables included in UPDATE_METADATA? [note: NOT the other way around]
Is dateTimeSample/dateSample always later than dateTimeSet/dateSet?

Randinotte commented 3 years ago

The fun part is that there are differences in format between tables, mostly in sampleID structure, so what works for a sampleID string in FISH doesn't equate with WATER_CHEM without some attention to detail. Just something to watch out for as we continue...

kaijagahm commented 3 years ago

A more general note about all of this: we really should have a check, either in the SQL script itself or in a separate R function, that makes sure the primary key for each table (either one column or a composite primary key) is actually unique. I think that would go a long way to resolving a lot of these issues. See for example #103 .

kaijagahm commented 3 years ago

Another one, inspired by #102: check for duplicate metadataID's in METADATA. But this also speaks to my previous comment, on preventing duplicates in the primary key columns in general.

kaijagahm commented 3 years ago

Inspired by #104 and the weird fish problem:

no duplicates in pitApply
no duplicates in floyApply
if a tag number appears in pitRecapture, it has to have previously appeared in pitApply [* unless someone didn't write it down? So what's the contingency here... maybe if it hasn't appeared before, you add a comment that says it hasn't appeared before?]
if a tag number appears in floyRecapture, it has to have previously appeared in floyApply [but same caveat as above].
if a fish is recaptured in a different basin than where it was originally tagged, add a note to jumperDescription.

kaijagahm commented 3 years ago

General note: I think a good way to aggregate all these check functions would be to write a package. I have done this once now and feel relatively confident in my ability to get a simple package off the ground.

In response to @Randinotte's points above about the formatting being different between different tables: I think this would be relatively straightforward to tackle with some extra function arguments. So for example, if there's a function that needs to behave differently depending on the format of the sampleID, you could just have an argument 'type' or something, which could take values 'fish', 'regular', or something along those lines.

Randinotte commented 3 years ago

@kaijagahm I have an argument fish = t/f but it feels a little clunky. But, might be a starting place when we're considering this. I know it's in recreateSampleID(), maybe some others.

kaijagahm commented 3 years ago

I wrote some check functions for date and time checking (dateTimeSample, dateTimeSet, sampleID), as well as some helpers (check whether something is a data frame; check whether columns exist in a given data frame). They're in utilities, in the 'checks.R' script. Documented minimally in the README.

kaijagahm commented 3 years ago

@Randinotte would you consider uploading some of the functions you've written to the utilities repo? I might start aggregating them into a package, because that will help with documentation and consistency, and it really won't take very long--writing the functions is the bulk of the work.

Randinotte commented 3 years ago

Sent the script via slack in our direct messages because I didn't have permission to upload to the utilities repo.

kaijagahm commented 3 years ago

Added Randi's functions to the utilities repo.

Randinotte commented 3 years ago

An idea- a way to check a script for unfinished case_when() ... TRUE ~ xyz

kaijagahm commented 3 years ago

Talked with Stuart today, 22 Jan 2021:

No, it was not a deliberate decision to remove database constraints. Those were probably imposed in Access, and they didn't know how to recreate them in SQLite.
Can move toward implementing those constraints, one at a time.

Types of constraints: NOT NULL: Makes sure that there are no null values in a given column. Should use that for:

sampleID
fishID
metadataID
updateID
Others? We should be pretty conservative here.

PRIMARY KEY: can specify a column or a group of columns, in the SQL script, to serve as a primary key. Code is like this:

CREATE TABLE table_name(
   column_1 INTEGER NOT NULL,
   column_2 INTEGER NOT NULL,
   ...
   PRIMARY KEY(column_1,column_2,...)
);

Note that in SQLite, the PRIMARY KEY constraint does not imply NOT NULL, as it does in SQL. You have to specify that separately. PRIMARY KEY does automatically imply UNIQUE

UNIQUE Makes sure the value is unique. Can use this quite a bit in our database!

FOREIGN KEY "A FOREIGN KEY is a field (or collection of fields) in one table that refers to the PRIMARY KEY in another table." (source)

prevents invalid data from being inserted without the reference existing
prevents destroying info

CHECK I don't have a good handle on how to use this one yet. You can do obvious things, like limit a numeric column to being less than/greater than/within a certain range. But I think you can also set constraints based on other columns? So I could imagine doing something like "if the depth class is a certain thing, disallow depths below X". But maybe that's too granular.

kaijagahm commented 3 years ago

Created this google sheet to keep track of some of the fixes we described above. Added a sheet to it to track each table's primary keys.

Next, need to have someone check that these are the correct primary keys. Also need to check each primary key to see if it has duplicates.

Then, maybe do the same with foreign keys/between-table references. Could start by adding in some of the low-hanging fruit: designating primary keys in all tables where we can.

kaijagahm commented 3 years ago

As of 1/26 meeting: Stuart and Chris are going to check that I have the primary keys correct (see sheet linked in the previous comment)

Stuart brought up the question of error messages from SQL. When you enforce a constraint and it fails, e.g. because there are non-unique values in a column that's supposed to be a primary key, what does SQL tell you?

I tried this out by copying the currentDB folder and adding a PRIMARY KEY constraint to BACTERIAL_PRODUCTION_BENTHIC on sampleID, which we know has some duplicates. Then I tried to create the database. As expected, this failed, because PRIMARY KEY implicitly enforces UNIQUE as well.

@joneslabND was worried that the error messages might not be informative enough about where the error occurred. He's right: although the error message does tell you which table failed, it doesn't tell you that it was the PRIMARY KEY constraint that was the culprit. You have to know that PRIMARY KEY implies UNIQUE for this error message to make sense:

However! There's an unexpected benefit. The SQL error messages tell you which rows are duplicates. This could be overwhelming if you had a whole bunch of duplicates, but (as will usually happen) if there are only a few duplicates, this could be really useful for identifying them. If you look closely, the message also tells you which column is the culprit: BACTERIAL_PRODUCTION_BENTHIC.sampleID. That will be useful if we, say, put a PRIMARY KEY constraint on one column and a separate UNIQUE constraint on a different column.

We should think more about whether R scripts with more informative error messages will have a role to play as well, but for now I'm actually fairly happy with the info that SQL provides.

Randinotte commented 3 years ago

Jogged my memory so I'll add- @kaijagahm remember that weird csv that had a problem with delimiters that that threw off the number of columns in the tables and made errors in the formation of the .db file? It gave a similarly formatted list of rows with an error message along the lines of "TABLE_NAME.txt:##: number of columns. Expected ## number of columns and had ##". Thought I would pass that on in case we end up writing a list of potential SQL error messages.

kaijagahm commented 3 years ago

Database update 4.4.0 adds primary key constraints for many (but not all) tables.

Tables that now have a PK constraint: BACTERIAL_PRODUCTION_PELAGIC BENTHIC_INVERT_SAMPLES CHLOROPHYLL COLOR CREEL_BOATS CREEL_BOAT_SAMPLES CREEL_FISH CREEL_INFO CREEL_INTERVIEW CREEL_SAMPLES CREEL_TRAILERS CREEL_TRAILER_SAMPLES DRY_MASS_EQUATIONS FISH_DIETS FISH_INFO FISH_MORPHOMETRICS FISH_SAMPLES FISH_YOY FLIGHTS FLIGHTS_INFO FLIGHTS_SAMPLES ISOTOPE_BATCHES ISOTOPE_RESULTS ISOTOPE_SAMPLES_BENTHIC_INVERTS ISOTOPE_SAMPLES_DIC ISOTOPE_SAMPLES_FISH ISOTOPE_SAMPLES_METHANE ISOTOPE_SAMPLES_PERIPHYTON ISOTOPE_SAMPLES_POC ISOTOPE_SAMPLES_WATER ISOTOPE_SAMPLES_ZOOPS LAKES LAKES_GIS LAKE_BATHYMETRY LIMNO_PROFILES LIPID_EXTRACTIONS LIPID_SAMPLES METADATA MOLECULAR_SAMPLE OTU PIEZOMETERS_INSTALL PIEZOMETERS_LAKE PIEZOMETERS_SENSORS PIEZOMETERS_SURVEYING PIEZOMETERS_UPLAND PRIMARY_PRODUCTION_BENTHIC PROJECTS RHODAMINE RHODAMINE_RELEASE SAMPLES SEDIMENT SED_TRAP_SAMPLES SITES STAFF_GAUGES UNITS VERSION_HISTORY ZOOPS_COEFFICIENTS ZOOPS_LENGTHS

Tables that will not get a PK: CREW LITERATURE_DATA (removed from the database as of 4.4.0) MOLECULAR_PROCESSED (removed from the database as of 4.4.0) PUBLICATIONS_PRESENTATIONS UPDATE_METADATA

Tables that still need a primary key constraint defined: BACTERIAL_PRODUCTION_BENTHIC BENTHIC_INVERTS FISH_OTOLITHS GC SED_TRAP_DATA TPOC_DEPOSITION WATER_CHEM ZOOPS_ABUND_BIOMASS ZOOPS_PRODUCTION ZOOPS_SUBSAMPLE

I am creating a new script to address the latter category, called morePrimaryKeys_gh99.R.

kaijagahm commented 3 years ago

For SED_TRAP_SAMPLES:

Change trapID to 4 for the ones that are duplicated. (does this need to be implemented in both STS and STD?)

For TPOC_DEPOSITION:

Can randomly assign the deepHole ones to E, N, S, W etc. sites. They were on transects from different parts of the shoreline toward the middle, and in both EL and WL, the deepHole is around 50m from shore in each direction. So, randomly assign, and then make a note in the metadata.
Also, go ahead and change the column names so we just keep track of the distance from shore in its own column, instead of trying to force it into the sampleID structure
- Check whether these samples appear in SAMPLES, and if so, figure out what to do with that.

Randinotte commented 3 years ago

Sent the script via slack in our direct messages because I didn't have permission to upload to the utilities repo.

Whenever you get back to needing the functions in my QCfuns.R script (that you put in the utilities repo for me), let me know and I'll send you the most recent version. I added a quick "check updateID's" function and I'm still having a problem with my "recreate sampleID's" function", although you probably have one that works after dealing with that in the duplicate sampleID's issue for so long.

kaijagahm commented 3 years ago

Noting here: I've created a script, relationalChecks.R, that will perform foreign key, sampleID, and other checks on each of the database tables. The idea is to run it before each database update.

As of today:

all foreign key checks are in place
all sampleID reconstructions are in place
need to figure out how to deal with NA's that are allowed to be NA's, such as in depthBottom.
Think about other checks for each table
Continue correcting the ones that need help
Think about a check for times

kaijagahm commented 3 years ago

Successfully created a new mini database version, v3.5.1, that resolves almost all of the relational and primary key checks. It's not on Box yet, and I may not add it until the next version is up; we'll see.

Dealt with the NA thing by using replace_na() in the sCheck function. Also added a dateTime check that works well.

Checks that are still unresolved are listed in #130, and additional primary key checks are in the primary key checks script.

Brainstorming more checks that aren't in the database yet: Data types

projectID should be an integer
runID, replicate, fishNum, invertNum, passengerCount, boatCount, etc. should be restricted to integers

Allowed values

depthClass should be restricted to [bottle, distFromShore, epi, horiz.tow, hypo, meta, midEpi, piez, PML, point, sediment, staff, surface, tow, NA]
Restrict fish sex to [M, F, U, NA]
Standardize 1/0 vs T/F, and restrict those columns to accept only binary values [eh, I don't think this is super important]
In OTU, restrict habitat and grouping to a set number of values. Habitat: [benthic, empty, littoral, none, pelagic, terrestrial, unidentifiable, unknown]. Grouping: [amphibian, benthic_invertebrate, benthic_invertebrates, bird, empty, fish, invertebrate, mammal, none, snake, terrestrial, terrestrial_invertebrate, turtle, unidentifiable, unknown, vertebrate, zooplankton]. Clearly these need work!
In tables that have a parameter/value format, restrict the parameters to a certain allowed set. FISH_MORPHOMETRICS: [landmarks and measurements]. FISH_OTOLITHS: [ageAtCapture, annulusRadius, lengthAtCapture, otolithWeight, sulculGrooveLongAxisAnnulusIncrement, sulculGrooveLongAxisTotalRadius, sulculGrooveShortAxisAnnulusIncrement, sulculGrooveShortAxistotalRadius, totalRadius, weightAtCapture]. SED_TRAP_DATA: [C, N, P]. WATER_CHEM: [DOC, nitrate, particulateP, POC, PON, SRP, TN, TP]
restrict QCcode in WATER_CHEM to specific values [code1 through code15, or NA]

Range checks

depthTop and depthBottom shouldn't be deeper than X deepest value
In general, we could easily add range checks, if someone wanted to come up with them. Maybe a separate script?
Any percents should be restricted to a max of 100.

Foreign keys/agreement between columns

Could add taxonomy checks for all tables that include taxonomy, such as BENTHIC_INVERTS. Right now, we only have a foreign key for OTU. Separate question, I guess, is why we need to have those taxonomy columns in BENTHIC_INVERTS in the first place, since you could just do a join with OTU.
In BENTHIC_INVERTS and FISH_INFO, we should check that invertNum/fishNum/trailerNum/boatNum match up to the numbers in invertID/fishID/trailerID/boatID. Ditto for any other tables that use this system.
Check the speciesCode_wiDNR column
Check that CREEL_FISH$speciesCode matches the abbreviation column in OTU
Check DOY in CREEL_SAMPLES
Check that the UNITS table is correct! See also #72.

kaijagahm commented 3 years ago

Status update on the primary keys:

Randi fixed BPB. Ready to write that one out. Same total number of rows (272). 7 rows have new updateID.
Removed duplicates for ZOOPS_PRODUCTION, and removed the SD columns. This one's finished.
Following Randi's advice, I averaged the remaining duplicate P values together in SED_TRAP_DATA. We can now define the primary key as sampleID x parameter.
WATER_CHEM: Pulled out the two experiments and put them in the Experiments folder. Still need to finalize the two DOC rows--waiting for Randi to check data sheets.
ZOOPS_SUBSAMPLE: going to talk to Randi to see if we can use math to work out the duplicates in ZAB.

kaijagahm commented 3 years ago

Here are some pretty extensive notes from a meeting I had with Randi on 4/15 to try to resolve the remaining issues in ZOOPS_SUBSAMPLE/ZOOPS_ABUND_BIOMASS.

The color coding will hopefully be helpful in keeping track of all the sampleID's floating around. NOTE: the colors in the ggplot have nothing to do with the colored text!

zoopsNotes.20210415.docx

@Randinotte do you think you could follow up on the 2012 sampleID at some point and try to figure out which row in ABUND_BIOMASS actually got measured by re-doing the calculation with the lengths in LENGTHS? If that's not straightforward, we can talk about it again.

kaijagahm commented 3 years ago

Finished ZAB and ZS, based on our decision in the 4/20 meeting to average the values and leave a comment.

Now finished and ready for db update (assign primary keys in SQL; add checks in relationalChecks, etc)

BACTERIAL_PRODUCTION_BENTHIC
SED_TRAP_DATA
ZOOPS_ABUND_BIOMAS
ZOOPS_PRODUCTION
ZOOPS_SUBSAMPLE

Still to do:

GC (has its own issue--#118, so shouldn't stop us from closing this one)
FISH_OTOLITHS (has its own issue-- #134, so shouldn't stop us from closing this one)
WATER_CHEM: @Randinotte will check DOC data sheets for CB_DeepHole_20110630_1233_PML_0_Methane.Sample.20110601--we initially had two rows labeled as PML (i.e. depthBottom wasn't filled in) and one row as PML0 (depthBottom 0). Using the measurements in the depthBottom column, I filled these in retroactively so now we have, effectively, 3 rows with PML0. But Stuart and Chris think that the original PML rows probably should have been point0 instead of PML. This is the last remaining problem for WATER_CHEM, so once those data sheets are checked we can close this issue.

kaijagahm commented 3 years ago

@joneslabND there's just one loose end to tie up on this issue. You can ignore most of the above--this was long and complicated, and you don't need to read it all over.

The remaining problem is that we're trying to assign sampleID*parameter as the composite primary key for WATER_CHEM, and we have one remaining duplicate: CB_DeepHole_20110630_1233_PML_0_Methane.Sample.20110601 has two DOC measurements. This is a bit complicated because originally there were three measurements (a singleton and a pair) that got put through the DOC pipeline, so we now have an averaged value (17450) and a singleton value (17960) remaining. But both are assigned to PML_0.

Randi and I talked with you and Chris about this a while back, and we concluded that one of these values is supposed to be point_0 instead of PML_0. But I don't know which.

Here's all the DOC data from that date/lake. You can see those two PML_0 samples at the top.

There's no other 2011 DOC data from that lake. I thought about trying to compare the data to other years in the same lake, but the two values are pretty similar, and they're both at the same depth.

Do you think I should just arbitrarily assign one of them to point_0 and keep one as PML, or can you think of any way to check this against a data sheet? My impression is that there's no physical DOC data sheet, since the values that come out of the machine go directly into Excel (and I've already checked the raw Excel sheet--it lists both as PML). But maybe I'm wrong?

I'd love to get this one tied up in whatever way you think is appropriate.

joneslabND commented 3 years ago

make the top one the point one. Thanks!

Stuart

On Wed, Jun 9, 2021 at 11:45 AM Kaija Gahm @.***> wrote:

@joneslabND https://github.com/joneslabND there's just one loose end to tie up on this issue. You can ignore most of the above--this was long and complicated, and you don't need to read it all over.

The remaining problem is that we're trying to assign sampleID*parameter as the composite primary key for WATER_CHEM, and we have one remaining duplicate: CB_DeepHole_20110630_1233_PML_0_Methane.Sample.20110601 has two DOC measurements. This is a bit complicated because originally there were three measurements (a singleton and a pair) that got put through the DOC pipeline, so we now have an averaged value (17450) and a singleton value (17960) remaining. But both are assigned to PML_0.

Randi and I talked with you and Chris about this a while back, and we concluded that one of these values is supposed to be point_0 instead of PML_0. But I don't know which.

Here's all the DOC data from that sample. You can see those two PML_0 samples at the top.

[image: Screen Shot 2021-06-09 at 11 42 47 AM] https://user-images.githubusercontent.com/37053323/121386233-d2066f80-c917-11eb-8e66-ed9ba7590030.png

There's no other 2011 DOC data from that lake. I thought about trying to compare the data to other years in the same lake, but the two values are pretty similar, and they're both at the same depth.

Do you think I should just arbitrarily assign one of them to point_0 and keep one as PML, or can you think of any way to check this against a data sheet? My impression is that there's no physical DOC data sheet, since the values that come out of the machine go directly into Excel (and I've already checked the raw Excel sheet--it lists both as PML). But maybe I'm wrong?

I'd love to get this one tied up in whatever way you think is appropriate.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/MFEh2o/db/issues/99#issuecomment-857817741, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABIIYAYL6KD7AFUFT7XWUETTR6EABANCNFSM4UXMNPMA .

-- Stuart E. Jones Associate Professor University of Notre Dame Dept. of Biological Sciences 264 Galvin Life Sciences Notre Dame, IN 46556 @.*** (574) 631-5703

kaijagahm commented 3 years ago

Made this change, and wrote out the files. Went ahead and did their update_metadata descriptions in meta_4.7.0, in preparation for the database update.

Also incorporated the #153 fix into this script; see that issue for explanation.

kaijagahm commented 3 years ago

Finished the rest of these, and added primary keys/checks to the SQL and database check script, with database version 4.7.0, 6/29/2021.

MFEh2o / db

Brainstorming: general checks to do on the database #99