MFEh2o / db

**Contains the main issue tracker for the MFE DB!** Functions for interacting with the MFE database, in script format. (See also MFEUtilities, which is an R package that includes many/most of the same functions).

Brainstorming: general checks to do on the database #99

Closed kaijagahm closed 3 years ago

kaijagahm commented 3 years ago

Keep adding to this as we think of more

Randinotte commented 3 years ago

I have been coincidentally writing check functions during this last db update in order to be more meticulous. So, these may not be exactly in the form that we would want them for a broad check tool, but they're at least a little fleshed out. In my QCfuns.R script I have:

So these would take care of some of them:

- checkDuplicates()
- checkMetadata()
- replicateSampleIDs
- checkINFO()

not covered
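As a rough sketch of what one of these checks might look like once generalized (a hypothetical illustration, not the actual QCfuns.R code; function and argument names are made up):

```r
# Hypothetical sketch, not the actual QCfuns.R implementation.
# `df` is a database table read into a data frame; `keyCols` names the
# column(s) whose combination should be unique.
checkDuplicates <- function(df, keyCols) {
  stopifnot(is.data.frame(df), all(keyCols %in% names(df)))
  # Paste the key columns together to form one candidate key per row
  key <- do.call(paste, c(df[keyCols], sep = "_"))
  dups <- df[duplicated(key) | duplicated(key, fromLast = TRUE), , drop = FALSE]
  if (nrow(dups) > 0) {
    warning(nrow(dups), " rows share a duplicated key across: ",
            paste(keyCols, collapse = ", "))
  }
  dups # return the offending rows for inspection
}
```

Because it takes the key columns as an argument, the same function could cover both single-column and composite keys.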

Randinotte commented 3 years ago

The fun part is that there are differences in format between tables, mostly in sampleID structure, so what works for a sampleID string in FISH doesn't equate with WATER_CHEM without some attention to detail. Just something to watch out for as we continue...

kaijagahm commented 3 years ago

A more general note about all of this: we really should have a check, either in the SQL script itself or in a separate R function, that makes sure the primary key for each table (either one column or a composite primary key) is actually unique. I think that would go a long way to resolving a lot of these issues. See for example #103 .

kaijagahm commented 3 years ago

Another one, inspired by #102: check for duplicate metadataID's in METADATA. But this also speaks to my previous comment, on preventing duplicates in the primary key columns in general.

kaijagahm commented 3 years ago

Inspired by #104 and the weird fish problem:

kaijagahm commented 3 years ago

General note: I think a good way to aggregate all these check functions would be to write a package. I have done this once now and feel relatively confident in my ability to get a simple package off the ground.

In response to @Randinotte's points above about the formatting being different between different tables: I think this would be relatively straightforward to tackle with some extra function arguments. So for example, if there's a function that needs to behave differently depending on the format of the sampleID, you could just have an argument 'type' or something, which could take values 'fish', 'regular', or something along those lines.
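A minimal sketch of that idea (the sampleID component layout assumed here is invented for illustration; the real formats differ by table):

```r
# Hypothetical sketch of a 'type' argument controlling sampleID parsing.
splitSampleID <- function(sampleID, type = c("regular", "fish")) {
  type <- match.arg(type)
  parts <- strsplit(sampleID, "_", fixed = TRUE)[[1]]
  if (type == "fish") {
    # Assume (for illustration) fish IDs carry one extra trailing component
    list(lake = parts[1], site = parts[2], extra = parts[length(parts)])
  } else {
    list(lake = parts[1], site = parts[2])
  }
}
```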

Randinotte commented 3 years ago

@kaijagahm I have an argument fish = T/F, but it feels a little clunky. Still, it might be a starting place when we're considering this. I know it's in recreateSampleID(), and maybe some others.

kaijagahm commented 3 years ago

I wrote some check functions for date and time checking (dateTimeSample, dateTimeSet, sampleID), as well as some helpers (check whether something is a data frame; check whether columns exist in a given data frame). They're in utilities, in the 'checks.R' script. Documented minimally in the README.

kaijagahm commented 3 years ago

@Randinotte would you consider uploading some of the functions you've written to the utilities repo? I might start aggregating them into a package, because that will help with documentation and consistency, and it really won't take very long--writing the functions is the bulk of the work.

Randinotte commented 3 years ago

Sent the script via slack in our direct messages because I didn't have permission to upload to the utilities repo.

kaijagahm commented 3 years ago

Added Randi's functions to the utilities repo.

Randinotte commented 3 years ago

An idea: a way to check a script for unfinished case_when() statements, i.e. ones missing a final TRUE ~ xyz catch-all.

kaijagahm commented 3 years ago

Talked with Stuart today, 22 Jan 2021:

Types of constraints:

NOT NULL: makes sure that there are no null values in a given column. Should use that for:

PRIMARY KEY: can specify a column or a group of columns, in the SQL script, to serve as a primary key. Code is like this:

```sql
CREATE TABLE table_name(
   column_1 INTEGER NOT NULL,
   column_2 INTEGER NOT NULL,
   ...
   PRIMARY KEY(column_1,column_2,...)
);
```

Note that in SQLite, the PRIMARY KEY constraint does not imply NOT NULL, as it does in standard SQL; you have to specify that separately. PRIMARY KEY does automatically imply UNIQUE, though.

UNIQUE: makes sure each value in a column (or combination of columns) is unique. We can use this quite a bit in our database!

FOREIGN KEY: "A FOREIGN KEY is a field (or collection of fields) in one table that refers to the PRIMARY KEY in another table." (source)

CHECK: I don't have a good handle on how to use this one yet. You can do obvious things, like limit a numeric column to being less than/greater than/within a certain range. But I think you can also set constraints based on other columns, so I could imagine doing something like "if the depth class is a certain thing, disallow depths below X". But maybe that's too granular.
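To make the CHECK idea concrete, here's a minimal SQLite sketch (the table and column names are invented for illustration, not actual MFE tables):

```sql
-- Hypothetical illustration only, not a real MFE table definition.
CREATE TABLE DEPTH_EXAMPLE(
   sampleID TEXT NOT NULL,
   depthClass TEXT NOT NULL,
   depthTop REAL,
   PRIMARY KEY(sampleID),
   -- simple range check on a numeric column
   CHECK(depthTop >= 0),
   -- a CHECK can reference other columns in the same row, e.g.
   -- "if the depth class is PML, disallow depths below 10"
   CHECK(depthClass != 'PML' OR depthTop <= 10)
);
```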

kaijagahm commented 3 years ago

Created this google sheet to keep track of some of the fixes we described above. Added a sheet to it to track each table's primary keys.

Next, need to have someone check that these are the correct primary keys. Also need to check each primary key to see if it has duplicates.

Then, maybe do the same with foreign keys/between-table references. Could start by adding in some of the low-hanging fruit: designating primary keys in all tables where we can.

kaijagahm commented 3 years ago

As of the 1/26 meeting: Stuart and Chris are going to check that I have the primary keys correct (see the sheet linked in the previous comment).

Stuart brought up the question of error messages from SQL. When you enforce a constraint and it fails, e.g. because there are non-unique values in a column that's supposed to be a primary key, what does SQL tell you?

I tried this out by copying the currentDB folder and adding a PRIMARY KEY constraint to BACTERIAL_PRODUCTION_BENTHIC on sampleID, which we know has some duplicates. Then I tried to create the database. As expected, this failed, because PRIMARY KEY implicitly enforces UNIQUE as well.

@joneslabND was worried that the error messages might not be informative enough about where the error occurred. He's right: although the error message does tell you which table failed, it doesn't tell you that it was the PRIMARY KEY constraint that was the culprit. You have to know that PRIMARY KEY implies UNIQUE for this error message to make sense:

[Screenshot: SQL error message from the failed database build, 2021-01-26]

However! There's an unexpected benefit: the SQL error messages tell you which rows are duplicates. This could be overwhelming if you had a whole bunch of duplicates, but if there are only a few (as will usually be the case), it could be really useful for identifying them. If you look closely, the message also tells you which column is the culprit: BACTERIAL_PRODUCTION_BENTHIC.sampleID. That will be useful if we, say, put a PRIMARY KEY constraint on one column and a separate UNIQUE constraint on a different column.
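For anyone who wants to reproduce the shape of that error without touching the real database, here's a throwaway example at the command line (file and table names are made up; the real build script differs):

```shell
# Hypothetical throwaway demo, not the real build script.
rm -f /tmp/pk_demo.db
sqlite3 /tmp/pk_demo.db "
  CREATE TABLE t(sampleID TEXT PRIMARY KEY);
  INSERT INTO t VALUES ('A');
  INSERT INTO t VALUES ('A');"
# sqlite3 reports something like: UNIQUE constraint failed: t.sampleID
```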

We should think more about whether R scripts with more informative error messages will have a role to play as well, but for now I'm actually fairly happy with the info that SQL provides.

Randinotte commented 3 years ago

This jogged my memory, so I'll add: @kaijagahm remember that weird csv that had a problem with delimiters, which threw off the number of columns in the tables and made errors in the formation of the .db file? It gave a similarly formatted list of rows, with an error message along the lines of "TABLE_NAME.txt:##: number of columns. Expected ## number of columns and had ##". Thought I would pass that on in case we end up writing a list of potential SQL error messages.

kaijagahm commented 3 years ago

Database update 4.4.0 adds primary key constraints for many (but not all) tables.

Tables that now have a PK constraint: BACTERIAL_PRODUCTION_PELAGIC, BENTHIC_INVERT_SAMPLES, CHLOROPHYLL, COLOR, CREEL_BOATS, CREEL_BOAT_SAMPLES, CREEL_FISH, CREEL_INFO, CREEL_INTERVIEW, CREEL_SAMPLES, CREEL_TRAILERS, CREEL_TRAILER_SAMPLES, DRY_MASS_EQUATIONS, FISH_DIETS, FISH_INFO, FISH_MORPHOMETRICS, FISH_SAMPLES, FISH_YOY, FLIGHTS, FLIGHTS_INFO, FLIGHTS_SAMPLES, ISOTOPE_BATCHES, ISOTOPE_RESULTS, ISOTOPE_SAMPLES_BENTHIC_INVERTS, ISOTOPE_SAMPLES_DIC, ISOTOPE_SAMPLES_FISH, ISOTOPE_SAMPLES_METHANE, ISOTOPE_SAMPLES_PERIPHYTON, ISOTOPE_SAMPLES_POC, ISOTOPE_SAMPLES_WATER, ISOTOPE_SAMPLES_ZOOPS, LAKES, LAKES_GIS, LAKE_BATHYMETRY, LIMNO_PROFILES, LIPID_EXTRACTIONS, LIPID_SAMPLES, METADATA, MOLECULAR_SAMPLE, OTU, PIEZOMETERS_INSTALL, PIEZOMETERS_LAKE, PIEZOMETERS_SENSORS, PIEZOMETERS_SURVEYING, PIEZOMETERS_UPLAND, PRIMARY_PRODUCTION_BENTHIC, PROJECTS, RHODAMINE, RHODAMINE_RELEASE, SAMPLES, SEDIMENT, SED_TRAP_SAMPLES, SITES, STAFF_GAUGES, UNITS, VERSION_HISTORY, ZOOPS_COEFFICIENTS, ZOOPS_LENGTHS

Tables that will not get a PK: CREW, LITERATURE_DATA (removed from the database as of 4.4.0), MOLECULAR_PROCESSED (removed from the database as of 4.4.0), PUBLICATIONS_PRESENTATIONS, UPDATE_METADATA

Tables that still need a primary key constraint defined: BACTERIAL_PRODUCTION_BENTHIC, BENTHIC_INVERTS, FISH_OTOLITHS, GC, SED_TRAP_DATA, TPOC_DEPOSITION, WATER_CHEM, ZOOPS_ABUND_BIOMASS, ZOOPS_PRODUCTION, ZOOPS_SUBSAMPLE

I am creating a new script to address the latter category, called morePrimaryKeys_gh99.R.

kaijagahm commented 3 years ago

For SED_TRAP_SAMPLES:

For TPOC_DEPOSITION:

Randinotte commented 3 years ago

Whenever you get back to needing the functions in my QCfuns.R script (that you put in the utilities repo for me), let me know and I'll send you the most recent version. I added a quick "check updateID's" function, and I'm still having a problem with my "recreate sampleID's" function, although you probably have one that works after dealing with the duplicate sampleID's issue for so long.

kaijagahm commented 3 years ago

Noting here: I've created a script, relationalChecks.R, that will perform foreign key, sampleID, and other checks on each of the database tables. The idea is to run it before each database update.

As of today:

kaijagahm commented 3 years ago

Successfully created a new mini database version, v3.5.1, that resolves almost all of the relational and primary key checks. It's not on Box yet, and I may not add it until the next version is up; we'll see.

Dealt with the NA thing by using replace_na() in the sCheck function. Also added a dateTime check that works well.
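For context, the NA problem is generic in R: NA == NA evaluates to NA, so NA keys silently fail equality checks and joins. A small illustration of the replace_na() approach (this is not the actual sCheck code, and the column name is made up):

```r
library(tidyr)  # provides replace_na()

# NA keys never match during a join or uniqueness check, so replace
# them with an explicit sentinel first. Column name is hypothetical.
keys <- data.frame(depthBottom = c(0.5, NA, 2))
keys <- replace_na(keys, list(depthBottom = -999))
```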

Checks that are still unresolved are listed in #130, and additional primary key checks are in the primary key checks script.

Brainstorming more checks that aren't in the database yet:

- Data types
- Allowed values
- Range checks
- Foreign keys/agreement between columns

kaijagahm commented 3 years ago

Status update on the primary keys:

kaijagahm commented 3 years ago

Here are some pretty extensive notes from a meeting I had with Randi on 4/15 to try to resolve the remaining issues in ZOOPS_SUBSAMPLE/ZOOPS_ABUND_BIOMASS.

The color coding will hopefully be helpful in keeping track of all the sampleID's floating around. NOTE: the colors in the ggplot have nothing to do with the colored text!

zoopsNotes.20210415.docx

@Randinotte do you think you could follow up on the 2012 sampleID at some point and try to figure out which row in ABUND_BIOMASS actually got measured by re-doing the calculation with the lengths in LENGTHS? If that's not straightforward, we can talk about it again.

kaijagahm commented 3 years ago

Finished ZAB and ZS, based on our decision in the 4/20 meeting to average the values and leave a comment.

Now finished and ready for db update (assign primary keys in SQL; add checks in relationalChecks, etc)

Still to do:

kaijagahm commented 3 years ago

@joneslabND there's just one loose end to tie up on this issue. You can ignore most of the above--this was long and complicated, and you don't need to read it all over.

The remaining problem is that we're trying to assign sampleID*parameter as the composite primary key for WATER_CHEM, and we have one remaining duplicate: CB_DeepHole_20110630_1233_PML_0_Methane.Sample.20110601 has two DOC measurements. This is a bit complicated because originally there were three measurements (a singleton and a pair) that got put through the DOC pipeline, so we now have an averaged value (17450) and a singleton value (17960) remaining. But both are assigned to PML_0.

Randi and I talked with you and Chris about this a while back, and we concluded that one of these values is supposed to be point_0 instead of PML_0. But I don't know which.

Here's all the DOC data from that date/lake. You can see those two PML_0 samples at the top.

[Screenshot: all DOC data from that date/lake, 2021-06-09]

There's no other 2011 DOC data from that lake. I thought about trying to compare the data to other years in the same lake, but the two values are pretty similar, and they're both at the same depth.

Do you think I should just arbitrarily assign one of them to point_0 and keep one as PML, or can you think of any way to check this against a data sheet? My impression is that there's no physical DOC data sheet, since the values that come out of the machine go directly into Excel (and I've already checked the raw Excel sheet--it lists both as PML). But maybe I'm wrong?

I'd love to get this one tied up in whatever way you think is appropriate.

joneslabND commented 3 years ago

make the top one the point one. Thanks!

Stuart


kaijagahm commented 3 years ago

Made this change, and wrote out the files. Went ahead and did their update_metadata descriptions in meta_4.7.0, in preparation for the database update.

Also incorporated the #153 fix into this script; see that issue for explanation.

kaijagahm commented 3 years ago

Finished the rest of these, and added primary keys/checks to the SQL and database check script, with database version 4.7.0, 6/29/2021.