PecanProject / bety

Web-interface to the Biofuel Ecophysiological Traits and Yields Database (used by PEcAn and TERRA REF)
https://www.betydb.org
BSD 3-Clause "New" or "Revised" License
16 stars 38 forks source link

Broken constraints #162

Closed robkooper closed 9 years ago

robkooper commented 9 years ago

When loading the database into a version of BETY with the current constraints it fails:

Loading  pfts_species              : ERROR:  insert or update on table
"pfts_species" violates foreign key constraint "species_exists" DETAIL:  Key (specie_id)=(965) is not present in table "species".
robkooper commented 9 years ago

If I did my SQL correct, here are the broken PFT's:

select pft_id, pfts.name, specie_id from pfts, pfts_species left join species on specie_id=species.id where species.id is null and pfts.id=pft_id;
 pft_id |         name          | specie_id 
--------+-----------------------+-----------
     31 | ebifarm.salix         |      1242
     39 | plants                |      1242
     69 | ebifarm.salix.doe_vd  |      1242
     62 | salix                 |      1242
     18 | temperate.Mid_Conifer |       965
(5 rows)
gsrohde commented 9 years ago

It's good we are talking about this now since I was about to add a bunch more foreign key constraints with the "NOT VALID" option. This allows one to add the constraint before cleaning up the data (allowing us to do "lazy" cleanup)--it only affects updates (including inserts and deletes), not existing data. But I didn't take into account how your database syncing script works.

dlebauer commented 9 years ago

Yes, these are orphaned keys - perhaps due to deletion of a duplicate species.

I think once we have the 'not valid' flag it will be easier to go through and identify these orphaned foreign keys.

easy fix for this case (I have applied)

delete from pfts_species where specie_id in (1242, 965)

@robkooper please close if this works now. Then we don't need to add a 'not valid' option in this case.

gsrohde commented 9 years ago

Of the 82 foreign key constraints I was about to add, I had to add the "NOT VALID" option to 27 of them because of needed cleanup of the database. It seems like the options are to delay adding these 27 constraints until we do the necessary database cleanup or coming up with a workaround to the problems the bad data cause to the load.bety.sh script.

robkooper commented 9 years ago

I think part of adding a constraint should be testing and fixing the data that is currently in the database. Adding a flag is kludge and will require in PEcAn to make sure we check the flag as well as in any other place in the BETY interface.

robkooper commented 9 years ago

The following pfts do not exist:

ebi_production=# select distinct pft_id, count(specie_id) from pfts_species left join pfts on pft_id=pfts.id where pfts.id is null group by pft_id order by pft_id;
 pft_id | count 
--------+-------
     29 |     1
     94 |    38
     98 |     1
    111 |    60
    112 |    60
    113 |    60
    114 |    60
    115 |    60
gsrohde commented 9 years ago

I was of course planning to do the "ALTER TABLE VALIDATE CONSTRAINT ___" statement (in a follow-up migration, I guess) on each "NOT VALID" constraint once the necessary cleanup was done (this removes the NOT VALID flag). Mainly I wanted to "get these constraints we've been talking about so long done and in place already"--we could at least in that way ensure we don't generate new bad data before we clean up the old bad data.

But it does seem that it would be a pain to get the data-syncing script to play nicely with the constraints before the data is cleaned up. I assume this is what you mean when you talk about "[requiring] in PEcAn to make sure we check the flag" since I can't think of any other case where it would especially matter whether the data has been cleaned up before adding the constraints or not. It's also hard to think of any likely scenario where not yet having cleaned up the data would adversely affect the BETYdb web interface.

dlebauer commented 9 years ago

in such cases the solution is fairly straightforward

delete from pfts_species where pft_id in (29, 94, 98, 111, 112, 113, 114, 115)

But in general, I agreed with Scott that fixing data at the same time we implement constraints would confound a well defined task (adding constraints) into an unconstrained task (fixing data) that will require personnel that we don't currently have ...

Is the problem that the data that is synced must respect the constraints? Is there a way to turn off constraints during import? I agree that this sounds kludgy, but at least it is tractable.

robkooper commented 9 years ago

The problem is that the script that is used to sync the database first loads the schema and then each subset of data from the psql file, inserting each row one at a time. If a constraint is placed it will validate the constraint against the data and thus fail inserting data that does pass the constraint.

robkooper commented 9 years ago

import of database succeeded with no more errors.

dlebauer commented 9 years ago

so ... this issue can be closed?

@gsrohde could you create an issue that summarizes the "NOT VALID" flags and other constraint violations that occur as you implement constraints on your development copy of the database? Perhaps start with one issue per constraint type, and create a list like:

Hopefully this will be a good place to start, and I can get 80% done in one sitting.

robkooper commented 9 years ago

Constraints are fixed, database loads.

gsrohde commented 9 years ago

Will do. FYI there are 25 "problem" foreign key constraints. If I find that many of them only involve a few rows, I may combine some of these into one issue if that's OK.

dlebauer commented 9 years ago

I may combine some of these into one issue if that's OK.

I meant one issue per constraint type (e.g. one issue for the foreign key constraints etc ... one for value constraints, etc.) some may be easier to tick off than others ...

gsrohde commented 9 years ago

I just saw this--too late now. I haven't yet gotten to writing up any migrations for anything other than foreign key constraints.

dlebauer commented 9 years ago

what you've been doing is fine