PecanProject / bety

Web-interface to the Biofuel Ecophysiological Traits and Yields Database (used by PEcAn and TERRA REF)
https://www.betydb.org
BSD 3-Clause "New" or "Revised" License
16 stars 38 forks source link

managements: make (date, mgmttype) a key #191

Open gsrohde opened 9 years ago

gsrohde commented 9 years ago
SELECT * FROM managements WHERE mgmttype IS NULL;

yields 200 rows!

Non-uniqueness of (date, mgmttype)

SELECT ARRAY_AGG(id), date, mgmttype FROM managements GROUP BY date, mgmttype HAVING COUNT(*) > 1;

yields 678 groups, some involving as many as two dozen or more rows! This will only increase after NULLs are eliminated. Is this really a viable key, or do we need to include another column?

To see the full rows of all of the duplicates, try running

SELECT * FROM managements m WHERE EXISTS(SELECT 1 FROM managements m2 WHERE m.id != m2.id AND m.date = m2.date AND m.mgmttype = m2.mgmttype) ORDER BY mgmttype, date;

Adding constraints

The code is

ALTER TABLE managements ALTER COLUMN date SET NOT NULL;
ALTER TABLE managements ALTER COLUMN mgmttype SET NOT NULL;
ALTER TABLE managements ADD CONSTRAINT unique_date_per_mgmttype UNIQUE (date, mgmttype);

In addition to constraints for implementing the key, we should ensure mgmttype is whitespace-normalized.

dlebauer commented 8 years ago

date can be null. Many of these are fertilization rates:

SELECT mgmttype, count(*) as n 
   FROM managements 
   WHERE date IS NULL group by mgmttype order by n desc;

and others are rates of irrigation etc. Where the date(s) of fertilization is not known and may have occurred repeatedly over many years. So we can allow NULL dates for managements.

dlebauer commented 8 years ago

For the duplicates, it is only meaningful to expect that (date, mgmttype) will be unique for the subset of records associated with a single treatment_id.

gsrohde commented 8 years ago

Even if you group by not only date and mgmttype but level and units as well, there are plenty of duplicates even among managements associated with a common treatment. Try running

select distinct count(*), array_agg(m.id) treatment_id, date, mgmttype, level, units from managements_treatments mt join managements m on m.id = mt.management_id group by treatment_id, date, mgmttype, level, units having count(*) > 1;

At least most of the groups of duplicates are only of size 2. Decide if you want to worry about this.