PecanProject / bety

Web-interface to the Biofuel Ecophysiological Traits and Yields Database (used by PEcAn and TERRA REF)
https://www.betydb.org
BSD 3-Clause "New" or "Revised" License

Create generic function for updating / removing duplicates #185

Open dlebauer opened 9 years ago

dlebauer commented 9 years ago

When implementing uniqueness constraints, duplicate records come to light and need to be merged.

We need a function (in SQL or R) that will do the following: repoint every reference to the duplicate record at the record being kept, then delete the duplicate.

The arguments to the function would be tablename, record, and duplicate_record.

To identify all referring tables:

SELECT table_name
FROM information_schema.columns
WHERE table_schema = 'public'
  AND column_name = 'citation_id'
  AND table_name NOT LIKE '%view%';

Here is some pseudo-code

referents <- select table_name from ... (see above)

function(table, id, duplicate_id) {
  fk <- paste0(sub('s$', '', table), '_id')  # e.g. 'citations' -> 'citation_id'
  for (referent in referents)
    update referent set fk = id where fk = duplicate_id
  delete from table where id = duplicate_id
}

gsrohde commented 9 years ago

Here's the SQL function I came up with. It doesn't handle the case where an update would cause a uniqueness violation except by reporting the error and continuing.

You can try this out by pasting the code into a psql session or Navicat query window (preferably on a copy of BETYdb until we're sure there are no glitches). The function arguments, if it isn't obvious, are (1) the name of the table you are eliminating duplicates from; (2) the id of the row you are eliminating; (3) the id of the row you are keeping.

Note that you will have to run this in a SELECT statement, for example:

SELECT update_refs_from_to('citations', 286, 289);

CREATE OR REPLACE FUNCTION update_refs_from_to(
  primary_table_name varchar,
  old_id bigint,
  new_id bigint
) RETURNS void AS $$
DECLARE
  foreign_key_col_name varchar;
  referring_table_name varchar;
  update_stmt varchar;
  delete_stmt varchar;
BEGIN
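  -- Derive the conventional foreign-key column name, e.g. 'citations' -> 'citation_id'.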
  foreign_key_col_name := regexp_replace(primary_table_name, 's$', '') || '_id';
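  -- Loop over every table in the public schema that has an updatable column with that foreign-key name.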
  FOR referring_table_name IN SELECT table_name FROM information_schema.columns WHERE table_schema = 'public' AND "column_name" = foreign_key_col_name AND is_updatable = 'YES' LOOP

    BEGIN
      update_stmt := 'UPDATE ' || referring_table_name || ' SET ' || foreign_key_col_name || ' = ' || new_id || ' WHERE ' || foreign_key_col_name || ' = ' || old_id;
      RAISE NOTICE 'Attempting to run %', update_stmt;
      EXECUTE update_stmt;
      RAISE NOTICE 'Success!';
    EXCEPTION
      WHEN unique_violation THEN
        RAISE NOTICE 'UPDATE FAILED!!!';
        RAISE NOTICE 'Updating table column % in table % would violate uniqueness constraints', foreign_key_col_name, referring_table_name;
    END;
  END LOOP;
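  -- With all references repointed to new_id, remove the duplicate row itself.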
  BEGIN
    delete_stmt := 'DELETE FROM ' || primary_table_name || ' WHERE id = ' || old_id;
    RAISE NOTICE 'Attempting to run %', delete_stmt;
    EXECUTE delete_stmt;
    RAISE NOTICE 'Success!';
  EXCEPTION
    WHEN foreign_key_violation THEN
      RAISE NOTICE 'DELETION FAILED!!!';
      RAISE NOTICE 'Deletion from table % of the row with id % would cause a foreign-key violation', primary_table_name, old_id;
  END;
END
$$ LANGUAGE plpgsql;

gsrohde commented 9 years ago

@dlebauer It occurs to me now that I should update this function so that it also sets the updated_at column if it exists (and I think it always or nearly always does), and likewise the updated_user_id column if it exists, or, failing that, the user_id column if it exists.

If you agree, should I also do this when I am normalizing the whitespace in textual columns? (I've already done some whitespace normalization without doing this.)

dlebauer commented 9 years ago

Yes,

"update this function to update the updated_at column"

It looks like this just requires adding , updated_at = now() before the WHERE clause in the update statement.
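
A minimal sketch of that change inside update_refs_from_to, assuming every referring table actually has an updated_at column:

update_stmt := 'UPDATE ' || referring_table_name
            || ' SET ' || foreign_key_col_name || ' = ' || new_id
            || ', updated_at = now()'
            || ' WHERE ' || foreign_key_col_name || ' = ' || old_id;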

"also do this when I am normalizing the whitespace"

Technically, yes, though don't worry about the previously updated rows. I have frequently forgotten to update this field.
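
Purely as an illustration (the table and column, citations.title, are just an example, not something agreed on in this thread), a whitespace-normalization statement that also bumps updated_at might look like:

UPDATE citations
   SET title = regexp_replace(trim(title), '\s+', ' ', 'g'),
       updated_at = now()
 WHERE title <> regexp_replace(trim(title), '\s+', ' ', 'g');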

dlebauer commented 8 years ago

@gsrohde Can we add this function to the schema? Is it a good idea?

dlebauer commented 5 years ago

Can this be added as a stored procedure?

gsrohde commented 5 years ago

@dlebauer

Bona fide “Stored Procedures” don’t appear in PostgreSQL until version 11 and we require only version 9.4, so I’ll take this in the informal sense of “function that doesn’t return a value.”
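
(For reference only, since it is unusable while we support 9.4: the version 11+ syntax would look roughly like the following, with the body of the existing function dropped in, and it would be invoked with CALL rather than SELECT.)

CREATE PROCEDURE update_refs_from_to(
  primary_table_name varchar,
  old_id bigint,
  new_id bigint
)
LANGUAGE plpgsql
AS $$
BEGIN
  NULL;  -- the body of the existing update_refs_from_to function would go here
END
$$;

CALL update_refs_from_to('citations', 286, 289);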

As it stands, this function is seriously deficient and even rather dangerous, and I would hesitate to add it in its current form. For example:

  1. It doesn’t roll back updates when some succeed and others fail.
  2. It mishandles join tables: if a join table links a row in some other table to both the row being kept and the row being removed, repointing the old id creates a duplicate pair. If the join table has the uniqueness constraint it should have (preventing the same join from being recorded twice), the update fails; if it doesn't, the table ends up with multiple rows joining the same two referred-to rows. (A concrete illustration follows below.)
  3. The “notice” output doesn’t make it entirely clear what is going on.

These are minor problems and could be corrected without too much trouble.
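
To make item 2 concrete, here is a hypothetical illustration using pfts_species (column names follow the same singular-plus-_id convention the function itself uses; the ids are invented). Suppose species 100 duplicates species 200 and some PFT already references both. The colliding join rows would have to be removed before the blanket update:

-- Remove join rows that would collide once specie_id 100 is repointed to 200:
DELETE FROM pfts_species old_row
 WHERE old_row.specie_id = 100
   AND EXISTS (
     SELECT 1 FROM pfts_species new_row
      WHERE new_row.pft_id = old_row.pft_id
        AND new_row.specie_id = 200);

-- Only then is the repointing safe:
UPDATE pfts_species SET specie_id = 200 WHERE specie_id = 100;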

There is, however, a much bigger problem: it is nearly impossible to effectively impose uniqueness and foreign-key constraints with our current distributed BETYdb/synchronization system. (Value constraints, which are “row-local”, are not a problem.)

As an example, suppose machine 2 has a PFT composed of several species whose ids are in the machine 0 range, and the administrator of the machine 0 database then decides to eliminate the row for one of those ids because it duplicates another species row. Even if the pfts and pfts_species tables on machine 0 contain the relevant information from machine 2, and even if machine 0’s pfts_species table is updated correctly to reflect the merging of the species rows, these corrections will never propagate to machine 2: machine 2 “owns” the data it originated, and data it owns is not updated when and if it syncs with machine 0. Moreover, the fixes to machine 0’s pfts_species table will be overwritten if and when it syncs with machine 2.

Conversely, if the administrator of machine 2 attempts to sync with machine 0, either the sync will fail or there will suddenly be dangling references to the now-deleted row from machine 0. (I think the sync script was “fixed” so that the latter happens.)
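
Purely to illustrate what such dangling references would look like (this is not something the sync script does, and the table and column names assume the standard pfts_species/species layout), a query along these lines would flag join rows whose species row no longer exists:

SELECT ps.pft_id, ps.specie_id
  FROM pfts_species ps
  LEFT JOIN species s ON s.id = ps.specie_id
 WHERE s.id IS NULL;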

(As an aside, there probably should be a restriction on editing imported data and metadata in the BETYdb Rails app since it will simply be overwritten the next time data is reimported from the originating machine.)

There may be ways to mitigate this problem. If the primary purpose of data synchronization is to make multiple data sets available from a single machine, then perhaps we could require that all references in a row that originated on machine A be to other rows that also originated on machine A. All the traits on machine 8, for example, refer to the same species, so it would not cost much for machine 8 to have its own copy of that species row. To what extent this applies to other metadata and other machines, I don’t know.
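
A rough sketch of how such a check might look, assuming the convention that each machine owns a contiguous block of one billion ids (so a row's originating machine can be inferred from its id), with traits.specie_id used purely as an example of a reference to check:

-- Integer division groups ids into their one-billion-id machine blocks.
SELECT id, specie_id
  FROM traits
 WHERE (id - 1) / 1000000000 <> (specie_id - 1) / 1000000000;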

Short of a systemic fix like that, the best approach may be to announce a general revision of ids to the administrators of the various distributed copies of BETYdb, perform the revisions on the EBI machine, and then send each administrator a script that will suitably update their own machine. In any case, this is far beyond the scope of a simple procedure like the one discussed here.