Open dlebauer opened 9 years ago
Here's the SQL function I came up with. It doesn't handle the case where an update would cause a uniqueness violation except by reporting the error and continuing.
You can try this out by pasting the code into a psql session or Navicat query window (preferably on a copy of BETYdb until we're sure there are no glitches). The function arguments, if it isn't obvious, are (1) the name of the table you are eliminating duplicates from; (2) the id of the row you are eliminating; (3) the id of the row you are keeping.
Note that you will have to run this in a SELECT statement, for example:
SELECT update_refs_from_to('citations', 286, 289);
CREATE OR REPLACE FUNCTION update_refs_from_to(
    primary_table_name varchar,
    old_id bigint,
    new_id bigint
) RETURNS void AS $$
DECLARE
    foreign_key_col_name varchar;
    referring_table_name varchar;
    update_stmt varchar;
    delete_stmt varchar;
BEGIN
    -- Derive the conventional foreign-key column name, e.g. "citations" -> "citation_id".
    foreign_key_col_name := regexp_replace(primary_table_name, 's$', '') || '_id';

    -- Visit every public table having that foreign-key column.
    FOR referring_table_name IN
        SELECT table_name
        FROM information_schema.columns
        WHERE table_schema = 'public'
          AND column_name = foreign_key_col_name
          AND is_updatable = 'YES'
    LOOP
        BEGIN
            update_stmt := 'UPDATE ' || referring_table_name
                        || ' SET ' || foreign_key_col_name || ' = ' || new_id
                        || ' WHERE ' || foreign_key_col_name || ' = ' || old_id;
            RAISE NOTICE 'Attempting to run %', update_stmt;
            EXECUTE update_stmt;
            RAISE NOTICE 'Success!';
        EXCEPTION
            WHEN unique_violation THEN
                RAISE NOTICE 'UPDATE FAILED!!!';
                RAISE NOTICE 'Updating column % in table % would violate uniqueness constraints', foreign_key_col_name, referring_table_name;
        END;
    END LOOP;

    -- With all references repointed, remove the duplicate row itself.
    BEGIN
        delete_stmt := 'DELETE FROM ' || primary_table_name || ' WHERE id = ' || old_id;
        RAISE NOTICE 'Attempting to run %', delete_stmt;
        EXECUTE delete_stmt;
        RAISE NOTICE 'Success!';
    EXCEPTION
        WHEN foreign_key_violation THEN
            RAISE NOTICE 'DELETION FAILED!!!';
            RAISE NOTICE 'Deletion from table % of the row with id % would cause a foreign-key violation', primary_table_name, old_id;
    END;
END
$$ LANGUAGE plpgsql;
@dlebauer It occurs to me now that I should update this function to update the updated_at column if it exists (and I think it always or nearly always does), and also to update the updated_user_id column if it exists, or, failing that, the user_id column if it exists.
If you agree, should I also do this when I am normalizing the whitespace in textual columns? (I've already done some whitespace normalization without doing this.)
Yes, "update this function to update the updated_at column" looks like it just requires adding ", updated_at = now()" before WHERE in the update statement.

As for "also do this when I am normalizing the whitespace": technically, yes, though don't worry about the previously updated rows. I have frequently forgotten to update this field.
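A sketch of the suggested change to the update statement inside the function (this assumes every referring table actually has an updated_at column; tables lacking it would need the plain statement):

```sql
-- Sketch: the update statement with ", updated_at = now()" added before WHERE.
-- Assumes the referring table has an updated_at column.
update_stmt := 'UPDATE ' || referring_table_name
            || ' SET '  || foreign_key_col_name || ' = ' || new_id
            || ', updated_at = now()'
            || ' WHERE ' || foreign_key_col_name || ' = ' || old_id;
```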
@gsrohde Can we add this function to the schema? Is it a good idea? Can it be added as a stored procedure?
@dlebauer
Bona fide “Stored Procedures” don’t appear in PostgreSQL until version 11 and we require only version 9.4, so I’ll take this in the informal sense of “function that doesn’t return a value.”
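For reference, a true stored procedure on PostgreSQL 11+ would be declared with CREATE PROCEDURE and invoked with CALL rather than SELECT; the sketch below (hypothetical name and body) shows the shape, but our 9.4 requirement rules it out:

```sql
-- PostgreSQL 11+ only. Illustrative wrapper around the existing function;
-- the procedure name "merge_refs" is hypothetical.
CREATE PROCEDURE merge_refs(primary_table_name varchar, old_id bigint, new_id bigint)
LANGUAGE plpgsql
AS $$
BEGIN
    PERFORM update_refs_from_to(primary_table_name, old_id, new_id);
END
$$;

CALL merge_refs('citations', 286, 289);
```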
As it stands, this function is seriously deficient and even rather dangerous, and I would hesitate to add it in its current form. For example, it interpolates table and column names into SQL without quoting them, and it merely reports uniqueness and foreign-key violations rather than resolving them.
These are minor problems and could be corrected without too much trouble.
There is, however, a much bigger problem: it is nearly impossible to effectively impose uniqueness and foreign-key constraints with our current distributed BETYdb/synchronization system. (Value constraints, which are “row-local”, are not a problem.)
As an example, suppose machine 2 has a PFT composed of several species whose ids are in the machine 0 range. Then the administrator of the machine 0 database decides to eliminate the row corresponding to one of those ids because the information it contains duplicates another species row. Even if the pfts and pfts_species tables on machine 0 contain the relevant information from machine 2, and even if the pfts_species table is updated correctly to reflect the combining of the species rows, these corrections will never be propagated to machine 2, because machine 2 “owns” the data it originated, and the data it owns will not be updated when and if it syncs with machine 0. Moreover, the fixes to machine 0’s pfts_species table will be overwritten if and when it syncs with machine 2.
Conversely, if the administrator of machine 2 attempts to sync with machine 0, either the sync will fail or there will suddenly be dangling references to the now-deleted row from machine 0. (I think the sync script was “fixed” so that the latter happens.)
(As an aside, there probably should be a restriction on editing imported data and metadata in the BETYdb Rails app since it will simply be overwritten the next time data is reimported from the originating machine.)
There may be ways to mitigate this problem. For example, if the primary purpose of data synchronization is to make multiple data sets available from a single machine, then perhaps we could require, for example, that all references in a row that originated on machine A be to other rows that originated on machine A. All the traits on machine 8, for example, refer to the same species, so it would not be of great cost for machine 8 to have its own copy of this species row. To what extent this applies to other metadata and other machines, I don’t know.
Short of a systemic fix such as this, the best approach may be to announce a general revision of ids to all administrators of the various distributed copies of BETYdb, perform the revisions on the EBI machine, and then send out scripts to each administrator that will suitably update their own machines. In any case, this is far beyond the scope of a simple procedure such as this.
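The proposed restriction could at least be audited with a query. A minimal sketch, assuming BETYdb's convention that each machine's ids occupy a distinct one-billion-id block (machine N owning roughly N×10⁹ + 1 through (N+1)×10⁹; the divisor below is an assumption):

```sql
-- Hypothetical audit: find pfts_species rows whose pft and species
-- originated on different machines, inferring the owning machine from
-- the assumed one-billion-id block each machine is assigned.
SELECT pft_id, specie_id
FROM pfts_species
WHERE (pft_id - 1) / 1000000000 <> (specie_id - 1) / 1000000000;
```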
When implementing uniqueness constraints, duplicate records arise.

Need a function (in SQL or R) that will do the following: update all references to the duplicate record so that they point to the record being kept, then delete the duplicate. The arguments to the function would be tablename, record, and duplicate_record.

To identify all referring tables, here is some pseudo-code:
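A sketch of that lookup, following the same information_schema query the function above uses (shown here for citations, whose conventional foreign-key column would be citation_id):

```sql
-- Find every public table with an updatable citation_id column,
-- i.e., every table that may refer to rows in citations.
SELECT table_name
FROM information_schema.columns
WHERE table_schema = 'public'
  AND column_name = 'citation_id'
  AND is_updatable = 'YES';
```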