cybertec-postgresql / db_migrator

Generate intermediate SQL statements #33

Closed fljdin closed 1 year ago

fljdin commented 1 year ago

This PR adds a new set of methods between the staging tables and the migration methods (db_migrate_*).

Closes #26

laurenz commented 1 year ago

I agree with the general direction, but try not to stuff too many things in. You can always refactor more later.

Anyway, I ran the ora_migrator regression test, and it failed. The first difference was:

 SELECT db_migrate_tables(
    plugin => 'ora_migrator'
 );
-WARNING:  Error loading table data for testschema1.baddata
-DETAIL:  invalid byte sequence for encoding "UTF8": 0x00: 
- db_migrate_tables 
--------------------
-                 1
-(1 row)
-
+WARNING:  Error executing "INSERT INTO testschema1.baddata SELECT * FROM testschema1."baddata^G""
+DETAIL:  invalid byte sequence for encoding "UTF8": 0xf6 0x6e 0x20 0x6b: 
+ERROR:  "baddata" is not a foreign table
+HINT:  Use DROP TABLE to remove a table.
+CONTEXT:  SQL statement "DROP FOREIGN TABLE testschema1.baddata"
+PL/pgSQL function public.db_migrate_tables(name,name,boolean) line 45 at EXECUTE
 SELECT db_migrate_functions(
    plugin => 'ora_migrator'
 );

Looking at your changes to materialize_foreign_table(), I see why: before, the whole sequence (renaming the foreign table, creating the local table, copying the data and dropping the foreign table) was run in a subtransaction (a BEGIN / EXCEPTION / END block).

So if one of these steps failed, the whole subtransaction was rolled back, and all the above was undone.

Your refactored code does away with that subtransaction and relies entirely on execute_statement() to handle errors. Now in my regression tests, table baddata contains bad data that make copying the data from the foreign table to the local table fail. With my original code, renaming the foreign table and creating the local table were rolled back; the refactored code leaves them in place, which leads to the difference above and to errors further down the road.
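
For illustration, the pattern looks roughly like this (a minimal sketch, not the actual db_migrator code; the table names follow the regression test above, and baddata_fdw stands in for the renamed foreign table). A BEGIN / EXCEPTION / END block in PL/pgSQL runs its body in a subtransaction, so a failure in any step undoes all of them:

DO $$
BEGIN
   BEGIN  -- opens an implicit subtransaction
      ALTER FOREIGN TABLE testschema1.baddata RENAME TO baddata_fdw;
      CREATE TABLE testschema1.baddata (LIKE testschema1.baddata_fdw);
      INSERT INTO testschema1.baddata
         SELECT * FROM testschema1.baddata_fdw;
      DROP FOREIGN TABLE testschema1.baddata_fdw;
   EXCEPTION
      WHEN others THEN
         -- the rename, the CREATE TABLE and the INSERT are all undone
         RAISE WARNING 'Error loading table data for testschema1.baddata: %', sqlerrm;
   END;
END;
$$;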

fljdin commented 1 year ago

That's pretty uncomfortable, and you are right that this brings too many changes at once. My starting point should have been to create execute_statement() first, let you review it, and merge it.

To ease the review process, I'll start over from c5d7c4d with an entirely new PR and try to rebase this branch on it later.

laurenz commented 1 year ago

Do you still need this pull request for reference or shall we close it?

fljdin commented 1 year ago

I will work on new PRs based on this branch, following your valuable advice.

After those changes, the statement functions should be easier to implement and review.

laurenz commented 1 year ago

I get it. Thanks for all the effort you are putting into db_migrator!

fljdin commented 1 year ago

I finally achieved the goal of creating the set of low-level functions. Each commit contains the changes for one db_migrate_* method, for easy review. Please let me know whether I should squash them into one or not.

In several places, I noticed that the plugin parameter was probably not necessary, but I did not go further and kept it in the function signatures.

fljdin commented 1 year ago

> I wonder what the purpose of the change is. The code doesn't really get shorter. Is the goal to break out part of the code into functions, so that the functions don't grow too long?

I understand that the goal to be achieved may still seem nebulous. To tie in with a previous discussion (#24), my general idea is to simplify the code by cutting it into smaller parts, making it possible to parallelize the CREATE INDEX and ADD CONSTRAINT statements in the same way the materialize_foreign_table() function already allows.

During my work, I also identified several aspects of the code that deserve better test coverage (e.g. a bug in the sequence increment, #38). Likewise, the handling of the search_path could be simplified by settling on either @extschema@ or SET LOCAL search_path, so that they do not overwrite each other.

As you suggest in your comment, smaller functions reduce the complexity and responsibility they carry. A db_migrate_* (or wrapper) function is then only responsible for printing RAISE NOTICE messages and executing the SQL statements in the correct order, while a low-level function is responsible for constructing the queries from the staging schema and returning them.
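
As a sketch of that split (the function and column names here are illustrative, not the exact db_migrator API; in particular, a column_name column is assumed in the staging table), the low-level function only builds the statements, and the wrapper only reports and runs them:

CREATE FUNCTION construct_example_statements() RETURNS SETOF text
   LANGUAGE sql AS
$$SELECT format('CREATE INDEX %I ON %I.%I (%I)',
                index_name, schema, table_name, column_name)
  FROM pgsql_stage.indexes
  WHERE migrate$$;

CREATE FUNCTION db_migrate_example() RETURNS void
   LANGUAGE plpgsql AS
$$DECLARE
   stmt text;
BEGIN
   FOR stmt IN SELECT construct_example_statements() LOOP
      RAISE NOTICE 'Executing: %', stmt;
      EXECUTE stmt;  -- the wrapper only orders and runs the statements
   END LOOP;
END;$$;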

fljdin commented 1 year ago

I suggest pausing this draft again so that I can explore another way to parallelize index creation, since that is my primary motivation.

laurenz commented 1 year ago

I'm fine with the changes so far.

What do you think is lacking in the way indexes are created? Whoever wants to parallelize that must do it herself. Or do you want to somehow introduce parallel processing?

fljdin commented 1 year ago

I was thinking about introducing an intermediate index function that takes a table or an index name as a parameter, as you already did for materializing a foreign table:

SELECT format('SELECT create_indexes(schema => %L, table_name => %L);', schema, table_name)
FROM pgsql_stage.tables t
WHERE migrate AND EXISTS (
    SELECT index_name FROM pgsql_stage.indexes i
    WHERE i.schema = t.schema
    AND i.table_name = t.table_name
    AND i.migrate
) \g indexes.sql
-- SELECT create_indexes(schema => 'sch1', table_name => 't1');
-- SELECT create_indexes(schema => 'sch1', table_name => 't2');

-- execute in parallel
\! parallel -a indexes.sql psql -d contrib_regression -c {}

laurenz commented 1 year ago

Hm. That last line results in an error here:

sh: line 1: parallel: command not found

I think actual parallelization is beyond the scope of db_migrator; it will depend too much on the operating system. The design idea of db_migrator is that it is purely written in PL/pgSQL, so it is totally portable. I'd prefer to leave it to the user to actually distribute the results of construct_indexes_statements() to be executed on several database connections in parallel.
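
For example (a rough sketch in plain psql, assuming that construct_indexes_statements() returns the statements in a statement column), a user could route the statements into a few buckets and run each bucket file in its own connection:

\set njobs 4
SELECT statement
FROM (SELECT statement, row_number() OVER () AS n
      FROM construct_indexes_statements(plugin => 'noop_migrator')) AS s
WHERE n % :njobs = 0 \g indexes_job0.sql
-- repeat with n % :njobs = 1, 2, 3 to produce the other files, then run
-- psql -d contrib_regression -f indexes_job0.sql (one session per file)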

laurenz commented 1 year ago

Thanks! I wish you would force-push only when necessary.

fljdin commented 1 year ago

Hello Laurenz,

I see that the PR has been merged; thank you for the review and your insightful comments. I sincerely hope that the implementation of these low-level functions is going in the right direction.

To come back to your last question, the index creation example relies on the GNU parallel tool to illustrate the kind of query that external tools would need from the low-level API.

Another, less elegant query that uses the log system would look like this:

SELECT format(
    $$SELECT execute_statements('create index', %L, %L, ARRAY[%s]);$$,
    schema_name, table_name, string_agg(format('$s$%s$s$', statement), ',')
) FROM construct_indexes_statements(plugin => 'noop_migrator')
GROUP BY schema_name, table_name;
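
Each resulting row would then look something like this (hypothetical schema, table and index names):

SELECT execute_statements('create index', 'sch1', 't1',
       ARRAY[$s$CREATE INDEX t1_idx ON sch1.t1 (c1)$s$]);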

In my opinion, another patch will be needed to complete this work. The materialize_foreign_table() method is currently unchanged, so it is not possible to retrieve the statements for table creation, foreign table renaming, or data import. I think that is necessary to allow more freedom in the use of the API. I'll develop my point of view in a new issue.

Regards, Florent

laurenz commented 1 year ago

I would not want this extension to depend on software like the GNU parallel system tool.

But I feel fine with you adding a chapter to the README that explains how a tool like that can be used to parallelize database migration, including some small scripts.

I don't see the necessity to add more low-level functions for materialize_foreign_table(). It is low-level enough as it is, and I think that it is easy enough for the user to list the foreign tables created by db_migrate_mkforeign() and call materialize_foreign_table() for the individual tables in concurrent database sessions.
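
For example (a sketch; the named parameters of materialize_foreign_table() are assumed here), one call per foreign table can be generated from information_schema and each resulting line run in a separate session:

SELECT format('SELECT materialize_foreign_table(schema => %L, table_name => %L);',
              foreign_table_schema, foreign_table_name)
FROM information_schema.foreign_tables
WHERE foreign_table_schema = 'testschema1' \g materialize.sql
-- then execute the lines of materialize.sql in concurrent psql sessions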