TonicAI / condenser

Condenser is a database subsetting tool
https://www.tonic.ai
MIT License
315 stars 47 forks source link

add routines flag to initial mysql dump #35

Closed derekyle closed 1 year ago

derekyle commented 2 years ago

Routines like functions and procedures don't dump by default when using the mysqldump command. This should dump all the routines data as well as the table schema. Without this, the synchronization fails near the end if you have any functions or procedures in your source database and of course you also don't sync over those routines.

Small note, I also replaced the cryptic -d flag with the more descriptive --no-data flag for clarity.

derekyle commented 1 year ago

I apologize for the confusion. I have not worked with github and pull requests much. I made a pull request using the master branch on my forked repo and assumed that the pull request was a snapshot in the moment the pull request was made. But it looks like my other changes have been added as I went along. You are correct, this pull request was meant only to add the necessary flags for mysql because otherwise the script will not work for any mysql database that has functions or procedures.

All my other changes are a work in progress for my needs. I could create a pull request for it a later date. Just FYI, the reason I have removed the db name prefixes from the config is that I am working on making changes so that the source and target table names get added by the script, from the config. Because my use case is to be able to run this in my kubernetes cluster to create a database subset from live data when we create containers for branches of code so that they can be tested using this more up to date data rather than testing on a mock database that has to be continually updated.

If you hard code the db name in front of every table name, then it would require us to change every prefix everytime we are launching a new testing environment since we run all of our testing databases in a single mysql instance. Same for developers on their local machine, they would all need to have the same database name and would not be able to run multiple test databases in a single mysql instance.

I will create a separate PR with my simple, parameters additions.

mwcaisse commented 1 year ago

No worries. Github asusmes you want to merge in all changes as part of the PR, even future ones. Which is nice as it allows you to respond to PR comments / feedback with out having to create a new PR/branch. In this case, you can cherry-pick the commits for the routine changes into a new branch based off of our master and create a new PR. I will close this one in the meantime.

As for the database names, you don't need to include the database names before the table name in the config file, just in the db_name in the connection info object. For example if you are using MySQL and have a database named MY_TEST_DATABASE and a table in it called TARGET_TABLE then this configuration will work currently without any changes:

{
    "initial_targets": [
        {
            "table": "TARGET_TABLE",
            "percent": 10
        }
    ],
    "db_type": "mysql",
    "source_db_connection_info": {
        "user_name": "user",
        "host": "host.host.com",
        "db_name": "MY_TEST_DATABASE",
        "port": 5432
    },
    "destination_db_connection_info": {
        "user_name": "user",
        "host": "host.host.com",
        "db_name": "MY_TEST_DATABASE",
        "port": 5432
    },
    "keep_disconnected_tables": false,
    "excluded_tables": [ ],
    "passthrough_tables": [ ],
    "dependency_breaks": [ ],
    "fk_augmentation": [ ],
    "upstream_filters": [ ]
 }

The example_all is an example for a PostgreSQL database, which has a slightly different structure than MySQL. the public. is a schema name not the database name. Fully qualified table identifiers for Postgres are <database-name>.<schema-name>,<table-name>. In MySQL it is just <database-name>.<tabke-name> as MySQL. Does that make sense?

Ah nevermind, I see the source of confusion, our documentation says to specify the table names with the database name in front of it. I think what I said still holds true and that isn't needed, but let me verify.

mwcaisse commented 1 year ago

Apologies, you are right, currently the table names need to be referenced with the database name infront of them on MySQL. (This behaves differently than other areas of our code base and I'm just starting to get up to speed on this project)

Having said that I do think it should work as described in my comment above, not having to specify the DB name in front of every table name in the configuration. Which making a few modifications to how we query / gather table and column information for MySQL to properly handle how schemas are handled in MySQL might be better than modifying the configuration. (i.e. update get_table_columns and list_all_tables in mysql_database_helper. That does involve changing a few things overall on how schemas are handed though)