Hi @mgierdal,
Yes. Although you can't exactly call it a "feature", it is part of how etlhelper is designed. The aim is to keep the code simple, and creating a table based on the given data is genuinely hard: you need to know the data type of each column and the specific commands for the target database. If your target database doesn't have the table already, there are a couple of options. One is to call `etlhelper.execute` with a "CREATE TABLE IF NOT EXISTS ..." statement in your script before you run `copy_table_rows`. Defining the table by hand is more work, but you get to specify exactly how you want it.

I hope that helps,
John
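
For illustration, a minimal sketch of that pattern (the `my_databases` module, password variable names and table definition below are placeholders, not part of etlhelper itself):

```python
from etlhelper import copy_table_rows, execute

# Hypothetical DbParams objects; replace with your own source and destination.
from my_databases import SOURCE_DB, DEST_DB

# Example table definition only - adjust columns and types to match your data.
create_sql = """
    CREATE TABLE IF NOT EXISTS my_table (
        id INTEGER PRIMARY KEY,
        name TEXT,
        value REAL
    )
"""

with SOURCE_DB.connect("SOURCE_PASSWORD") as src_conn:
    with DEST_DB.connect("DEST_PASSWORD") as dest_conn:
        # Create the destination table first, then copy the rows across.
        execute(create_sql, dest_conn)
        copy_table_rows("my_table", src_conn, dest_conn)
```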
Fair enough. Could you elaborate on how `copy_table_rows` deals with disagreements between the data from the source and the destination table - what fails and what is acceptable? Perhaps there could be an entry in the tutorial explaining that.

I found this package yesterday and have already used it in a project with considerable success, surprises aside. Another surprise is that `table_info()` seems to return field names in an order that sometimes (i.e. for certain tables) does not match the source table. This forced me to build the INSERT statement using `table_info` from the destination, which in turn was built using a hardcoded CREATE statement.
Thanks! Marcin
Thanks, Marcin, I'm glad that you are finding the code useful. The documentation is the weak point now and I do have plans to update to a proper documentation site. There is a ticket about that here: https://github.com/BritishGeologicalSurvey/etlhelper/issues/12. You can note anything else that you find confusing or think should be better explained there.
I hadn't noticed that about `table_info` column name ordering. The columns are returned in the order that they come from the database. `load` might help you here - if you pass dictionaries whose keys match the column names, it can write the INSERT statement for you.
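
A rough sketch of that dictionary-based `load` approach (the table name, connection and row values are just examples):

```python
from etlhelper import load

# Hypothetical DbParams object for the destination database.
from my_databases import DEST_DB

# Keys must match the destination column names; the order within each dict
# does not matter, because load() builds the INSERT statement from the keys.
rows = [
    {"id": 1, "name": "alpha", "value": 1.0},
    {"id": 2, "name": "beta", "value": 2.5},
]

with DEST_DB.connect("DEST_PASSWORD") as dest_conn:
    load("my_table", dest_conn, rows)
```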
In terms of dealing with data disagreements, you have a few options.
Create a function to use with `on_error`. It will collect all the failing rows in each chunk, together with their errors, so you can handle them within Python. The potential downside of this method is that it can be slow if you have many errors, as each row must be checked individually. See https://github.com/BritishGeologicalSurvey/etlhelper#handling-insert-errors
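
A rough sketch of such an `on_error` handler, following the pattern in the README section linked above (the `my_databases` module, password variable names and table name are placeholders):

```python
import logging

from etlhelper import copy_table_rows

# Hypothetical DbParams objects; replace with your own source and destination.
from my_databases import SOURCE_DB, DEST_DB

logger = logging.getLogger("copy")
errors = []

def log_errors(failed_rows):
    # Called after each chunk with a list of (row, exception) pairs that failed.
    for row, exception in failed_rows:
        logger.error("Could not insert %s: %s", row, exception)
        errors.append((row, exception))

with SOURCE_DB.connect("SOURCE_PASSWORD") as src_conn:
    with DEST_DB.connect("DEST_PASSWORD") as dest_conn:
        copy_table_rows("my_table", src_conn, dest_conn, on_error=log_errors)

# After the copy, `errors` holds the rows that need manual attention.
```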
Modify the INSERT statement to make the database handle the errors. I recommend having a look at the etl.py source code to see the available functions and how they work. `copy_table_rows` is just `iter_rows` followed by `load`, and `load` calls `generate_insert_sql` to work out the INSERT statement. You can call those functions individually and modify the INSERT statement so that the database handles the errors itself, e.g. it can replace the value, ignore it or update it, depending on the database you are using (https://stackoverflow.com/questions/690632/how-do-i-update-a-row-in-a-table-or-insert-it-if-it-doesnt-exist?noredirect=1&lq=1)
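
A rough sketch of that approach, assuming a PostgreSQL destination reached via psycopg2 (`%s` placeholders and `ON CONFLICT` are PostgreSQL-specific; the `my_databases` module, table and columns are placeholders):

```python
from etlhelper import executemany, iter_rows

# Hypothetical DbParams objects; replace with your own source and destination.
from my_databases import SOURCE_DB, DEST_DB

select_sql = "SELECT id, name, value FROM my_table"

# Hand-written INSERT instead of the one load() would generate, so the
# database itself can skip conflicting rows. Other databases have their own
# equivalents (e.g. INSERT OR IGNORE in SQLite, MERGE in Oracle / SQL Server).
insert_sql = """
    INSERT INTO my_table (id, name, value)
    VALUES (%s, %s, %s)
    ON CONFLICT (id) DO NOTHING
"""

with SOURCE_DB.connect("SOURCE_PASSWORD") as src_conn:
    with DEST_DB.connect("DEST_PASSWORD") as dest_conn:
        rows = iter_rows(select_sql, src_conn)
        executemany(insert_sql, dest_conn, rows)
```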
I'm about to push a minor tweak to the current README to highlight that the table is not created automatically.
Closed by #179
In my experience, `copy_table_rows()` requires the presence of a table in the destination DB. Including its creation in `copy_table_rows()`, or adding a separate function that creates it in the destination (or composes the CREATE statement) based on the source table, would be helpful.