Hi @mgierdal,
Yes. Although you can't exactly call it a "feature", it is part of how etlhelper is designed. The aim is to keep the code simple, and creating a table based on the given data is genuinely hard: you need to know the data type of each column and the specific commands for the target database. If your target database doesn't have the table already, there are a couple of options. One is to call `etlhelper.execute` with a "CREATE TABLE IF NOT EXISTS ..." statement in your script before you run `copy_table_rows`. Defining the table by hand is more work, but you get to specify exactly how you want it.

I hope that helps,
John
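
For illustration, a minimal sketch of that pattern (the `my_databases` module, password variable names and table definition below are placeholders, not part of etlhelper itself):

```python
from etlhelper import copy_table_rows, execute

# Hypothetical DbParams objects; replace with your own source and destination.
from my_databases import SOURCE_DB, DEST_DB

# Example table definition only - adjust columns and types to match your data.
create_sql = """
    CREATE TABLE IF NOT EXISTS my_table (
        id INTEGER PRIMARY KEY,
        name TEXT,
        value REAL
    )
"""

with SOURCE_DB.connect("SOURCE_PASSWORD") as src_conn:
    with DEST_DB.connect("DEST_PASSWORD") as dest_conn:
        # Create the destination table first, then copy the rows across.
        execute(create_sql, dest_conn)
        copy_table_rows("my_table", src_conn, dest_conn)
```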
Fair enough. Could you elaborate on how `copy_table_rows` deals with disagreements between the data from the source and the destination table - what fails and what is acceptable? Perhaps there could be an entry in the tutorial explaining that.

I found this package yesterday and have already used it in a project with considerable success, surprises aside. Another surprise is that `table_info()` seems to return field names in an order that sometimes (i.e. for certain tables) does not match the source table. This forced me to build the INSERT statement using `table_info` from the destination, which in turn was built using a hardcoded CREATE statement.
Thanks! Marcin
Thanks, Marcin, I'm glad that you are finding the code useful. The documentation is the weak point now and I do have plans to update to a proper documentation site. There is a ticket about that here: https://github.com/BritishGeologicalSurvey/etlhelper/issues/12. You can note anything else that you find confusing or think should be better explained there.
I hadn't noticed that about `table_info` column name ordering. The columns are returned in the order that they come from the database. `load` might help you here - if you pass dictionaries whose keys match the column names, it can write the INSERT statement for you.
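
A rough sketch of that dictionary-based `load` approach (the table name, connection and row values are just examples):

```python
from etlhelper import load

# Hypothetical DbParams object for the destination database.
from my_databases import DEST_DB

# Keys must match the destination column names; the order within each dict
# does not matter, because load() builds the INSERT statement from the keys.
rows = [
    {"id": 1, "name": "alpha", "value": 1.0},
    {"id": 2, "name": "beta", "value": 2.5},
]

with DEST_DB.connect("DEST_PASSWORD") as dest_conn:
    load("my_table", dest_conn, rows)
```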
In terms of dealing with data disagreements, you have a few options.
Create a function to use with `on_error`. It will collect all the failing rows in each chunk, together with their errors, so you can handle them within Python. The potential downside of this method is that it can be slow if you have many errors, as each row must be checked individually. See https://github.com/BritishGeologicalSurvey/etlhelper#handling-insert-errors
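
A rough sketch of such an `on_error` handler, following the pattern in the README section linked above (the `my_databases` module, password variable names and table name are placeholders):

```python
import logging

from etlhelper import copy_table_rows

# Hypothetical DbParams objects; replace with your own source and destination.
from my_databases import SOURCE_DB, DEST_DB

logger = logging.getLogger("copy")
errors = []

def log_errors(failed_rows):
    # Called after each chunk with a list of (row, exception) pairs that failed.
    for row, exception in failed_rows:
        logger.error("Could not insert %s: %s", row, exception)
        errors.append((row, exception))

with SOURCE_DB.connect("SOURCE_PASSWORD") as src_conn:
    with DEST_DB.connect("DEST_PASSWORD") as dest_conn:
        copy_table_rows("my_table", src_conn, dest_conn, on_error=log_errors)

# After the copy, `errors` holds the rows that need manual attention.
```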
Modify the INSERT statement to make the database handle the errors. I recommend having a look at the etl.py source code to see the available functions and how they work. `copy_table_rows` is just `iter_rows` followed by `load`, and `load` calls `generate_insert_sql` to work out the INSERT statement. You can call those functions individually and modify the INSERT statement so that the database handles the errors itself, e.g. it can replace the value, ignore it or update it, depending on the database you are using (https://stackoverflow.com/questions/690632/how-do-i-update-a-row-in-a-table-or-insert-it-if-it-doesnt-exist?noredirect=1&lq=1)
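
A rough sketch of that approach, assuming a PostgreSQL destination reached via psycopg2 (`%s` placeholders and `ON CONFLICT` are PostgreSQL-specific; the `my_databases` module, table and columns are placeholders):

```python
from etlhelper import executemany, iter_rows

# Hypothetical DbParams objects; replace with your own source and destination.
from my_databases import SOURCE_DB, DEST_DB

select_sql = "SELECT id, name, value FROM my_table"

# Hand-written INSERT instead of the one load() would generate, so the
# database itself can skip conflicting rows. Other databases have their own
# equivalents (e.g. INSERT OR IGNORE in SQLite, MERGE in Oracle / SQL Server).
insert_sql = """
    INSERT INTO my_table (id, name, value)
    VALUES (%s, %s, %s)
    ON CONFLICT (id) DO NOTHING
"""

with SOURCE_DB.connect("SOURCE_PASSWORD") as src_conn:
    with DEST_DB.connect("DEST_PASSWORD") as dest_conn:
        rows = iter_rows(select_sql, src_conn)
        executemany(insert_sql, dest_conn, rows)
```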
I'm about to push a minor tweak to the current README to highlight that the table is not created automatically.
Closed by #179
In my experience, `copy_table_rows()` requires the presence of a table in the destination DB. Including its creation in `copy_table_rows()`, or adding a separate function that creates it in the destination (or composes the CREATE statement) based on the source table, would be helpful.