activewarehouse / activewarehouse-etl

Extract-Transform-Load library from ActiveWarehouse
MIT License
240 stars 102 forks source link

File destination handling of booleans #30

Open lgustafson opened 13 years ago

lgustafson commented 13 years ago

A typical use case of the "file destination" (lib/etl/control/destination/file_destination.rb) is to populate a file for bulk loading into a database. However, there is an issue when using SCD workflows with this type of destination. When defining a type 2 SCD, you must supply the name of the column that holds a boolean indicating whether it's the latest version of the row. Accordingly, during the SCD workflow activewarehouse-etl sets the column to true or false in the pipeline.

The trouble is, when the column is written to the file, true and false get written as "true" and "false" because that's the result of true.to_s or false.to_s. These are not valid values for a MySQL BOOLEAN, so the subsequent bulk load fails.

The easy solution is to modify file_destination.rb to coerce TrueClass and FalseClass to "1" and "0", respectively.

However, this may not be the best solution since there are other use cases for a file destination and perhaps others are depending on booleans being written as "true" or "false".

thbar commented 13 years ago

Thanks for reporting this, first! This raises a good bunch of questions :)

If the other supported databases (afaik: SqlServer and Postgresql) support 0/1 for false/true on bulk-load, then I would find it neat to have boolean values mapped to 0/1, for all kinds of columns (not only the latest SCD system-column). My understanding is that it would be a good expected behaviour.

If we do so, the user would still be able to remap the column using a transform (&:to_s) if he prefers to get "true"/"false". In all cases, this would be a breaking change and would need to be advertised as such in the next release.

Would you be able to provide a patch with your fix and tests, tested on MySQL ?

If you do so:

thbar commented 13 years ago

Interesting commit: https://github.com/koconnor/activewarehouse-etl/commit/2c87d9de9daa2e7aea43512dcee81d40ac73b683

(may only be appropriate for mysql)

lgustafson commented 13 years ago

Yes, that is a possible solution, though I have no idea if that might break if one were to use a database destination. A change would also need to be made to #process_scd_match.

This may very well be the most practical approach. Theoretically though, representing true/false as 1/0 in Ruby bothers me, since as integers they both evaluate to true.

Another possibility would be to add a configuration to the file destination that allows one to provide the desired string representation of true and false, since it could vary by database implementation.