Middlecon / DBImport

DBImport ingestion tool. Handle import, export and standard ETL flows in Hadoop/Hive
Apache License 2.0
16 stars 6 forks source link

Oracle Flashback imports get Merge cardinality_violation #31

Closed BerryOsterlund closed 5 years ago

BerryOsterlund commented 5 years ago

If a row is deleted, inserted, deleted again then inserted one more time and at the last point, and updates aswell in the same windows between SCN timestamps, Hive Merge command will generate a cardinality_violation error. The reason for this is that the current logic in DBImport will import the Deleted and Updated rows and that will result in 3 rows with the same PK.

BerryOsterlund commented 5 years ago

VERSIONS_STARTSCN column was added to the sqoop config for Oracle Flashback imports. This column is later used for a "select , max(VERSIONS_STARTSCN).... GROUP BY " to get the Maximuln SCN for each and every PK. This list is then joined against it self to only get the rows with the unique PK and the highest VERSIONS_STARTSCN. The join was implemented in the view based on the import table.

As VERSIONS_STARTSCN has to be cast into a BigInt, there will always be a View so it's safe to use this option.