ExpediaGroup / circus-train

Circus Train is a dataset replication tool that copies Hive tables between clusters and clouds.
Apache License 2.0
86 stars 15 forks source link

Circus Train uses the wrong database for housekeeping #129

Closed AnanaMJ closed 5 years ago

AnanaMJ commented 5 years ago

Steps for reproducing:

Expected outcome: In the second step, Circus Train should use schema_2 to schedule paths for housekeeping.

Current outcome: In the second step, Circus Train uses schema_1 to find tables

INFO ast.ASTQueryTranslatorFactory: HHH000397: Using ASTQueryTranslatorFactory
INFO hbm2ddl.SchemaUpdate: HHH000228: Running hbm2ddl schema update
INFO hbm2ddl.SchemaUpdate: HHH000102: Fetching database metadata
INFO hbm2ddl.SchemaUpdate: HHH000396: Updating schema
INFO hbm2ddl.TableMetadata: HHH000261: Table found: schema_1.audit_revision
INFO hbm2ddl.TableMetadata: HHH000037: Columns: [id, timestamp]
INFO hbm2ddl.TableMetadata: HHH000108: Foreign keys: []
INFO hbm2ddl.TableMetadata: HHH000126: Indexes: [primary]
INFO hbm2ddl.TableMetadata: HHH000261: Table found: schema_1.legacy_replica_path
INFO hbm2ddl.TableMetadata: HHH000037: Columns: [metastore_database_name, path, event_id, creation_timestamp, path_event_id, id, metastore_table_name]
INFO hbm2ddl.TableMetadata: HHH000108: Foreign keys: []
INFO hbm2ddl.TableMetadata: HHH000126: Indexes: [primary]
INFO hbm2ddl.SchemaUpdate: HHH000232: Schema update complete

and then uses schema_2 to schedule paths for deletion:

INFO replica.HousekeepingCleanupLocationManager: Scheduling old replica data for deletion for event event-id: s3://bucket/location/event-id
WARN spi.SqlExceptionHelper: SQL Error: 1146, SQLState: 42S02
ERROR spi.SqlExceptionHelper: Table 'schema_2.legacy_replica_path' doesn't exist
WARN spi.SqlExceptionHelper: SQL Warning Code: 1146, SQLState: 42S02
WARN spi.SqlExceptionHelper: Table 'schema_2.legacy_replica_path' doesn't exist
ERROR core.Locomotive: Failed to replicate: source-metastore:source_db.source_table to replica-metastore:replica_db.replica_table.
AnanaMJ commented 5 years ago

More details about the issue:

It looks like hibernate resolves the wrong schema when initially validating tables. This happens when using both update and validate for spring.jpa.hibernate.ddl-auto.

There are two databases which are supposed to have the same tables with the same schema. schema_1 and schema_2 are both created in a MySQL instance, with no tables (hibernate is supposed to create the tables).

If hibernate runs the first time with spring.jpa.properties.hibernate.default_schema: schema_1, then the tables are created there. Then, if hibernate runs the second time with spring.jpa.properties.hibernate.default_schema: schema_2, hibernate validates schema_1's tables, and tries to insert data into schema_2's tables, which are not yet created, resulting in an error after all data has been replicated by Circus Train:

INFO  org.hibernate.Version:54 - HHH000412: Hibernate Core {4.3.11.Final}
INFO  org.hibernate.cfg.Environment:239 - HHH000206: hibernate.properties not found
INFO  org.hibernate.cfg.Environment:346 - HHH000021: Bytecode provider name : javassist
INFO  org.hibernate.annotations.common.Version:66 - HCANN000001: Hibernate Commons Annotations {4.0.5.Final}
INFO  org.hibernate.dialect.Dialect:145 - HHH000400: Using dialect: org.hibernate.dialect.MySQL5Dialect
INFO  org.hibernate.hql.internal.ast.ASTQueryTranslatorFactory:47 - HHH000397: Using ASTQueryTranslatorFactory
INFO  org.hibernate.tool.hbm2ddl.SchemaUpdate:207 - HHH000228: Running hbm2ddl schema update
INFO  org.hibernate.tool.hbm2ddl.SchemaUpdate:218 - HHH000102: Fetching database metadata
INFO  org.hibernate.tool.hbm2ddl.SchemaUpdate:230 - HHH000396: Updating schema
INFO  org.hibernate.tool.hbm2ddl.TableMetadata:65 - HHH000261: Table found: custom_schema_1.audit_revision
INFO  org.hibernate.tool.hbm2ddl.TableMetadata:66 - HHH000037: Columns: [id, timestamp]
INFO  org.hibernate.tool.hbm2ddl.TableMetadata:68 - HHH000108: Foreign keys: []
INFO  org.hibernate.tool.hbm2ddl.TableMetadata:69 - HHH000126: Indexes: [primary]
INFO  org.hibernate.tool.hbm2ddl.TableMetadata:65 - HHH000261: Table found: custom_schema_1.legacy_replica_path
INFO  org.hibernate.tool.hbm2ddl.TableMetadata:66 - HHH000037: Columns: [metastore_database_name, path, event_id, creation_timestamp, path_event_id, id, metastore_table_name]
INFO  org.hibernate.tool.hbm2ddl.TableMetadata:68 - HHH000108: Foreign keys: []
INFO  org.hibernate.tool.hbm2ddl.TableMetadata:69 - HHH000126: Indexes: [primary]
INFO  org.hibernate.tool.hbm2ddl.SchemaUpdate:267 - HHH000232: Schema update complete

...

INFO  com.hotels.bdp.circustrain.core.replica.HousekeepingCleanupLocationManager:61 - Scheduling old replica data for deletion for event ctp-20190503t100231.183z-cuppcmtd: s3://hcom-data-lab-bdp-applications/bdp_home/usa_names_no_filter_copy/ctp-20190503t095740.735z-qzbljkyl/ctp-20180907t104444.255z-uynosgiq/4b32923f-057e-4e3f-a789-bc607a3676f3
WARN  org.hibernate.engine.jdbc.spi.SqlExceptionHelper:144 - SQL Error: 1146, SQLState: 42S02
ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper:146 - Table 'custom_schema_2.legacy_replica_path' doesn't exist
WARN  org.hibernate.engine.jdbc.spi.SqlExceptionHelper:232 - SQL Warning Code: 1146, SQLState: 42S02
WARN  org.hibernate.engine.jdbc.spi.SqlExceptionHelper:233 - Table 'custom_schema_2.legacy_replica_path' doesn't exist
ERROR com.hotels.bdp.circustrain.core.Locomotive:121 - Failed to replicate: shared-metastore:bdp.usa_names_no_filter to local-metastore:bdp_home.usa_names_no_filter_copy.
com.hotels.bdp.circustrain.api.CircusTrainException: Unable to replicate

However, if hibernate runs the first time with spring.jpa.properties.hibernate.default_schema: schema_2 (before the tables in schema_1 are created by hibernate), then the tables are created correctly and the job runs successfully.


As an easy workaround, the Circus Train job can be run with the following line added to the configuration:

spring.jpa.hibernate.ddl-auto: create

The first run creates the tables in schema_2 and the Circus Train job is then able to schedule paths to the housekeeping database for later use. Then, the added configuration line should be removed and the other subsequent runs will use the correct database (schema_2) to schedule paths for housekeeping, but still use the wrong one (schema_1) to validate the schema. The configuration like with create should be removed because, if kept for subsequent runs, hibernate will drop the existing table and create a new one each time.

AnanaMJ commented 5 years ago

A fix for this is upgrading to Hibernate 5.1.0.Final.

We had to remove the @Audit annotations for legacy_replica_path, because hibernate was trying to re-create foreign keys and failing.

One problem with the upgrade, however, is that when running with spring.jpa.properties.hibernate.hbm2ddl.auto: update, even after tables are created, the following warnings come up (for each table that it validates):

INFO  org.hibernate.Version:37 - HHH000412: Hibernate Core {5.1.0.Final}
INFO  org.hibernate.cfg.Environment:213 - HHH000206: hibernate.properties not found
INFO  org.hibernate.cfg.Environment:317 - HHH000021: Bytecode provider name : javassist
INFO  org.hibernate.annotations.common.Version:66 - HCANN000001: Hibernate Commons Annotations {5.0.1.Final}
WARN  org.hibernate.orm.deprecation:786 - HHH90000006: Attempted to specify unsupported NamingStrategy via setting [hibernate.ejb.naming_strategy]; NamingStrategy has been removed in favor of the split ImplicitNamingStrategy and PhysicalNamingStrategy; use [hibernate.implicit_naming_strategy] or [hibernate.physical_naming_strategy], respectively, instead.
INFO  com.zaxxer.hikari.HikariDataSource:95 - HikariPool-1 - Starting...
INFO  com.zaxxer.hikari.HikariDataSource:107 - HikariPool-1 - Start completed.
INFO  org.hibernate.dialect.Dialect:156 - HHH000400: Using dialect: org.hibernate.dialect.MySQL5Dialect
INFO  org.hibernate.envers.boot.internal.EnversServiceImpl:91 - Envers integration enabled? : true
INFO  org.hibernate.tool.schema.extract.internal.InformationExtractorJdbcDatabaseMetaDataImpl:365 - HHH000262: Table not found: hibernate_sequence
DEBUG org.hibernate.SQL:92 - create table hibernate_sequence (next_val bigint)
WARN  org.hibernate.tool.schema.internal.ExceptionHandlerLoggedImpl:27 - GenerationTarget encountered exception accepting command : Unable to execute command [create table hibernate_sequence (next_val bigint)]
org.hibernate.tool.schema.spi.CommandAcceptanceException: Unable to execute command [create table hibernate_sequence (next_val bigint)]
    at org.hibernate.tool.schema.internal.exec.GenerationTargetToDatabase.accept(GenerationTargetToDatabase.java:63)
    at org.hibernate.tool.schema.internal.SchemaMigratorImpl.applySqlString(SchemaMigratorImpl.java:567)
    at org.hibernate.tool.schema.internal.SchemaMigratorImpl.applySqlStrings(SchemaMigratorImpl.java:551)
    at org.hibernate.tool.schema.internal.SchemaMigratorImpl.createTable(SchemaMigratorImpl.java:339)
    at org.hibernate.tool.schema.internal.SchemaMigratorImpl.performMigration(SchemaMigratorImpl.java:257)
    at org.hibernate.tool.schema.internal.SchemaMigratorImpl.doMigration(SchemaMigratorImpl.java:137)
    at org.hibernate.tool.schema.internal.SchemaMigratorImpl.doMigration(SchemaMigratorImpl.java:110)
    at org.hibernate.tool.schema.spi.SchemaManagementToolCoordinator.performDatabaseAction(SchemaManagementToolCoordinator.java:176)
    at org.hibernate.tool.schema.spi.SchemaManagementToolCoordinator.process(SchemaManagementToolCoordinator.java:64)
    at org.hibernate.internal.SessionFactoryImpl.<init>(SessionFactoryImpl.java:458)
    at org.hibernate.boot.internal.SessionFactoryBuilderImpl.build(SessionFactoryBuilderImpl.java:465)
    at org.hibernate.jpa.boot.internal.EntityManagerFactoryBuilderImpl.build(EntityManagerFactoryBuilderImpl.java:881)
    at org.springframework.orm.jpa.vendor.SpringHibernateJpaPersistenceProvider.createContainerEntityManagerFactory(SpringHibernateJpaPersistenceProvider.java:60)
    at org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean.createNativeEntityManagerFactory(LocalContainerEntityManagerFactoryBean.java:340)
    at org.springframework.orm.jpa.AbstractEntityManagerFactoryBean.afterPropertiesSet(AbstractEntityManagerFactoryBean.java:319)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.invokeInitMethods(AbstractAutowireCapableBeanFactory.java:1641)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1578)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:545)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:482)
    at org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:306)
    at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:230)
    at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:302)
    at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:197)
    at org.springframework.context.support.AbstractApplicationContext.getBean(AbstractApplicationContext.java:1054)
    at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:829)
    at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:538)
    at org.springframework.boot.SpringApplication.refresh(SpringApplication.java:760)
    at org.springframework.boot.SpringApplication.createAndRefreshContext(SpringApplication.java:360)
    at org.springframework.boot.SpringApplication.run(SpringApplication.java:306)
    at com.hotels.bdp.circustrain.CircusTrain.main(CircusTrain.java:117)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
Caused by: java.sql.SQLSyntaxErrorException: Table 'hibernate_sequence' already exists
    at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:120)
    at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97)
    at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
    at com.mysql.cj.jdbc.StatementImpl.executeInternal(StatementImpl.java:782)
    at com.mysql.cj.jdbc.StatementImpl.execute(StatementImpl.java:666)
    at com.zaxxer.hikari.pool.ProxyStatement.execute(ProxyStatement.java:98)
    at com.zaxxer.hikari.pool.HikariProxyStatement.execute(HikariProxyStatement.java)
    at org.hibernate.tool.schema.internal.exec.GenerationTargetToDatabase.accept(GenerationTargetToDatabase.java:51)
    ... 35 more
DEBUG org.hibernate.SQL:92 - insert into hibernate_sequence values ( 1 )
INFO  org.hibernate.tool.schema.extract.internal.InformationExtractorJdbcDatabaseMetaDataImpl:365 - HHH000262: Table not found: hibernate_sequence

These are just warnings, so Circus Train runs successfully (and it doesn't delete data in the housekeeping tables in between runs), but the warnings are not valid. The tables do exist.

It also tries to drop and add a constraint, and I'm not sure what's happening with that:

INFO org.hibernate.tool.schema.extract.internal.InformationExtractorJdbcDatabaseMetaDataImpl:365 - HHH000262: Table not found: legacy_replica_path
DEBUG org.hibernate.SQL:92 - alter table legacy_replica_path drop constraint UKtlauv1txmi2vux3fqp1y1ueav
DEBUG org.hibernate.SQL:92 - alter table legacy_replica_path add constraint UKtlauv1txmi2vux3fqp1y1ueav unique (path, creation_timestamp)
INFO  org.hibernate.tool.schema.extract.internal.InformationExtractorJdbcDatabaseMetaDataImpl:365 - HHH000262: Table not found: legacy_replica_path
AnanaMJ commented 5 years ago

After further investigations into the AWS documentation, we discovered that a MariaDB driver should be used with MySQL instance on EMR or Amazon RDS for MySQL.

Trying different MySQL driver versions didn't work, and it looks like MariaDB driver (1.8.0 - for Java 7) solves our issues:


The changes were as following:

patduin commented 5 years ago

The MariaDB driver is also what AWS recommends for Aurora(MySQL). See https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Connecting.html

AnanaMJ commented 5 years ago

After looking into the hive-site.xml file on an EMR, we also noticed that the driver used is MariaDB:

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.mariadb.jdbc.Driver</value>
</property>

which is used for

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hostname:3306/hive?createDatabaseIfNotExist=true</value>
</property>

The jar it currently uses seems to be mariadb-java-client-2.2.1.jar, which is fairly recent.