ExpediaGroup / circus-train

Circus Train is a dataset replication tool that copies Hive tables between clusters and clouds.
Apache License 2.0

Housekeeping requires `circus_train` database #111

Closed baumandm closed 5 years ago

baumandm commented 5 years ago

I discovered an issue with changing the default Housekeeping schema name from circus_train to anything else.

Test command: `/opt/circus-train/bin/circus-train.sh --config=/mnt/circus-train/jobs/test.yml --modules=replication --housekeeping.data-source.username=$HK_USER --housekeeping.data-source.password=$HK_PASSWORD`

Test YAML file:

```
instance:
  name: jetstream-ct
  home: /mnt/circus-train
logging:
  config: file:${instance.home}/conf/log4j.xml
housekeeping:
  schema-name: housekeeping
  data-source:
    driver-class-name: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://host:3306/housekeeping
copier-options:
  tmp-dir: "hdfs:///tmp/dsp-lab-jetstream-ct/"
  file-attribute: "replication, blocksize, user, group, permission, checksumtype"
  canned-acl: "bucket-owner-full-control"
  region: "us-west-2"
  max-maps: 50
  s3-server-side-encryption: true
  copier-factory-class: "com.hotels.bdp.circustrain.s3mapreducecpcopier.S3MapReduceCpCopierFactory"
source-catalog:
  name: "mauihdp"
  hive-metastore-uris: "thrift://host:9083"
replica-catalog:
  name: "apiary-lab-hms-us-west-2"
  hive-metastore-uris: "thrift://host.lcl:9083"
table-replications:
  -
    source-table:
      database-name: "dm"
      table-name: "ar_typ_dim"
      generate-partition-filter: true
    replica-table:
      database-name: "dm"
      table-name: "ar_typ_dim"
      table-location: "s3://bucket/ar_typ_dim"
sns-event-listener:
  region: "us-west-2"
  topic: "arn:aws:sns:us-west-2:<topic>"
  subject: "CircusTrainStatus"
  headers:
    requestId: "api13cda12a-18f8-11e9-9be1-abf8ece1bc04"
    databaseTableName: "dm.ar_typ_dim"
    route: "route"
```

As shown above, housekeeping has been reconfigured to use a schema named `housekeeping` in a MySQL server.

When this schema does not exist, Circus Train fails with the following error:

```
19/01/15 11:16:13 ERROR boot.SpringApplication: Application startup failed
org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'org.springframework.boot.autoconfigure.orm.jpa.HibernateJpaAutoConfiguration': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private javax.sql.DataSource org.springframework.boot.autoconfigure.orm.jpa.JpaBaseConfiguration.dataSource; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'housekeepingDataSource' defined in class path resource [com/hotels/housekeeping/HousekeepingConfiguration.class]: Initialization of bean failed; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'dataSourceInitializer': Invocation of init method failed; nested exception is org.springframework.jdbc.datasource.init.ScriptStatementFailedException: Failed to execute SQL script statement #1 of class path resource [schema.sql]: CREATE SCHEMA IF NOT EXISTS circus_train; nested exception is com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Access denied for user 'jetstream_rw'@'%' to database 'circus_train'
```
[ct-failed.txt](https://github.com/HotelsDotCom/circus-train/files/2761166/ct-failed.txt)

After creating the `circus_train` schema, the Circus Train command above works. It then writes to the configured `housekeeping` schema and leaves the newly created `circus_train` schema empty:

```
mysql> use circus_train;
Database changed

mysql> show tables;
Empty set (0.00 sec)

mysql> use housekeeping;
Database changed

mysql> show tables;
+------------------------+
| Tables_in_housekeeping |
+------------------------+
| audit_revision         |
| legacy_replica_path    |
+------------------------+
2 rows in set (0.01 sec)
```

It appears that the initial schema check ignores the configured `housekeeping.schema-name` value and always targets the `circus_train` schema: the failing statement in `schema.sql` is `CREATE SCHEMA IF NOT EXISTS circus_train`. Once that schema exists, Circus Train no longer fails in the `dataSourceInitializer`, and the rest of the housekeeping code uses the correct schema as configured.
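To illustrate the behaviour I would expect, here is a minimal sketch of an initializer that derives the `CREATE SCHEMA` statement from the configured name instead of hard-coding it. This is not the actual Housekeeping code: the class `HousekeepingSchemaSketch`, both method names, and the identifier whitelist are hypothetical. Only the `housekeeping.schema-name` property, the `circus_train` default, and the SQL statement come from the report above.

```java
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: not Circus Train's or Housekeeping's actual code.
final class HousekeepingSchemaSketch {

  // Default from the issue: the schema the failing script always targets.
  static final String DEFAULT_SCHEMA = "circus_train";

  // Assumed conservative whitelist so the name is safe to concatenate into DDL.
  private static final Pattern SAFE_IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

  // Looks up housekeeping.schema-name, falling back to the default when unset or blank.
  static String resolveSchemaName(Map<String, String> properties) {
    String configured = properties.get("housekeeping.schema-name");
    if (configured == null || configured.trim().isEmpty()) {
      return DEFAULT_SCHEMA;
    }
    return configured.trim();
  }

  // Renders the initialization statement for the given schema name.
  static String createSchemaStatement(String schemaName) {
    if (!SAFE_IDENTIFIER.matcher(schemaName).matches()) {
      throw new IllegalArgumentException("Unsafe schema name: " + schemaName);
    }
    return "CREATE SCHEMA IF NOT EXISTS " + schemaName + ";";
  }

  public static void main(String[] args) {
    Map<String, String> configured = Map.of("housekeeping.schema-name", "housekeeping");
    // prints: CREATE SCHEMA IF NOT EXISTS housekeeping;
    System.out.println(createSchemaStatement(resolveSchemaName(configured)));
    // prints: CREATE SCHEMA IF NOT EXISTS circus_train;
    System.out.println(createSchemaStatement(resolveSchemaName(Map.of())));
  }
}
```

With something along these lines, the test configuration above would only ever touch the `housekeeping` schema, so the MySQL user would not need any privileges on `circus_train`.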