GoogleCloudDataproc / initialization-actions

Run in all nodes of your cluster before the cluster starts - lets you customize your cluster
https://cloud.google.com/dataproc/init-actions
Apache License 2.0

Add Sqoop #268

Closed: szewi closed this issue 4 years ago

szewi commented 6 years ago

We should add Sqoop to Dataproc.

Sqoop is a tool for importing/exporting data between different types of databases; it also lets you import data into Hadoop to run MR jobs on it. There is a working cloud-sql-proxy init action, so we should configure Sqoop to pull data from Cloud SQL, and we should also enable Sqoop jobs for Bigtable.

Bidyadharbarik84 commented 5 years ago

Can you upload a sample script for using Sqoop on Cloud Dataproc?

szewi commented 5 years ago

@Bidyadharbarik84 Please take a look at this example workflow. I assume you have copied sqoop.sh from the PR to your GCS bucket:

  1. Create a Cloud SQL MySQL instance and name it test-sql
  2. Create a Bigtable development instance and name it test-bigtable
  3. Create a cluster with Sqoop:
    gcloud dataproc clusters create sqoop-single --single-node \
       --initialization-actions gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh,gs://dataproc-initialization-actions/hive-hcatalog/hive-hcatalog.sh,gs://dataproc-initialization-actions/bigtable/bigtable.sh,gs://<your-bucket-with-sqoop-init-action>/sqoop.sh \
       --scopes cloud-platform --properties hive:hive.metastore.warehouse.dir=gs://<your-bucket-for-hive>/hive-warehouse \
       --metadata "hive-metastore-instance=<your-project>:<your-region>:test-sql" \
       --metadata "bigtable-project=<your-project>" --metadata "bigtable-instance=test-bigtable"
  4. Log in: gcloud compute ssh sqoop-single-m
  5. Clone a test database: git clone https://github.com/datacharmer/test_db and cd test_db
  6. Upload the test_db data to Cloud SQL with the command mysql -u root < employees.sql
  7. Run the Sqoop import:
    /usr/lib/sqoop/bin/sqoop import --connect jdbc:mysql://localhost/employees --username root \
       --table employees --columns "emp_no,first_name" --hbase-table employees \
       --column-family my-column-family --hbase-row-key emp_no -m 1 --hbase-create-table

This command imports the selected columns from the Cloud SQL employees database into the Bigtable instance using the HBase binaries. You should see something like this in the output:

19/02/19 16:01:50 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 20.5951 seconds (0 bytes/sec)
19/02/19 16:01:50 INFO mapreduce.ImportJobBase: Retrieved 300024 records.
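
To spot-check the imported rows on the cluster, you can scan the table from the HBase shell (a quick sketch, using the table and column family from step 7 and assuming the bigtable init action has pointed hbase shell at the test-bigtable instance):

echo "scan 'employees', {LIMIT => 2}" | hbase shell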

Hope this helps.

martijnvdgrift commented 5 years ago

This is a gcloud command that submits a Sqoop job to a Dataproc cluster:

gcloud dataproc jobs submit hadoop \
--cluster=incremental-imports-martijn-sqoop \
--class=org.apache.sqoop.Sqoop \
--jars=gs://mybucket/jars/sqoop-1.4.7-hadoop260.jar,gs://mybucket/jars/avro-tools-1.8.2.jar,file:///usr/share/java/mysql-connector-java-5.1.42.jar \
-- import -Dmapreduce.job.user.classpath.first=true \
--connect=jdbc:mysql://mydbhost:3306/database \
--username=hadoop \
--password-file=gs://mybucket/creds/passwordFile.txt \
--target-dir=gs://mybucket/dump/mysql_output \
--table=mytable \
--as-avrodatafile
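
The --password-file referenced above has to exist in GCS beforehand; a sketch of staging it with gsutil (the bucket path is a placeholder, and note that Sqoop uses the file's entire contents as the password, so avoid a trailing newline):

printf '%s' 'MySecretPassword' | gsutil cp - gs://mybucket/creds/passwordFile.txt
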
Bidyadharbarik84 commented 5 years ago

Thanks for the information, let me try.

Bidyadharbarik84 commented 5 years ago

I'm unable to import data from MySQL to Hive using Sqoop on a Dataproc cluster:

sqoop import --connect "jdbc:mysql://localhost:3306/test" \
    --username dev2 --password dev2 --table test_09  \
    --target-dir gs://testbuket  --fields-terminated-by "," \
    --hive-import --create-hive-table --hive-table default.test_09 -m 1

Error:

Warning: /usr/lib/sqoop/bin/../../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /usr/lib/sqoop/bin/../../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /usr/lib/sqoop/bin/../../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/03/25 10:29:08 INFO sqoop.Sqoop: Running Sqoop version: 1.5.0-SNAPSHOT
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/sqoop/build/ivy/lib/sqoop/test/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/sqoop/build/ivy/lib/sqoop/test/logback-classic-1.0.9.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/sqoop/build/ivy/lib/sqoop/test/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/03/25 10:29:08 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/03/25 10:29:08 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
19/03/25 10:29:08 INFO tool.CodeGenTool: Beginning code generation
19/03/25 10:29:08 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
        at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:874)
        at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:59)
        at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:762)
        at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:785)
        at org.apache.sqoop.manager.SqlManager.getColumnInfoForRawQuery(SqlManager.java:288)
        at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:259)
        at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:245)
        at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:333)
        at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1879)
        at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1672)
        at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:106)
        at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:515)
        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:633)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:146)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:182)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:233)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:242)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:251)
szewi commented 5 years ago

@Bidyadharbarik84 Thanks for submitting this. The root cause of the error is a missing MySQL JAR on the classpath. The easiest fix is to additionally pass gs://dataproc-initialization-actions/hive-hcatalog/hive-hcatalog.sh at the very end of the initialization-actions list when creating a cluster. Example below:

gcloud dataproc clusters create sqoop \
    --initialization-actions gs://<sqoop-init-action>,gs://dataproc-initialization-actions/hive-hcatalog/hive-hcatalog.sh

You can also try symlinking the MySQL JAR into the Sqoop lib dir: ln -s /usr/share/java/mysql.jar /usr/lib/sqoop/lib/mysql.jar
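
If you don't want to SSH in manually, the same symlink can be created over SSH from your workstation (a sketch; sqoop-single-m is the master name from the earlier example, substitute your own):

gcloud compute ssh sqoop-single-m --command='sudo ln -sf /usr/share/java/mysql.jar /usr/lib/sqoop/lib/mysql.jar'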

Bidyadharbarik84 commented 5 years ago

Hello Szewi,

I have created the symlink to the MySQL JAR but am getting the error below:

main ERROR Could not register mbeans java.security.AccessControlException: access denied ("javax.management.MBeanTrustPermission" "register")
        at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)

Please help me. Thanks in advance.

Bidyadharbarik84 commented 5 years ago

Hi, I am trying to import data from Cloud SQL (MySQL) into the Hive warehouse using a Dataproc Hadoop cluster. Could you please provide a sample script for that? Thanks.

Bidyadharbarik84 commented 5 years ago

After running the sqoop import command I'm getting the errors below. Could you please help?

19/04/02 07:05:05 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `test_data` AS t LIMIT 1
19/04/02 07:05:06 INFO hive.HiveImport: Loading uploaded data into Hive
19/04/02 07:05:06 INFO conf.HiveConf: Found configuration file null
2019-04-02 07:05:07,623 main ERROR Could not register mbeans java.security.AccessControlException: access denied ("javax.management.MBeanTrustPermission" "register")
        at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
medb commented 5 years ago

You must be hitting the SQOOP-3274 issue.

You can work around it by disabling log4j JMX using the log4j2.disable.jmx option:

sqoop import -D log4j2.disable.jmx=true --connect ...
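
Note that -D options are generic Hadoop arguments, so they must come right after the tool name, before any Sqoop-specific flags. A sketch of the earlier failing command with the workaround applied (connection details unchanged from that comment):

sqoop import -D log4j2.disable.jmx=true \
    --connect "jdbc:mysql://localhost:3306/test" \
    --username dev2 --password dev2 --table test_09 \
    --target-dir gs://testbuket --fields-terminated-by "," \
    --hive-import --create-hive-table --hive-table default.test_09 -m 1
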
Bidyadharbarik84 commented 5 years ago

Could you please add a sample Cloud SQL script for using Sqoop and Hive? I'm still getting:

19/04/02 07:05:05 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM test_data AS t LIMIT 1
19/04/02 07:05:06 INFO hive.HiveImport: Loading uploaded data into Hive
19/04/02 07:05:06 INFO conf.HiveConf: Found configuration file null
2019-04-02 07:05:07,623 main ERROR Could not register mbeans java.security.AccessControlException: access denied ("javax.management.MBeanTrustPermission" "register")
        at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
pmiDataGen commented 5 years ago

How can I get Sqoop to read from an Oracle data source? Any ideas?

martijnvdgrift commented 5 years ago

@Bidyadharbarik84 See the sample below. CloudSQL is just a MySQL/PostgreSQL database, so it will accept the same JDBC connection string.

gcloud dataproc jobs submit hadoop \
--cluster=incremental-imports-martijn-sqoop \
--class=org.apache.sqoop.Sqoop \
--jars=gs://mybucket/jars/sqoop-1.4.7-hadoop260.jar,gs://mybucket/jars/avro-tools-1.8.2.jar,file:///usr/share/java/mysql-connector-java-5.1.42.jar \
-- import -Dmapreduce.job.user.classpath.first=true \
--connect=jdbc:mysql://mydbhost:3306/database \
--username=hadoop \
--password-file=gs://mybucket/creds/passwordFile.txt \
--target-dir=gs://mybucket/dump/mysql_output \
--table=mytable \
--hive-database=my_hive_database \
--hive-import \
--hive-overwrite \
--hive-table=my_hive_table \
--hive-home=/usr/lib/hive/lib/ 

@pmiDataGen Just add an Oracle JDBC driver to the --jars list (this JAR can live in a GCS bucket) and use a JDBC connection string with an Oracle scheme.

Also important: make sure to use an image v1.1 based Dataproc cluster, since Sqoop 1.4.7 is incompatible with the newer Hadoop found in later Dataproc images.
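
For illustration, a sketch of an Oracle variant of the submit command above (the driver JAR name, host, port, and service name are all placeholders; only the --jars entry and --connect string change):

gcloud dataproc jobs submit hadoop \
--cluster=my-sqoop-cluster \
--class=org.apache.sqoop.Sqoop \
--jars=gs://mybucket/jars/sqoop-1.4.7-hadoop260.jar,gs://mybucket/jars/avro-tools-1.8.2.jar,gs://mybucket/jars/ojdbc8.jar \
-- import -Dmapreduce.job.user.classpath.first=true \
--connect=jdbc:oracle:thin:@//mydbhost:1521/MYSERVICE \
--username=hadoop \
--password-file=gs://mybucket/creds/passwordFile.txt \
--target-dir=gs://mybucket/dump/oracle_output \
--table=MYTABLE \
--as-avrodatafile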

rod-m commented 5 years ago

I created a cluster with:

gcloud dataproc clusters create rodsbigdata --zone us-central1-a --scopes default,sql-admin \
    --image-version 1.1 \
    --initialization-actions gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh \
    --properties hive:hive.metastore.warehouse.dir=gs://rodbigdatabucket/hive-warehouse \
    --metadata "enable-cloud-sql-hive-metastore=false" \
    --metadata "additional-cloud-sql-instances=big-data-tools-salford:us-central1:rodmysqlgoogle=tcp:3307"

Using this command to import data from a Cloud SQL database:

gcloud dataproc jobs submit hadoop \
    --bucket=gs://rodbigdatabucket \
    --cluster=rodsbigdata \
    --class=org.apache.sqoop.Sqoop \
    --jars=gs://rodbigdatabucket/jars/sqoop-1.4.7-hadoop260.jar,gs://rodbigdatabucket/jars/avro-tools-1.8.2.jar,file:///usr/share/java/mysql-connector-java-5.1.42.jar \
    -- \
    import -Dmapreduce.job.user.classpath.first=true \
    --connect=jdbc:mysql://localhost:3307/road_safety_data \
    --username rodmartin --target-dir=gs://rodbigdatabucket/mysql_output \
    --table road_safety_data \
    --hive-import \
    --hive-overwrite \
    --target-dir gs://rodbigdatabucket/hivedata

Output:

Job [7afdd9ceeae549fbb5cfbf7c3706b3e6] submitted.
Waiting for job output...
19/04/20 18:35:40 WARN tool.SqoopTool: $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
19/04/20 18:35:41 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
19/04/20 18:35:41 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
19/04/20 18:35:41 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
19/04/20 18:35:41 WARN sqoop.ConnFactory: $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:19)
Caused by: java.lang.NoClassDefFoundError: org/apache/avro/LogicalType
    at org.apache.sqoop.manager.DefaultManagerFactory.accept(DefaultManagerFactory.java:67)
    at org.apache.sqoop.ConnFactory.getManager(ConnFactory.java:184)
    at org.apache.sqoop.tool.BaseSqoopTool.init(BaseSqoopTool.java:272)
    at org.apache.sqoop.tool.ImportTool.init(ImportTool.java:96)
    at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:616)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
    ... 5 more
Caused by: java.lang.ClassNotFoundException: org.apache.avro.LogicalType
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 16 more
ERROR: (gcloud.dataproc.jobs.submit.hadoop) Job [7afdd9ceeae549fbb5cfbf7c3706b3e6] failed with error:
Job failed with message [Exception in thread "main" java.lang.reflect.InvocationTargetException]. Additional details can be found in 'gs://dataproc-ac9deec7-51ac-448f-ad8a-61a3c899c059-us/google-cloud-dataproc-metainfo/5c0ac39d-4906-4336-a8fe-5d2a1a1013b1/jobs/7afdd9ceeae549fbb5cfbf7c3706b3e6/driveroutput'.

Any help appreciated; I have tried many variations of the command. I started with a standard Dataproc image version, then tried version 1.1 based on a comment above.

medb commented 5 years ago

@rod-m You may want to try setting the mapreduce.job.classloader property to true:

gcloud dataproc jobs submit hadoop \
    --bucket=gs://rodbigdatabucket \
    --cluster=rodsbigdata \
    --class=org.apache.sqoop.Sqoop \
    --jars=gs://rodbigdatabucket/jars/sqoop-1.4.7-hadoop260.jar,gs://rodbigdatabucket/jars/avro-tools-1.8.2.jar,file:///usr/share/java/mysql-connector-java-5.1.42.jar \
    --properties=mapreduce.job.user.classpath.first=true,mapreduce.job.classloader=true \
    -- \
    import \
    --connect=jdbc:mysql://localhost:3307/road_safety_data \
    --username rodmartin --target-dir=gs://rodbigdatabucket/mysql_output \
    --table road_safety_data \
    --hive-import \
    --hive-overwrite \
    --target-dir gs://rodbigdatabucket/hivedata
rod-m commented 5 years ago

Thanks for the suggestion, but it doesn't fix it for me. (I am running the command from the Jupyter notebook installed on the cluster.) I think the problem is that sqoop-1.4.7-hadoop260.jar is not compatible with Hadoop 2.7.4! Where can I find sqoop-1.4.7-hadoop274.jar?

gcloud dataproc jobs submit hadoop \
    --cluster=rodsbigdata \
    --class=org.apache.sqoop.Sqoop \
    --jars=gs://rodbigdatabucket/jars/sqoop-1.4.7-hadoop260.jar,gs://rodbigdatabucket/jars/avro-tools-1.8.2.jar,file:///usr/share/java/mysql-connector-java-5.1.42.jar \
    -- \
    import -Dmapreduce.job.user.classpath.first=true \
    --properties 'mapred:mapreduce.job.user.classpath.first=true','mapred:mapreduce.job.classloader=true' \
    --connect=jdbc:mysql://cloudsql/big-data-tools-salford:us-central1:rodmysqlgoogle/road_safety_data \
    --username rodmartin \
    --target-dir=gs://rodbigdatabucket/mysql_output \
    --table road_safety_data \
    --hive-import

Now I just get loads of unrecognized arguments:

19/04/21 15:59:40 WARN tool.SqoopTool: $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
19/04/21 15:59:40 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
19/04/21 15:59:40 ERROR tool.BaseSqoopTool: Error parsing arguments for import:
19/04/21 15:59:40 ERROR tool.BaseSqoopTool: Unrecognized argument: --properties
19/04/21 15:59:40 ERROR tool.BaseSqoopTool: Unrecognized argument: mapred:mapreduce.job.user.classpath.first=true,mapred:mapreduce.job.classloader=true
19/04/21 15:59:40 ERROR tool.BaseSqoopTool: Unrecognized argument: --connect=jdbc:mysql://cloudsql/big-data-tools-salford:us-central1:rodmysqlgoogle/road_safety_data
19/04/21 15:59:40 ERROR tool.BaseSqoopTool: Unrecognized argument: --username
19/04/21 15:59:40 ERROR tool.BaseSqoopTool: Unrecognized argument: rodmartin
19/04/21 15:59:40 ERROR tool.BaseSqoopTool: Unrecognized argument: --target-dir=gs://rodbigdatabucket/mysql_output
19/04/21 15:59:40 ERROR tool.BaseSqoopTool: Unrecognized argument: --table
19/04/21 15:59:40 ERROR tool.BaseSqoopTool: Unrecognized argument: road_safety_data
19/04/21 15:59:40 ERROR tool.BaseSqoopTool: Unrecognized argument: --hive-import

Has anyone tried any of this in a recent Google Cloud Dataproc cluster?

medb commented 5 years ago

You put the --properties parameter in the wrong place: it is a gcloud command parameter, not a Sqoop parameter, so it must come before the -- delimiter that separates gcloud parameters from the job's (in this case Sqoop's) parameters.

Also, during job submission you don't need the mapred: prefix for properties; it's used only in cluster creation commands.
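
For example, at cluster creation the property name carries a config-file prefix (the cluster name here is a placeholder):

gcloud dataproc clusters create my-cluster \
    --properties 'mapred:mapreduce.job.classloader=true'

whereas at job submission the bare Hadoop property name is used, as in the command below.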

Could you execute this command without any modifications from the Linux command line (not a Jupyter notebook) and post the result here:

gcloud dataproc jobs submit hadoop \
    --cluster=rodsbigdata \
    --class=org.apache.sqoop.Sqoop \
    --jars=gs://rodbigdatabucket/jars/sqoop-1.4.7-hadoop260.jar,gs://rodbigdatabucket/jars/avro-tools-1.8.2.jar,file:///usr/share/java/mysql-connector-java-5.1.42.jar \
    --properties mapreduce.job.user.classpath.first=true,mapreduce.job.classloader=true \
    -- \
    import \
    --connect=jdbc:mysql://cloudsql/big-data-tools-salford:us-central1:rodmysqlgoogle/road_safety_data \
    --username rodmartin \
    --target-dir=gs://rodbigdatabucket/mysql_output \
    --table road_safety_data \
    --hive-import

rod-m commented 5 years ago

From the Google Cloud console:

gcloud dataproc jobs submit hadoop \
    --cluster=rodsbigdata \
    --class=org.apache.sqoop.Sqoop \
    --jars=gs://rodbigdatabucket/jars/sqoop-1.4.7-hadoop260.jar,gs://rodbigdatabucket/jars/avro-tools-1.8.2.jar,file:///usr/share/java/mysql-connector-java-5.1.42.jar \
    --properties mapreduce.job.user.classpath.first=true,mapreduce.job.classloader=true \
    -- \
    import \
    --connect=jdbc:mysql://cloudsql/big-data-tools-salford:us-central1:rodmysqlgoogle/road_safety_data \
    --username rodmartin \
    --target-dir=gs://rodbigdatabucket/mysql_output \
    --table road_safety_data \
    --hive-import

Output:

Job [a3f6ff982f7f46bea2e1eca98e002e08] submitted.
Waiting for job output...
19/04/21 17:35:32 WARN tool.SqoopTool: $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
19/04/21 17:35:32 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
19/04/21 17:35:32 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
19/04/21 17:35:32 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
19/04/21 17:35:32 WARN sqoop.ConnFactory: $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:19)
Caused by: java.lang.NoClassDefFoundError: org/apache/avro/LogicalType
        at org.apache.sqoop.manager.DefaultManagerFactory.accept(DefaultManagerFactory.java:67)
        at org.apache.sqoop.ConnFactory.getManager(ConnFactory.java:184)
        at org.apache.sqoop.tool.BaseSqoopTool.init(BaseSqoopTool.java:272)
        at org.apache.sqoop.tool.ImportTool.init(ImportTool.java:96)
        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:616)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
        ... 5 more
Caused by: java.lang.ClassNotFoundException: org.apache.avro.LogicalType
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 16 more
medb commented 5 years ago

Re https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/issues/268#issuecomment-485269488

The class org.apache.avro.LogicalType is available only in Avro 1.8: https://search.maven.org/search?q=fc:org.apache.avro.LogicalType%20AND%20a:avro

Try using Sqoop 1.4.6 and Avro 1.7.7 (the same version Hadoop uses) instead of Sqoop 1.4.7, which uses Avro 1.8.
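
A sketch of the resubmitted job with the downgraded JARs (assuming sqoop-1.4.6 with the hadoop260 classifier and avro-tools-1.7.7 have been downloaded from Maven Central and staged in the bucket):

gcloud dataproc jobs submit hadoop \
    --cluster=rodsbigdata \
    --class=org.apache.sqoop.Sqoop \
    --jars=gs://rodbigdatabucket/jars/sqoop-1.4.6-hadoop260.jar,gs://rodbigdatabucket/jars/avro-tools-1.7.7.jar,file:///usr/share/java/mysql-connector-java-5.1.42.jar \
    --properties mapreduce.job.user.classpath.first=true \
    -- \
    import \
    --connect=jdbc:mysql://localhost:3307/road_safety_data \
    --username rodmartin \
    --table road_safety_data \
    --target-dir gs://rodbigdatabucket/hivedata \
    --hive-import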

Bidyadharbarik84 commented 5 years ago

Hi, can someone help me with how to load data from a GCS bucket into Cloud SQL using Dataproc with Sqoop?

kranthikumar9625 commented 5 years ago

I'm running the following Sqoop command on a GCP Hadoop cluster:

/usr/lib/sqoop/bin/sqoop import \
 --connect jdbc:mysql://34.67.251.162:3306/retail_db \
 --username root \
 --password root \
 --table order_items \
 --hive-import \
 --hive-database datkranthi \
 --create-hive-table \
 --as-textfile \
 --fields-terminated-by ';' \
 -m 3

I am getting the following error:

        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.sqoop.hive.HiveImport.executeScript(HiveImport.java:331)
        at org.apache.sqoop.hive.HiveImport.importTable(HiveImport.java:241)
        at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:537)
        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:628)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:252)

19/08/17 14:00:23 WARN conf.HiveConf: HiveConf of name hive.hwi.war.file does not exist

Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-2.3.4.jar!/hive-log4j2.properties Async: true
19/08/17 14:00:23 INFO SessionState: 
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-2.3.4.jar!/hive-log4j2.properties Async: true
19/08/17 14:00:23 INFO session.SessionState: Created HDFS directory: /tmp/hive/kranthikumar9625/b6260189-3711-4897-bc78-6b89064e40cc
19/08/17 14:00:24 INFO session.SessionState: Created local directory: /tmp/kranthikumar9625/b6260189-3711-4897-bc78-6b89064e40cc
19/08/17 14:00:24 INFO session.SessionState: Created HDFS directory: /tmp/hive/kranthikumar9625/b6260189-3711-4897-bc78-6b89064e40cc/_tmp_space.db
19/08/17 14:00:24 ERROR tool.ImportTool: Import failed: java.io.IOException: Exception thrown in Hive
        at org.apache.sqoop.hive.HiveImport.executeScript(HiveImport.java:358)
        at org.apache.sqoop.hive.HiveImport.importTable(HiveImport.java:241)
        at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:537)
        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:628)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.sqoop.hive.HiveImport.executeScript(HiveImport.java:331)
        ... 9 more
Caused by: java.lang.NoClassDefFoundError: org/apache/tez/dag/api/SessionNotRunning
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:614)
        at org.apache.hadoop.hive.ql.session.SessionState.beginStart(SessionState.java:549)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:750)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
        ... 14 more
Caused by: java.lang.ClassNotFoundException: org.apache.tez.dag.api.SessionNotRunning
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 18 more

Please answer this.

medb commented 4 years ago

The Sqoop init action was added in #465.
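
With that merged, a cluster with Sqoop preinstalled can be created roughly like this (a sketch, assuming the action is published as sqoop/sqoop.sh in the public regional init-actions bucket):

REGION=us-central1  # substitute your cluster's region
gcloud dataproc clusters create my-sqoop-cluster \
    --region ${REGION} \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/sqoop/sqoop.sh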