aliyun / aliyun-maxcompute-data-collectors


Sqoop could not parse record when exporting data from MaxCompute to PostgreSQL #9

Open giaosudau opened 7 years ago

giaosudau commented 7 years ago

Hi Ali, I am using Sqoop to export data from MaxCompute to PostgreSQL.

./odps-sqoop/bin/sqoop export --connect jdbc:postgresql://localhost:5432/replication_db --table dim_wmp_cabinet \
    --username replication_user --password replication_pass \
    --odps-table dim_wmp_cabinet --odps-project xxx --odps-accessid xxx \
    --odps-tunnel-endpoint http://xxxx \
    --odps-partition-spec ds=20170916 \
    --odps-accesskey xxxx --odps-endpoint http://sxxx/api

I am looking into this code in OdpsExportMapper.java:

try {
      odpsImpl.parse(val);
      context.write(odpsImpl, NullWritable.get());

I managed to add the tunnel endpoint to the source code, but I couldn't get past this one.

Please help fix this issue.

17/09/18 18:16:46 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
17/09/18 18:16:46 INFO mapreduce.Job: Running job: job_local873411290_0001
17/09/18 18:16:46 INFO mapred.LocalJobRunner: OutputCommitter set in config null
17/09/18 18:16:46 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.sqoop.mapreduce.NullOutputCommitter
17/09/18 18:16:46 INFO mapred.LocalJobRunner: Waiting for map tasks
17/09/18 18:16:46 INFO mapred.LocalJobRunner: Starting task: attempt_local873411290_0001_m_000000_0
17/09/18 18:16:46 INFO util.ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
17/09/18 18:16:46 INFO mapred.Task:  Using ResourceCalculatorProcessTree : null
17/09/18 18:16:46 INFO mapred.MapTask: Processing split: org.apache.sqoop.mapreduce.odps.OdpsExportInputFormat$OdpsExportInputSplit@6a6d595f
17/09/18 18:16:46 ERROR odps.OdpsExportMapper: Exception raised during data export
17/09/18 18:16:46 ERROR odps.OdpsExportMapper: Exception:
java.lang.RuntimeException: Can't parse input data: '3'
    at dim_wmp_cabinet.__loadFromFields(dim_wmp_cabinet.java:2090)
    at dim_wmp_cabinet.parse(dim_wmp_cabinet.java:1533)
    at org.apache.sqoop.mapreduce.odps.OdpsExportMapper.map(OdpsExportMapper.java:77)
    at org.apache.sqoop.mapreduce.odps.OdpsExportMapper.map(OdpsExportMapper.java:35)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
    at dim_wmp_cabinet.__loadFromFields(dim_wmp_cabinet.java:2010)
    ... 13 more
17/09/18 18:16:46 ERROR odps.OdpsExportMapper: On input: com.aliyun.odps.data.ArrayRecord@4c665e0d
17/09/18 18:16:46 ERROR odps.OdpsExportMapper: At position 0
17/09/18 18:16:46 ERROR odps.OdpsExportMapper:
oyz commented 7 years ago

Can you provide the schemas of table 'dim_wmp_cabinet' in both PostgreSQL and MaxCompute, and give some example data, please?

oyz commented 7 years ago

The cause of this error has been found: it is the null data in the table. The latest code fixes this.
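
For reference, a minimal sketch of the failure mode and the kind of null guard such a fix implies. The column index and the Integer target type are hypothetical, not taken from the generated dim_wmp_cabinet.java:

import com.aliyun.odps.data.Record;

public class NullSafeField {
    // Unguarded: a NULL cell comes back from the record as a null Object,
    // so calling toString() on it throws the NullPointerException that
    // parse() wraps as "Can't parse input data".
    static Integer loadUnsafe(Record record, int i) {
        return Integer.valueOf(record.get(i).toString()); // NPE when the cell is NULL
    }

    // Guarded: propagate NULL to the target database instead of dereferencing it.
    static Integer loadSafe(Record record, int i) {
        Object cell = record.get(i);
        return cell == null ? null : Integer.valueOf(cell.toString());
    }
}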

giaosudau commented 7 years ago

Why isn't the partition field value included in the result? And why do you require a partitionSpec?

oyz commented 7 years ago

In MaxCompute, reading a partitioned table requires specifying a partitionSpec, but the records read from a specific partition do not include the partition values.
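
A minimal sketch of that read path with the tunnel SDK (the access id/key, project, and endpoints are the placeholders from the command above):

import com.aliyun.odps.Odps;
import com.aliyun.odps.PartitionSpec;
import com.aliyun.odps.account.AliyunAccount;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.tunnel.TableTunnel;
import com.aliyun.odps.tunnel.io.TunnelRecordReader;

public class ReadPartition {
    public static void main(String[] args) throws Exception {
        Odps odps = new Odps(new AliyunAccount("xxx", "xxxx"));
        odps.setEndpoint("http://sxxx/api");
        odps.setDefaultProject("xxx");

        TableTunnel tunnel = new TableTunnel(odps);
        tunnel.setEndpoint("http://xxxx"); // --odps-tunnel-endpoint

        // The partitionSpec is mandatory when the table is partitioned.
        PartitionSpec spec = new PartitionSpec("ds=20170916");
        TableTunnel.DownloadSession session =
            tunnel.createDownloadSession("xxx", "dim_wmp_cabinet", spec);

        TunnelRecordReader reader =
            session.openRecordReader(0, session.getRecordCount());
        Record r;
        while ((r = reader.read()) != null) {
            // r carries only the regular columns; there is no "ds" field here.
        }
        reader.close();
    }
}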

For Sqoop, maybe we should add an option that appends the partition values to the result records for convenience.
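
Something like the following sketch, where the record shape is hypothetical: parse the spec passed via --odps-partition-spec and attach its values to every exported row.

import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionSpecAppend {
    // Parse a spec such as "ds=20170916" (or "ds=20170916,hh=12") into key/value pairs.
    static Map<String, String> parseSpec(String spec) {
        Map<String, String> kv = new LinkedHashMap<>();
        for (String part : spec.split(",")) {
            String[] pair = part.split("=", 2);
            kv.put(pair[0].trim(), pair[1].trim());
        }
        return kv;
    }

    public static void main(String[] args) {
        // A record read from the partition lacks the "ds" column; append it here.
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("cabinet_id", 3); // hypothetical column from dim_wmp_cabinet
        record.putAll(parseSpec("ds=20170916"));
        System.out.println(record); // {cabinet_id=3, ds=20170916}
    }
}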

giaosudau commented 7 years ago

It should be added. We want to load all the data, but without the partition columns we have to add an extra step afterwards to fill the partition data back in.
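
For example, a backfill like this after each export (assuming the PostgreSQL table has a ds column that arrives NULL; the connection details are the placeholders from the command above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BackfillPartition {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/replication_db",
                "replication_user", "replication_pass");
             PreparedStatement ps = conn.prepareStatement(
                 "UPDATE dim_wmp_cabinet SET ds = ? WHERE ds IS NULL")) {
            ps.setString(1, "20170916"); // the exported partition
            System.out.println(ps.executeUpdate() + " rows backfilled");
        }
    }
}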