CODAIT / stocator

Stocator is a high-performing connector to object storage for Apache Spark, achieving performance by leveraging object storage semantics.
Apache License 2.0

Possible Stocator config issue #227

Open desimonemike123 opened 4 years ago

desimonemike123 commented 4 years ago

We're hitting a Stocator configuration issue within an HDP 2.6.5 cluster (ships with Spark 2.3.x, HDFS, YARN, MapReduce2 2.7.x). Per the Stocator doc, we built stocator-1.0.35-IBM-SDK.jar and configured IBM COS buckets; using spark-submit with the --jars option, we were able to read and write to the buckets without issue. Within the Spark program we set the various req'd keys (fs.cos.serviceName.iam.api.key, etc.) on _jsc.hadoopConfiguration(). We also use Jupyter notebooks, and to enable the notebook env we installed the stocator.jar file into the .../hdp/2.6.5.0-292/spark2/jars and .../hdp/2.6.5.0-292/hadoop/lib directories across the cluster, and were able to read/write to the buckets without issue as well.
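For readers following along, the per-job configuration described above can be sketched roughly as follows. This is a minimal sketch, not the poster's actual code: the helper names `stocator_cos_conf` and `apply_conf` are hypothetical, the endpoint value is a placeholder, and the property names other than `fs.cos.<serviceName>.iam.api.key` (which the post mentions) are taken from my reading of the Stocator README and should be checked against the version in use.

```python
def stocator_cos_conf(service, endpoint, api_key, service_id=None):
    """Build the Hadoop properties for one Stocator COS service.

    `service` is the serviceName part of cos://bucketName.serviceName/dir.
    Property names are assumptions based on the Stocator README.
    """
    conf = {
        # Register the cos:// scheme with Stocator.
        "fs.stocator.scheme.list": "cos",
        "fs.cos.impl": "com.ibm.stocator.fs.ObjectStoreFileSystem",
        "fs.stocator.cos.impl": "com.ibm.stocator.fs.cos.COSAPIClient",
        "fs.stocator.cos.scheme": "cos",
        # Per-service settings; the service name is part of the key.
        f"fs.cos.{service}.endpoint": endpoint,
        f"fs.cos.{service}.iam.api.key": api_key,
    }
    if service_id is not None:
        conf[f"fs.cos.{service}.iam.service.id"] = service_id
    return conf


def apply_conf(hadoop_conf, props):
    """Copy properties onto any object exposing set(key, value),
    such as the Hadoop Configuration returned by PySpark."""
    for key, value in props.items():
        hadoop_conf.set(key, value)


# Usage inside a PySpark job (requires a running SparkSession named `spark`):
#   props = stocator_cos_conf("mab", "<COS endpoint>", os.environ["COS_API_KEY"])
#   apply_conf(spark.sparkContext._jsc.hadoopConfiguration(), props)
```

Setting the keys this way scopes them to the Spark application, which is why it works for spark-submit and notebooks but does not by itself reach a separate HiveServer2 process.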

We typically define external Hive tables over our HDFS data, and this is where we are encountering issues with Stocator and COS. We determined that the Stocator jar also needed to be installed under .../hdp/2.6.5.0-292/hive/lib. We couldn't find a way to dynamically pass in the req'd keys to Hive (using Beeline or Spark) to create the table successfully. Note that the table definition now has a Location parameter in the form "cos://bucketName.serviceName/dir". We found that if we added all the req'd fs.cos keys to our cluster's core-site.xml, we could then create the table in Hive. Is there a way to dynamically pass in the keys? Having them present in core-site.xml presents security issues.

The new external table definition in Hive looks correct. Where we're currently stuck is when we try to retrieve data from the table. Whether we retrieve using spark.sql (select * from HiveTableName ...) or from the Beeline CLI, we get an error that leads me to believe we're missing some configuration. Detailed stack trace info is below. As you can see, Stocator does appear to "List" the files in the bucket directory without issue, but then we encounter the error java.net.UnknownHostException: mab-ancillary.mab. mab-ancillary is the bucketName and mab is the serviceName, so we believe we're missing a configuration step. Note, for 'fun', we created a bogus DNS entry with host 'mab-ancillary.mab' and the IP address of our COS endpoint, and then retrieval does in fact work. Any help would be appreciated.

2019-12-12 19:52:50,072 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: cos.COSAPIClient (COSAPIClient.java:isStocatorOrigin(1202)) - isStocatorOrigin: for anc_table/
2019-12-12 19:52:50,072 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: cos.COSAPIClient (COSAPIClient.java:isStocatorOrigin(1209)) - isStocatorOrigin: found cached for stocator origin for anc_table. Status true
2019-12-12 19:52:50,072 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: cos.COSAPIClient (COSAPIClient.java:isStocatorOrigin(1202)) - isStocatorOrigin: for anc_table/
2019-12-12 19:52:50,072 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: cos.COSAPIClient (COSAPIClient.java:isStocatorOrigin(1209)) - isStocatorOrigin: found cached for stocator origin for anc_table. Status true
2019-12-12 19:52:50,072 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: cos.COSAPIClient (COSAPIClient.java:createFileStatus(696)) - createFileStatus: found exact file: fake directory cos://mab-ancillary.mab/anc_table/_SUCCESS
2019-12-12 19:52:50,072 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: cos.COSAPIClient (COSAPIClient.java:isStocatorOrigin(1202)) - isStocatorOrigin: for anc_table/
2019-12-12 19:52:50,072 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: cos.COSAPIClient (COSAPIClient.java:isStocatorOrigin(1209)) - isStocatorOrigin: found cached for stocator origin for anc_table. Status true
2019-12-12 19:52:50,073 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: cos.COSAPIClient (COSAPIClient.java:createFileStatus(699)) - createFileStatus: found exact file: normal file cos://mab-ancillary.mab/anc_table/part-00000-c2be6382-7064-4d49-a819-6fbaf75d29b1-c000-attempt_20191209164602_0001_m_000000_0.csv
2019-12-12 19:52:50,073 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: cos.COSAPIClient (COSAPIClient.java:createFileStatus(699)) - createFileStatus: found exact file: normal file cos://mab-ancillary.mab/anc_table/part-00001-c2be6382-7064-4d49-a819-6fbaf75d29b1-c000-attempt_20191209164602_0001_m_000001_0.csv
2019-12-12 19:52:50,073 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: cos.COSAPIClient (COSAPIClient.java:createFileStatus(699)) - createFileStatus: found exact file: normal file cos://mab-ancillary.mab/anc_table/part-00001-c2be6382-7064-4d49-a819-6fbaf75d29b1-c000-attempt_20191209164602_0001_m_000001_0.csv
2019-12-12 19:52:50,073 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: fs.ObjectStoreFileSystem (ObjectStoreFileSystem.java:listStatus(395)) - listStatus: cos://mab-ancillary.mab/anc_table completed. return 2 results
2019-12-12 19:52:50,079 INFO  [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: session.HiveSessionImpl (HiveSessionImpl.java:releaseBeforeOpLock(366)) - We are resetting the hadoop caller context for thread HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice
2019-12-12 19:52:50,079 DEBUG [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: security.UserGroupInformation (UserGroupInformation.java:doAs(1873)) - PrivilegedActionException as:ambari-server (auth:PROXY) via hive/hive.aice.svc.cluster.local@SL.CLOUD9.IBM.COM (auth:KERBEROS) cause:org.apache.hive.service.cli.HiveSQLException: java.io.IOException: java.lang.IllegalArgumentException: java.net.UnknownHostException: mab-ancillary.mab
2019-12-12 19:52:50,079 WARN  [HiveServer2-HttpHandler-Pool: Thread-281 - /cliservice]: thrift.ThriftCLIService (ThriftCLIService.java:FetchResults(718)) - Error fetching results: org.apache.hive.service.cli.HiveSQLException: java.io.IOException: java.lang.IllegalArgumentException: java.net.UnknownHostException: mab-ancillary.mab
    at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:416)
    at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:243)
    at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:793)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:6
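To make the failing name concrete: the authority of the table location is a bucketName.serviceName pair, not a DNS hostname, so any code path that tries to resolve it as a host would fail in the way the log shows. The sketch below only illustrates that decomposition with Python's standard URL parser; `split_cos_uri` is a hypothetical helper, and splitting on the last dot is my assumption, not necessarily how Stocator parses the URI internally.

```python
from urllib.parse import urlparse


def split_cos_uri(uri):
    """Split cos://bucketName.serviceName/path into its three parts.

    Assumption: the service name is whatever follows the last dot
    of the authority, and everything before it is the bucket name.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "cos":
        raise ValueError(f"not a cos:// URI: {uri!r}")
    bucket, sep, service = parsed.netloc.rpartition(".")
    if not sep or not bucket or not service:
        raise ValueError(f"authority is not bucketName.serviceName: {parsed.netloc!r}")
    return bucket, service, parsed.path.lstrip("/")


bucket, service, key = split_cos_uri("cos://mab-ancillary.mab/anc_table")
print(bucket, service, key)  # mab-ancillary mab anc_table
```

The service name recovered here ("mab") is what selects the fs.cos.mab.* properties, including the real COS endpoint; the string "mab-ancillary.mab" itself is never meant to be looked up in DNS.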

gilv commented 4 years ago

@desimonemike123 based on the log, it seems you are using Hive, right? If this is the case, then Stocator doesn't support Hive flows.

desimonemike123 commented 4 years ago

Thx for the fast reply Gil, I appreciate it. Yes, we do currently utilize Hive to store schema and partition metadata for our externally defined tables on HDFS. I was trying to follow that same convention for data that resides on IBM COS. I was reading the documentation for IBM's Analytics Engine, which states that it uses Stocator as its connector to COS and also provides a sample of defining Hive tables over COS (https://cloud.ibm.com/docs/services/AnalyticsEngine?topic=AnalyticsEngine-working-with-hive). I therefore assumed that Stocator supports Hive. I haven't stood up the Analytics Engine service yet, but I assume either that dev team added the Hive support, or I'll encounter a similar issue there. It's my understanding that Spark takes advantage of the partition metadata stored in Hive when the table is queried (avoiding an up-front discovery of all partitions/sub-directories for the given data set). This is one of the reasons I'm trying to support external Hive tables located on both HDFS and COS within our HDP cluster.