geotrellis / geotrellis-pointcloud

GeoTrellis PointCloud library to work with any pointcloud data on Spark
Apache License 2.0

S3 reader: it is not connecting to the defined endpoint. #8

Closed romulogoncalves closed 6 years ago

romulogoncalves commented 6 years ago

Hi,

I am trying to read point cloud data into Spark from local storage that provides an S3 API (in our case we use Minio). To do that, I define the following in core-site.xml:

fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3a.endpoint http://<end_point_ip>:9091
fs.s3a.access.key <access_key>
fs.s3a.secret.key <secret_key>
fs.s3a.connection.ssl.enabled false
fs.s3a.path.style.access true
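
For reference, the same fs.s3a.* keys can also be set programmatically on the SparkContext's Hadoop configuration instead of core-site.xml (a minimal sketch; the endpoint and keys are placeholders):

// Sketch: the same fs.s3a.* settings applied on the SparkContext's Hadoop configuration.
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://<end_point_ip>:9091")
sc.hadoopConfiguration.set("fs.s3a.access.key", "<access_key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<secret_key>")
sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "false")
sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true")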

When I do a normal read from S3 it works, for example:

val sonnets = sc.textFile("s3a://files/sonnets.txt")
val counts = sonnets.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.take(1)

When I try to read a LAZ file using S3PointCloudRDD, it fails because it attempts to connect to the AWS service. I use the following code:

val s3_laz_path = "s3a://ahn3"
val s3_files = "C_25EZ2.laz"
val pipelineExpr = Read("local") ~ HagFilter() // ~ LasWrite(s3_laz_path + "C_25EZ2_hag.laz")

val s3_rdd_laz = S3PointCloudRDD(s3_laz_path, s3_files, options = S3PointCloudRDD.Options(pipeline = pipelineExpr))

Which defines an RDD:

s3_laz_path = s3a://ahn3
s3_files = C_25EZ2.laz
pipelineExpr = List(Read(local,None,None,None), HagFilter(filters.hag))
s3_rdd_laz = NewHadoopRDD[0] at newAPIHadoopRDD at S3PointCloudRDD.scala:88


Then I ask for the schema, which triggers execution:

s3_rdd_laz.map{ case (h, i) => h.schema}.take(1)

Error:

Name: com.amazonaws.SdkClientException
Message: Unable to load AWS credentials from any provider in the chain
StackTrace:   at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:131)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.getCredentialsFromContext(AmazonHttpClient.java:1164)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.runBeforeRequestHandlers(AmazonHttpClient.java:762)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:724)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
  at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
  at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4319)
  at com.amazonaws.services.s3.AmazonS3Client.getBucketRegionViaHeadRequest(AmazonS3Client.java:5080)
  at com.amazonaws.services.s3.AmazonS3Client.fetchRegionFromCache(AmazonS3Client.java:5054)
  at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4303)
  at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4266)
  at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4260)
  at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:831)
  at geotrellis.spark.io.s3.AmazonS3Client.listObjects(AmazonS3Client.scala:45)
  at geotrellis.spark.io.s3.S3Client$$anon$1.<init>(S3Client.scala:113)
  at geotrellis.spark.io.s3.S3Client$class.listObjectsIterator(S3Client.scala:112)
  at geotrellis.spark.io.s3.AmazonS3Client.listObjectsIterator(AmazonS3Client.scala:43)
  at geotrellis.spark.io.s3.S3InputFormat.getSplits(S3InputFormat.scala:127)
  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:127)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1337)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1331)

It seems it is connecting to the AWS services and not to the endpoint we defined in core-site.xml. Do we need to set some extra configuration? Reading from HDFS works without issues.

pomadchin commented 6 years ago

Hey @romulogoncalves, you also need to set up AWS S3 SDK credentials: https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html
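
For example, one source the SDK's default provider chain checks is JVM system properties (a minimal sketch; the values are placeholders for your Minio credentials, and environment variables or ~/.aws/credentials work as well):

// The AWS SDK v1 default chain also reads the aws.accessKeyId / aws.secretKey
// system properties; set them before any S3 client is created.
System.setProperty("aws.accessKeyId", "<access_key>")
System.setProperty("aws.secretKey", "<secret_key>")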

pomadchin commented 6 years ago

GeoTrellis reads data from S3 via the AWS S3 SDK and not via the Hadoop API. link

pomadchin commented 6 years ago

In your case, if you still want to use S3 but through the Hadoop API, try using the common HadoopPointCloudRDD.
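
A minimal sketch of how that could look, assuming HadoopPointCloudRDD takes a Hadoop Path and an Options carrying the pipeline, analogous to the S3PointCloudRDD call above (the import path may differ between geotrellis-pointcloud versions):

import org.apache.hadoop.fs.Path
// Package path is an assumption; check your geotrellis-pointcloud version.
import geotrellis.pointcloud.spark.io.hadoop.HadoopPointCloudRDD

// Read the same LAZ file through the Hadoop API (s3a://), so the fs.s3a.*
// settings from core-site.xml (custom endpoint, credentials) are honored.
val hadoop_rdd_laz = HadoopPointCloudRDD(
  new Path("s3a://ahn3/C_25EZ2.laz"),
  options = HadoopPointCloudRDD.Options(pipeline = pipelineExpr)
)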

romulogoncalves commented 6 years ago

@pomadchin thanks for the quick reply.

Now I understand how S3PointCloudRDD works.

For now I will just use HadoopPointCloudRDD to read from local object storage with an S3 API. I just tested it and it works.

Out of curiosity, is there any performance difference when using HadoopPointCloudRDD instead of S3PointCloudRDD? Why not have only HadoopPointCloudRDD to access both HDFS and object storage with an S3 API?

pomadchin commented 6 years ago

@romulogoncalves there is: S3 should be faster. The Hadoop API is a bit slower in this case, though you can double-check.

romulogoncalves commented 6 years ago

Thanks for the reply. I think the issue can be closed.