feedzai / feedzai-openml-java

Implementations of Feedzai's OpenML APIs that allow machine learning models to be used from the Java programming language.
https://www.feedzai.com
Apache License 2.0

Make httpclient dependency of H2O provided #5

Closed: nmldiegues closed this pull request 5 years ago

nmldiegues commented 5 years ago

The "httpclient" dependency is very popular and used across many projects. In this case, H2O depends on it and it conflicted with AWS dependencies when we use H2O OpenML in Spark jobs on AWS EMR.

This way we make it provided so that it can be decided by users.
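For context, a minimal sketch of what a provided-scope declaration looks like in a Maven pom.xml. The coordinates are the standard Apache HttpClient ones; the exact placement and version handling inside the openml-h2o module are assumptions rather than the actual diff of this PR:

```xml
<!-- Sketch: httpclient stays available at compile time but is not bundled into the
     packaged/shaded artifact, leaving the runtime version up to the consumer.
     The version below is illustrative; the project may manage it elsewhere. -->
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.6</version>
    <scope>provided</scope>
</dependency>
```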

nmldiegues commented 5 years ago

The original problem was:

TaskSetManager.logWarning:66 Lost task 0.0 in stage 0.0 (TID 0, ip-10-52-126-136.eu-west-2.compute.internal, executor 1): java.lang.NoSuchFieldError: INSTANCE
    at org.apache.http.conn.ssl.SSLConnectionSocketFactory.<clinit>(SSLConnectionSocketFactory.java:146)
    at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.getPreferredSocketFactory(ApacheConnectionManagerFactory.java:86)
    at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:63)
    at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:56)
    at com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:50)
    at com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:38)
    at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:315)
    at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:299)
    at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:169)
    at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:579)
    at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:559)
    at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:537)
    at org.apache.hadoop.fs.s3a.S3ClientFactory$DefaultS3ClientFactory.createAmazonS3Client(S3ClientFactory.java:202)
    at org.apache.hadoop.fs.s3a.S3ClientFactory$DefaultS3ClientFactory.createS3Client(S3ClientFactory.java:78)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:186)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2859)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2878)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:392)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
    at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
    at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
    at com.feedzai.pulse.datascience.datasource.csv.CsvDataSourceStringSplitReader.getDataSplitData(CsvDataSourceStringSplitReader.java:226)
    at com.feedzai.pulse.datascience.datasource.csv.CsvDataSourceSplitReader.getDataSplitData(CsvDataSourceSplitReader.java:79)
    at com.feedzai.distributed.job.backend.spark.rdd.DataReaderRDD.compute(DataReaderRDD.java:69)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Validated manually that the class AllowAllHostnameVerifier.class appeared in the openml-h2o jar before this change and no longer appears after it.
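To illustrate what "decided by users" means downstream (this consumer setup is an assumption, not part of this PR): an application that depends on openml-h2o and needs Apache HttpClient at runtime would now declare it itself, picking the single version compatible with the rest of its classpath, e.g. the one the AWS SDK expects. The com.feedzai group id and the 0.5.0 / 4.5.6 versions below are illustrative:

```xml
<!-- Hypothetical consumer pom.xml fragment: openml-h2o no longer ships httpclient,
     so the application declares the one copy that should win on its classpath. -->
<dependency>
    <groupId>com.feedzai</groupId>
    <artifactId>openml-h2o</artifactId>
    <version>0.5.0</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.6</version>
</dependency>
```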

TravisBuddy commented 5 years ago

Hey @nmldiegues,
Your changes look good to me!

View build log

TravisBuddy Request Identifier: 8e238920-f97a-11e8-92dd-7b192e922be5

codecov[bot] commented 5 years ago

Codecov Report

Merging #5 into master will not change coverage. The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##             master       #5   +/-   ##
=========================================
  Coverage     75.43%   75.43%           
  Complexity      220      220           
=========================================
  Files            22       22           
  Lines           753      753           
  Branches         70       70           
=========================================
  Hits            568      568           
  Misses          147      147           
  Partials         38       38

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 2776e64...cd366e7.

nmldiegues commented 5 years ago

This passed our automated tests with embedded Spark as well as Spark on Cloudera and standalone deployments. Therefore, proceeding to merge, back-port, and release a 0.5 hotfix.