alexarchambault / ammonite-spark

Run spark calculations from Ammonite

Use it with popular services #5

Open mycaule opened 6 years ago

mycaule commented 6 years ago

Hello,

It would be nice if you could provide instructions on how to use it with AWS (AWS EMR, Flintrock on EC2) or GCP (Google Cloud Dataproc), and how to use it from IntelliJ as well.

This could be a great CLI alternative to Zeppelin.

alexarchambault commented 5 years ago

FYI, this gist lists commands to get ammonite-spark up-and-running with an EMR cluster. I'd like to make it an actual tutorial, but haven't found the time yet.

alexarchambault commented 5 years ago

cc @mpacer who was also interested in that (link in my previous comment)

mycaule commented 5 years ago

Thank you very much.

The script works on my side.

I found Heather Miller's tutorial on Flintrock + S3 quite helpful; it would be a good reference if you ever write a tutorial from the gist.

IMPORTANT!: Before we can go any further, we need to ensure that a handful of specific dependencies for S3 are available on our Spark cluster. As of the time of writing, this is a workaround, which can be solved by downloading a recent Hadoop 2.7.x distribution and a specific, older version of an AWS JAR (1.7.4) that is typically not available in the EC2 Maven Repository.
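In Ammonite terms, the workaround quoted above roughly translates to pulling a Hadoop 2.7.x build of hadoop-aws, which transitively depends on the older aws-java-sdk 1.7.4 mentioned there. A minimal sketch (the exact version is an assumption; match it to your cluster's Hadoop line):

```scala
// Sketch only: hadoop-aws 2.7.x builds depend on aws-java-sdk 1.7.4,
// so this single import covers both jars from the quoted workaround.
@ import $ivy.`org.apache.hadoop:hadoop-aws:2.7.3`
```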

mycaule commented 5 years ago

I get this error message when trying to read Parquet from S3:

@ spark.read.parquet("s3a://bucket/path/to/parquet") 
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
  org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2154)
  org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2580)
  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
  org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)

This import wasn't enough:

@ import $ivy.`org.apache.hadoop:hadoop-aws:2.8.4`, $ivy.`com.amazonaws:aws-java-sdk-s3:1.11.336`, $ivy.`com.amazonaws:aws-java-sdk-emr:1.11.336`
alexarchambault commented 5 years ago

@mycaule Did you add the extra dependencies before creating the Spark session? (If you add them afterwards, call AmmoniteSparkSession.sync() instead.)
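The ordering matters because executors only pick up dependencies known when the session is built. A minimal sketch of both options (dependency versions are illustrative):

```scala
// Option 1: add dependencies *before* building the session
@ import $ivy.`org.apache.hadoop:hadoop-aws:2.8.4`
@ val spark = AmmoniteSparkSession.builder().master("yarn").getOrCreate()

// Option 2: a dependency added *after* the session exists only reaches
// the executors once the updated classpath is pushed to them
@ import $ivy.`com.amazonaws:aws-java-sdk-s3:1.11.336`
@ AmmoniteSparkSession.sync()
```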

mycaule commented 5 years ago

I added them after. I'll try this afternoon to add them before, or to use sync. Thanks.

mycaule commented 5 years ago

After adding the imports in the correct place,

@ import $ivy.`com.sun.jersey:jersey-client:1.9.1`, $ivy.`org.apache.spark::spark-sql:2.3.1`, $ivy.`sh.almond::ammonite-spark:0.1.1`
@ import $ivy.`org.apache.hadoop:hadoop-aws:2.8.4`, $ivy.`com.amazonaws:aws-java-sdk-s3:1.11.336`, $ivy.`com.amazonaws:aws-java-sdk-emr:1.11.336`
@ val spark = {
             AmmoniteSparkSession.builder()
               .progressBars()
               .master("yarn")
               .config("spark.executor.instances", "4")
               .config("spark.executor.memory", "2g")
               .getOrCreate()
           }
@ ...

... I get another error now, making progress...

I am using EMR 5.16 with the latest versions available and supported by the platform.

@ spark.read.parquet("s3a://bucket/path/to/parquet") 
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
  org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
  org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
  org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
  org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2598)
  org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
  org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
  org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
  org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
  org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
  org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
  org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
  org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
  org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
  org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:622)
  org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:606)

And without S3a

@ spark.read.parquet("s3://bucket/path/to/parquet") 
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
  org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2154)
  org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2580)

The same command works fine in the default spark-shell available with EMR (without imports).

scala> spark.read.parquet("s3://bucket/path/to/parquet")
18/11/13 17:05:56 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
18/11/13 17:05:56 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
18/11/13 17:05:57 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
res0: org.apache.spark.sql.DataFrame = [id: int, idannonce: int ... 198 more fields]
mycaule commented 5 years ago

To solve the Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found error, two related issues (linked below) tell us to add these to the classpath:

/usr/share/aws/emr/emrfs/lib/*
/usr/share/aws/emr/emrfs/auxlib/*
/usr/share/aws/emr/emr-metrics/lib/*
/usr/share/aws/emr/emrfs/conf

or 

[
  {
    "classification":"spark-defaults",
    "properties": {
      "spark.executor.extraClassPath": "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*",
      "spark.driver.extraClassPath": "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*"
    }
  }
]
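As an alternative to the cluster-wide spark-defaults classification above, the same EMRFS directories could be passed per-session through the builder. A sketch, assuming EMR 5.x paths; note that spark.driver.extraClassPath may not take effect this late, since the driver JVM (the Ammonite process) is already running:

```scala
// Sketch: pass the EMRFS jars/conf to the executors per-session
// instead of via a cluster-wide spark-defaults classification.
@ val emrfsClasspath = Seq(
    "/usr/share/aws/emr/emrfs/conf",
    "/usr/share/aws/emr/emrfs/lib/*",
    "/usr/share/aws/emr/emrfs/auxlib/*"
  ).mkString(":")
@ val spark = {
    AmmoniteSparkSession.builder()
      .master("yarn")
      .config("spark.executor.extraClassPath", emrfsClasspath)
      .getOrCreate()
  }
```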

https://github.com/spark-notebook/spark-notebook/issues/368
https://forums.aws.amazon.com/thread.jspa?messageID=699917

https://github.com/alexarchambault/ammonite-spark/blob/develop/INTERNALS.md

kyprifog commented 4 years ago

@mycaule I am getting the same java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong error when trying to access S3 from spark notebook.

I think it's a Spark/Hadoop version mismatch, though.

kyprifog commented 4 years ago

Downgrading to

import $ivy.`org.apache.hadoop:hadoop-aws:2.7.5`

fixed the issue for me (using Spark 2.4.2).
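Putting the thread together, a minimal sequence that avoided both errors on a Hadoop 2.7.x cluster might look like this (versions are the ones reported in this thread, not authoritative; match hadoop-aws to your cluster's Hadoop line and add the imports before building the session):

```scala
// Sketch assembled from this thread; paths and versions are examples.
@ import $ivy.`org.apache.spark::spark-sql:2.4.2`, $ivy.`sh.almond::ammonite-spark:0.1.1`
@ import $ivy.`org.apache.hadoop:hadoop-aws:2.7.5`  // matches a Hadoop 2.7.x cluster
@ val spark = AmmoniteSparkSession.builder().master("yarn").getOrCreate()
@ spark.read.parquet("s3a://bucket/path/to/parquet")
```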