amplab / spark-ec2

Scripts used to set up a Spark cluster on EC2
Apache License 2.0

Include Python 3 in pre-baked AMI #74

Open wearpants opened 7 years ago

wearpants commented 7 years ago

This would be very, very nice in 2016/2017. Or at least, provide some instructions in the README on how to do so.

shivaram commented 7 years ago

I'm not really familiar with what needs to be done to make Spark use Python 3. cc @nchammas who might know more.

nchammas commented 7 years ago

Getting Spark to use Python 3 is generally a simple matter of setting PYSPARK_PYTHON=python3 or something similar. But this question is about including Python 3 in the pre-baked AMI, which is a separate issue.
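
For example, assuming Python 3 is already installed on every node, something along these lines in spark-env.sh (or the shell you launch from) usually does it:

```sh
# Sketch only: assumes a python3 binary already exists on every node.
# Typically set in $SPARK_HOME/conf/spark-env.sh or the launching shell.
export PYSPARK_PYTHON=python3         # interpreter used by the executors
export PYSPARK_DRIVER_PYTHON=python3  # interpreter used by the driver
```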

Unfortunately, at this point I think it's on you to use something like pssh to install Python 3 on the cluster after it's been launched. You can also create your own set of AMIs from the spark-ec2 AMIs, but it's quite a bit of work if you don't have the right toolchain already set up.
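
A rough post-launch install with pssh might look like the following; the hosts file and the python35 package name are assumptions you'd adjust for your AMI:

```sh
# hosts.txt lists one node per line (e.g. the public DNS names of master and slaves).
# -l root: log in as root; -i: print each host's output inline; -t 0: no timeout.
pssh -h hosts.txt -l root -i -t 0 'yum install -y python35'
```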

We've discussed updating the spark-ec2 AMIs for some time now, but it's a lot of manual work and previous efforts to automate the process fizzled out. It's one of the reasons why I made Flintrock. Flintrock doesn't depend on any custom AMIs, and it offers a run-command feature. So you can easily use Flintrock with your own AMI, or install Python 3 post-launch with a simple call to flintrock run-command mycluster 'sudo yum install -y python35'.

So looking at the big picture, I think it would be beneficial to all if spark-ec2 did some combination of the following:

1. Automated the process of building and updating its AMIs.
2. Decoupled itself from custom AMIs entirely.
3. Added a run-command feature like Flintrock's.

shivaram commented 7 years ago

Thanks @nchammas - the third option, adding a run-command, is probably the easiest to do, as it should be just a code change.
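
In spirit, it could be little more than a loop over the hosts spark-ec2 already knows about. A minimal sketch, assuming it runs on the master and that the slave list lives where spark-ec2 usually writes it:

```sh
#!/usr/bin/env bash
# Hypothetical run-command.sh -- not part of spark-ec2 today.
# Runs the given command on the master and on every slave over ssh.
CMD="$1"
for host in localhost $(cat /root/spark-ec2/slaves); do
  ssh -o StrictHostKeyChecking=no "root@$host" "$CMD" &
done
wait  # block until every host has finished
```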

On the question of AMIs - I have spent some time thinking about it, and it looks like there are two issues we will hit in the long run: (a) the manual labor involved in creating new AMIs, and (b) the cost overhead of storing those AMIs, especially as we have one for each region, HVM, PVM, etc.

While automation can solve part of the problem, it would still be good to get to zero effort if possible. So your second point, decoupling spark-ec2 from custom AMIs, sounds better for that. I guess the main concern then is how long it takes to install all the necessary tools. @nchammas, have you found this to not be a significant overhead in Flintrock?

nchammas commented 7 years ago

Flintrock clusters have fewer out-of-the-box tools compared to spark-ec2. We don't install Ganglia, for example. So the launch burden is lower, and it frees Flintrock to use the generic Amazon Linux AMIs.

Currently, Flintrock defaults to installing Java 8 (if it isn't already detected on the hosts) and Spark. You can flip a switch and have Flintrock also install HDFS. That's pretty much it.

The overhead of installing these three things at launch time isn't large, especially if you configure HDFS to download from S3 rather than from the Apache mirrors, which can be very slow. (Spark downloads from S3 by default.) Flintrock can generally launch 100+ node clusters in under 5 minutes.
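
For what it's worth, the HDFS download source is configurable at launch. A hedged example below, where the S3 URL is a placeholder bucket and the exact option name may differ across Flintrock versions:

```sh
# Placeholder bucket; Flintrock substitutes {v} with the requested version.
flintrock launch mycluster \
  --num-slaves 100 \
  --install-hdfs \
  --hdfs-download-source 'https://s3.amazonaws.com/my-bucket/hadoop/hadoop-{v}.tar.gz'
```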

shivaram commented 7 years ago

I see - the thing I was looking at is the old Packer scripts that are part of the issue you linked above [1, 2]. Those certainly have a large number of install steps, and I guess we'll need to benchmark things to figure out how long they will take. I wonder if we can remove support for a bunch of them and make them optional via the run-command (e.g. we have a python-setup.sh in the spark-ec2 tree, and somebody can add that to their command line if they want to use Python, etc.)

[1] https://github.com/nchammas/spark-ec2/blob/packer/image-build/tools-setup.sh#L1
[2] https://github.com/nchammas/spark-ec2/blob/packer/image-build/python-setup.sh

nchammas commented 7 years ago

> I wonder if we can remove support for a bunch of them and make them optional via the run-command (e.g. we have a python-setup.sh in the spark-ec2 tree, and somebody can add that to their command line if they want to use Python, etc.)

That sounds like a reasonable approach to me.
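
As a strawman, such an opt-in python-setup.sh could be as small as the sketch below; the package name and paths are assumptions that would need testing against the actual AMI:

```sh
#!/usr/bin/env bash
# Hypothetical python-setup.sh, run on each node via the proposed run-command.
set -e
yum install -y python35  # Amazon Linux package name; verify for the AMI in use
# Point PySpark at the new interpreter; the binary name may differ (e.g. python3.5).
echo 'export PYSPARK_PYTHON=python3.5' >> /root/spark/conf/spark-env.sh
```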