Ray Error when running ray.init() on a Jupyter Notebook with an EMR cluster attached

egelberg commented 2 years ago

My goal is to use Ray in a Jupyter notebook that has an EMR cluster attached. I created a small cluster that bootstraps the init script in this repo, but initializing Ray on it kills the interpreter. I created a PySpark notebook, where I run:

import ray
ray.init()

This then produces the following error:

An error was encountered:
Interpreter died:

[2022-10-13 20:32:04,701 E 30765 30765] core_worker.cc:137: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

cfregly commented 2 years ago

A couple of follow-up questions: 1/ Are you using EMR Studio? 2/ Which instance types? 3/ Can you share a screenshot (without sensitive info)?

egelberg commented 2 years ago

Thanks for jumping on this!

1/ Nope, just EMR notebooks on EC2. I'm trying to run this in a PySpark notebook to take advantage of the full cluster, not just the driver node.

2/ Instance types:

Master: m5.xlarge
Workers: 4x c5.xlarge

3/ Here are some more specifics:

EMR Configuration

aws emr create-cluster \
  --os-release-label 2.0.20220912.1 \
  --applications Name=Livy Name=Spark Name=JupyterEnterpriseGateway Name=Hadoop Name=JupyterHub Name=Hive Name=Pig Name=TensorFlow Name=Tez \
  --ec2-attributes '{"InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-*****************","EmrManagedSlaveSecurityGroup":"sg-*****************","EmrManagedMasterSecurityGroup":"sg-*****************"}' \
  --release-label emr-6.8.0 \
  --log-uri 's3n://aws-logs-************-us-east-1/elasticmapreduce/' \
  --instance-groups '[{"InstanceCount":4,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":64,"VolumeType":"gp2"},"VolumesPerInstance":4}]},"InstanceGroupType":"CORE","InstanceType":"c5.xlarge","Name":"Core - 2"},{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":2}]},"InstanceGroupType":"MASTER","InstanceType":"m5.xlarge","Name":"Master - 1"}]' \
  --configurations '[{"Classification":"hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}},{"Classification":"spark-hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}}]' \
  --auto-scaling-role EMR_AutoScaling_DefaultRole \
  --bootstrap-actions '[{"Path":"s3://sagemaker-us-east-1-************/egelberg/install_ray.sh","Name":"Ray install"}]' \
  --ebs-root-volume-size 10 \
  --service-role EMR_DefaultRole \
  --enable-debugging \
  --auto-termination-policy '{"IdleTimeout":9000}' \
  --name 'Ray - 4x c5.xlarge' \
  --scale-down-behavior TERMINATE_AT_TASK_COMPLETION \
  --region us-east-1

Screenshots:

Simple ray.init() call:

Trying to specify the number of CPUs after reading through this thread:

Attempting to run a script that returns the IP addresses across the cluster (this runs fine in SageMaker):
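
The screenshot for this step is not reproduced here. For readers who want to try the same check, a minimal sketch of such a script follows. It is an illustration, not the exact code from the screenshot: the helper names `summarize_ips` and `collect_node_ips` are hypothetical, and connecting with `address="auto"` assumes a Ray cluster is already running and reachable from the notebook process.

```python
import time
from collections import Counter


def summarize_ips(ips):
    """Count how many tasks reported each IP address."""
    return dict(Counter(ips))


def collect_node_ips(num_tasks=100):
    """Run many tiny Ray tasks and tally the node IPs that executed them."""
    # Imported here so summarize_ips can be used without Ray installed.
    import ray

    # Assumption: a Ray cluster is already up; "auto" attaches to it
    # instead of starting a new local instance.
    ray.init(address="auto", ignore_reinit_error=True)

    @ray.remote
    def node_ip():
        # Short pause so the scheduler spreads tasks across workers.
        time.sleep(0.01)
        return ray.util.get_node_ip_address()

    ips = ray.get([node_ip.remote() for _ in range(num_tasks)])
    return summarize_ips(ips)
```

Calling `collect_node_ips()` in a notebook cell should return a dictionary mapping each node IP to the number of tasks it ran. If only one IP shows up, the tasks are not reaching the worker nodes.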

aws-samples / aws-samples-for-ray