dask / dask-yarn

Deploy dask on YARN clusters
http://yarn.dask.org
BSD 3-Clause "New" or "Revised" License

EMR bootstrap script fails #122

Open ResidentMario opened 4 years ago

ResidentMario commented 4 years ago

The EMR bootstrap script currently fails with the following error (found via stderr logs):

+ sudo mv /tmp/jupyter-notebook.conf /etc/init/
mv: cannot create regular file ‘/etc/init/’: Not a directory

mrocklin commented 4 years ago

Thank you for the error report, @ResidentMario. My apologies for the delayed response. The folks who maintain this repository have been busy lately.

Do you have any interest in submitting a patch to resolve this issue?

ResidentMario commented 4 years ago

I might be able to look into it, but no promises.

nmerket commented 4 years ago

I am trying to debug some other issues with this and found that using the EMR release emr-5.29.0 instead of emr-5.30.1 resolves the problem. It looks like something in the new image is causing it. Thought that bit of intel might help.

hegde-anish commented 4 years ago

Apparently, from emr-5.30 onwards, EMR only supports systemd and no longer supports upstart.
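
For anyone who needs one bootstrap script to run on both older and newer EMR releases, here is a minimal sketch of how the init system could be detected before installing the service definition. It assumes that a `/run/systemd/system` directory indicates systemd and that an `/etc/init` directory indicates upstart. The helper names `detect_init_system` and `service_dir_for` are hypothetical and not part of the dask-yarn bootstrap script.

```shell
# Hypothetical helper: prints "systemd", "upstart", or "unknown" depending on
# which directories exist under the given root (defaults to the real /).
detect_init_system() {
    root="${1:-}"
    if [ -d "$root/run/systemd/system" ]; then
        echo systemd
    elif [ -d "$root/etc/init" ]; then
        echo upstart
    else
        echo unknown
    fi
}

# Hypothetical helper: maps an init system name to the directory where the
# Jupyter service definition would be installed.
service_dir_for() {
    case "$1" in
        systemd) echo /etc/systemd/system ;;
        upstart) echo /etc/init ;;
        *)       return 1 ;;
    esac
}

echo "Detected init system: $(detect_init_system)"
```

Checking for systemd first matters because some systemd hosts still ship an `/etc/init` directory.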

datafuz commented 4 years ago

I think it has to do with Amazon Linux 2:

Amazon Linux 2 support – In EMR version 5.30.0 and later, EMR uses Amazon Linux 2 OS. New custom AMIs (Amazon Machine Image) must be based on the Amazon Linux 2 AMI. For more information, see Using a Custom AMI.

hamzahiqb commented 4 years ago

I tried to make it work with systemd and updated it with the following:

# -----------------------------------------------------------------------------
# 10. Configure Jupyter Notebook
# -----------------------------------------------------------------------------
echo "Configuring Jupyter"
mkdir -p $HOME/.jupyter
HASHED_PASSWORD=`python -c "from notebook.auth import passwd; print(passwd('$JUPYTER_PASSWORD'))"`
cat <<EOF >> $HOME/.jupyter/jupyter_notebook_config.py
c.NotebookApp.password = u'$HASHED_PASSWORD'
c.NotebookApp.open_browser = False
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.port = 8889
EOF

# # -----------------------------------------------------------------------------
# # 11. Define an upstart service for the Jupyter Notebook Server
# #
# # This sets the notebook server up to properly run as a background service.
# # -----------------------------------------------------------------------------
echo "Configuring Jupyter Notebook Upstart Service"
cat <<EOF > /tmp/jupyter-notebook.service
[Unit]
Description=Jupyter Notebook

[Service]
ExecStart=$HOME/miniconda/bin/jupyter-notebook --config=$HOME/.jupyter/jupyter_notebook_config.py
Type=simple
PIDFile=/run/jupyter.pid
WorkingDirectory=$HOME
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF
sudo mv /tmp/jupyter-notebook.service /etc/init.d/
sudo systemctl enable /etc/init.d/jupyter-notebook.service

# # -----------------------------------------------------------------------------
# # 12. Start the Jupyter Notebook Server
# # -----------------------------------------------------------------------------
# echo "Starting Jupyter Notebook Server"

sudo systemctl daemon-reload
sudo systemctl restart jupyter-notebook.service

Note: I added a port for the notebook.

This runs on bootstrap but there is nothing on port 8889. When I run the ExecStart command manually, via ssh, the notebook opens. So not sure what I'm doing wrong. I also get the following problem: https://github.com/dask/dask-yarn/issues/124

Sources for the new script:

  1. https://gist.github.com/klingtnet/76c542613e544a13bb7ad741b53f1f73
  2. https://medium.com/@joelzhang/setting-up-jupyter-notebook-server-as-service-in-ubuntu-16-04-116cf8e84781

EMR version: 5.31.0, Hadoop distribution: Amazon 2.10.0, Python: 3.7.9

hegde-anish commented 4 years ago

Hi @hiqbal2, your script for systemd was super helpful. I got it to work by making a few changes to it:

  1. ExecStart=$HOME/miniconda/bin/jupyter-notebook --allow-root --config=$HOME/.jupyter/jupyter_notebook_config.py
  2. sudo mv /tmp/jupyter-notebook.service /etc/systemd/system/
  3. sudo systemctl enable jupyter-notebook.service

I hope this helps

hamzahiqb commented 4 years ago

@hegde-anish thanks for the help. EMR seems to bootstrap properly now. However, not sure if you got this error when trying to start a dask cluster:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-762253d83df2> in <module>
      1 # Create a cluster
----> 2 cluster = YarnCluster()
      3 
      4 # Connect to the cluster
      5 client = Client(cluster)

/home/hadoop/miniconda/lib/python3.8/site-packages/dask_yarn/core.py in __init__(self, environment, n_workers, worker_vcores, worker_memory, worker_restarts, worker_env, scheduler_vcores, scheduler_memory, deploy_mode, name, queue, tags, user, host, port, dashboard_address, skein_client, asynchronous, loop)
    366         loop=None,
    367     ):
--> 368         spec = _make_specification(
    369             environment=environment,
    370             n_workers=n_workers,

/home/hadoop/miniconda/lib/python3.8/site-packages/dask_yarn/core.py in _make_specification(**kwargs)
    184             "See http://yarn.dask.org/environments.html for more information."
    185         )
--> 186         raise ValueError(msg)
    187 
    188     n_workers = lookup(kwargs, "n_workers", "yarn.worker.count")

ValueError: You must provide a path to a Python environment for the workers.
This may be one of the following:
- A conda environment archived with conda-pack
- A virtual environment archived with venv-pack
- A path to a conda environment, specified as conda://...
- A path to a virtual environment, specified as venv://...
- A path to a python binary to use, specified as python://...

See http://yarn.dask.org/environments.html for more information.

Not sure why this is happening.

I am also not sure if there is a difference in behaviour between just calling $HOME/miniconda/bin/jupyter-notebook and the original script's exec su - hadoop -c "jupyter notebook". When I try the old command, I get an error that hadoop -c does not exist.

I don't have any experience with hadoop or dask, so am a little lost on debugging this.
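
For readers hitting the same ValueError: YarnCluster() raises it when no worker environment is passed as an argument or found in the dask configuration. One possible cause, an assumption not confirmed in this thread, is that the bootstrap step that packs the conda environment and records its path never completed. The sketch below writes a dask config file that sets the `yarn.environment` key. The function name `write_dask_yarn_config` and the archive path are hypothetical.

```shell
# Hypothetical helper: writes a dask config file that points dask-yarn at a
# packed worker environment.
#   $1 = dask config directory (dask reads ~/.config/dask by default)
#   $2 = path to the environment archive (or a conda://, venv://, python:// spec)
write_dask_yarn_config() {
    config_dir="$1"
    env_archive="$2"
    mkdir -p "$config_dir"
    cat > "$config_dir/config.yaml" <<EOF
yarn:
  environment: $env_archive
EOF
}

# Example usage on the EMR master node (paths are assumptions):
#   conda pack -q -o "$HOME/environment.tar.gz"
#   write_dask_yarn_config "$HOME/.config/dask" "$HOME/environment.tar.gz"
```

Note that the helper overwrites any existing config.yaml in that directory, so merge by hand if the file already holds other settings.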

kqshan commented 3 years ago

This modified bootstrap script worked for me, with a few additional fixes:

To fix the latter two issues, I added unalias commands to ~/.bashrc before sourcing it, which feels like a bit of a hack:

# -----------------------------------------------------------------------------
# 2. Install Miniconda
# -----------------------------------------------------------------------------
echo "Installing Miniconda"
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o /tmp/miniconda.sh
bash /tmp/miniconda.sh -b -p $HOME/miniconda
rm /tmp/miniconda.sh
echo -e 'unalias python || true' >> $HOME/.bashrc
echo -e 'unalias pip || true' >> $HOME/.bashrc
echo -e '\nexport PATH=$HOME/miniconda/bin:$PATH' >> $HOME/.bashrc
source $HOME/.bashrc
conda update conda -y

and I specified a User in the systemd [Service] section (which also let me remove the --allow-root flag that @hegde-anish suggested). I also had to export the JAVA_HOME environment variable:

# -----------------------------------------------------------------------------
# 11. Define an upstart service for the Jupyter Notebook Server
#
# This sets the notebook server up to properly run as a background service.
# -----------------------------------------------------------------------------
echo "Configuring Jupyter Notebook Upstart Service"
cat <<EOF > /tmp/jupyter-notebook.service
[Unit]
Description=Jupyter Notebook

[Service]
User=hadoop
ExecStart=$HOME/miniconda/bin/jupyter-notebook --config=$HOME/.jupyter/jupyter_notebook_config.py
Environment=JAVA_HOME=$JAVA_HOME
Type=simple
PIDFile=/run/jupyter.pid
WorkingDirectory=$HOME
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF
sudo mv /tmp/jupyter-notebook.service /etc/systemd/system/
sudo systemctl enable jupyter-notebook

# -----------------------------------------------------------------------------
# 12. Start the Jupyter Notebook Server
# -----------------------------------------------------------------------------
echo "Starting Jupyter Notebook Server"
sudo systemctl daemon-reload
sudo systemctl start jupyter-notebook

EMR version: 5.32.0, Hadoop distribution: Amazon 2.10.1, Python: 3.7.6

hamzahiqb commented 3 years ago

The above worked for me. However, the jupyter notebook now just does not output any values. I tried to start the notebook via ssh and got the following error when trying to do a simple 2+2:

[E 12:12:00.355 NotebookApp] Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7fc061beb4d0>, <Future finished exception=TimeoutError('Timeout')>)
    Traceback (most recent call last):
      File "/home/hadoop/miniconda/lib/python3.7/site-packages/tornado/ioloop.py", line 758, in _run_callback
        ret = callback()
      File "/home/hadoop/miniconda/lib/python3.7/site-packages/tornado/stack_context.py", line 300, in null_wrapper
        return fn(*args, **kwargs)
      File "/home/hadoop/miniconda/lib/python3.7/site-packages/tornado/websocket.py", line 553, in <lambda>
        self.stream.io_loop.add_future(result, lambda f: f.result())
    tornado.util.TimeoutError: Timeout
ERROR:asyncio:Future exception was never retrieved
future: <Future finished exception=TimeoutError('Timeout')>
Traceback (most recent call last):
  File "/home/hadoop/miniconda/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/hadoop/miniconda/lib/python3.7/site-packages/tornado/websocket.py", line 757, in _accept_connection
    yield open_result
  File "/home/hadoop/miniconda/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/home/hadoop/miniconda/lib/python3.7/site-packages/tornado/ioloop.py", line 758, in _run_callback
    ret = callback()
  File "/home/hadoop/miniconda/lib/python3.7/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/hadoop/miniconda/lib/python3.7/site-packages/tornado/websocket.py", line 553, in <lambda>
    self.stream.io_loop.add_future(result, lambda f: f.result())
tornado.util.TimeoutError: Timeout

davegravy commented 3 years ago

@kqshan this is great, thanks.

I didn't find that I needed to unalias; after the bootstrap I had proper pointers to miniconda's python/pip. I'm running a newer EMR (emr-6.2.0), so this may be a factor.

I removed the version pin for tornado as well.

The conda pack issue appears to stem from this conda issue. I added --ignore-missing-files and that resolved it, although I don't know whether I'll hit environment synchronization issues with my workers as a result (I haven't gotten that far in testing yet).

Also, the version spec for dask-yarn causes a file called "=0.7.0" to be written to the home folder. Some escaping or quoting is likely necessary to fix this, but I just removed the version specification because conda installed 0.8.1 on its own.
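
The stray file is consistent with the shell treating an unquoted `>` in a version spec as an output redirection. The small demonstration below assumes the install line passes the spec unquoted, which this thread does not confirm. It uses `echo` in place of `conda install` and runs in a throwaway directory.

```shell
# Work in a temporary directory so the demonstration leaves no stray files behind.
demo_dir=$(mktemp -d)

# Unquoted: the shell parses ">" as a redirection, so the command receives
# only "dask-yarn" and its output lands in a file named "=0.7.0".
(cd "$demo_dir" && echo dask-yarn>=0.7.0)

# Quoted: the whole spec reaches the command as a single argument.
quoted_spec=$(cd "$demo_dir" && echo "dask-yarn>=0.7.0")

ls "$demo_dir"
echo "$quoted_spec"
```

Quoting the spec in the bootstrap script, for example "dask-yarn>=0.7.0", should keep the version constraint and avoid creating the file.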

tjburrows commented 3 years ago

I also ran into this issue when trying the bootstrap script. @davegravy did you test your script? Do you have a version that works with emr-6.2.0?

quasiben commented 3 years ago

What version of conda-pack is used? I believe 0.6 was released a month ago.

davegravy commented 3 years ago

> I also ran into this issue when trying the bootstrap script. @davegravy did you test your script? Do you have a version that works with emr-6.2.0?

My script works with emr-6.2.0, yes. I've had no issues with dask & EMR, at least not that I can attribute to the bootstrap.

tjburrows commented 3 years ago

> > I also ran into this issue when trying the bootstrap script. @davegravy did you test your script? Do you have a version that works with emr-6.2.0?

> My script works with emr-6.2.0, yes. I've had no issues with dask & EMR, at least not that I can attribute to the bootstrap.

Can you share it?

davegravy commented 3 years ago

> > > I also ran into this issue when trying the bootstrap script. @davegravy did you test your script? Do you have a version that works with emr-6.2.0?

> > My script works with emr-6.2.0, yes. I've had no issues with dask & EMR, at least not that I can attribute to the bootstrap.

> Can you share it?

Sure:

https://gist.github.com/davegravy/61e3abb81176f4490032554b70d28c31

gabriel131188 commented 3 years ago

Hello, I tried installing dask with many versions of the bootstrap script and of EMR, but nothing works. If possible, please share which EMR version and dask bootstrap script you used. Thanks.

@davegravy, in your bootstrap script, line 125 is censored: "Downloading pyquis step".

davegravy commented 3 years ago

> Hello, I tried installing dask with many versions of the bootstrap script and of EMR, but nothing works. If possible, please share which EMR version and dask bootstrap script you used. Thanks.

Hi I was using EMR 6.2.0.

> @davegravy, in your bootstrap script, line 125 is censored: "Downloading pyquis step".

This is a private python library my bootstrap script installs. It shouldn't have any bearing on the bootstrap's ability to succeed.