desihub / desidocker

Making DESI data freely and easily accessible with AWS and Docker
BSD 3-Clause "New" or "Revised" License

Integrated Docker environment for accessing cloud- and locally-hosted DESI data

Xing Liu (UC Berkeley) and Anthony Kremin (Berkeley Lab), June 2024

DESI's early data release (EDR) is available to the public, free of charge, at the desidata S3 cloud storage "bucket" on Amazon Web Services (AWS).
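As a quick way to confirm that the bucket is reachable, the sketch below lists the top level of a release with the AWS CLI. It assumes the bucket permits anonymous reads (hence `--no-sign-request`) and that releases sit under a prefix named after the release, such as `edr/`; both are assumptions, not documented guarantees. The function names `desi_release_uri` and `list_desi_release` are hypothetical helpers.

```shell
# Build the S3 URI for a release.
# Assumption: releases are laid out as s3://desidata/<release>/.
desi_release_uri() {
  printf 's3://desidata/%s/\n' "${1:-edr}"
}

# List the top level of a release without AWS credentials.
# Assumption: the bucket allows anonymous (unsigned) requests.
list_desi_release() {
  aws s3 ls --no-sign-request "$(desi_release_uri "${1:-edr}")"
}
```

With the AWS CLI installed, `list_desi_release edr` would print the objects and prefixes found at that location.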

Here, we provide a Docker image which makes it easy to work with both local and cloud-hosted DESI data. Our Docker image is a self-contained Linux environment which comes pre-packaged with

Most DESI code developed for NERSC can run on this Docker image with little to no modifications.

Available options

You are free to choose a combination of local/cloud-hosted databases and local/cloud-hosted programming environments to suit your workflow.

If your DESI data is hosted locally, or if you want to stream the S3 DESI data to process locally, then please follow the instructions at Running the Docker image locally. We emphasize that local data processing is only practical for those with high-performance computers. Due to the high resolution of DESI data, you should only run the image locally if your computer has at least 16 GB of memory (24 GB recommended).
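On a Linux machine, a rough check against the 16 GB threshold could look like the sketch below. It reads `MemTotal` from `/proc/meminfo`, which is Linux-specific, and rounds to the nearest GiB because the kernel reports slightly less than the installed memory. The helper name `kib_to_gib` is hypothetical.

```shell
# Convert KiB (the unit used in /proc/meminfo) to GiB, rounded to nearest.
kib_to_gib() {
  echo $(( ($1 + 524288) / 1048576 ))
}

# Linux only: /proc/meminfo does not exist on macOS or Windows.
if [ -r /proc/meminfo ]; then
  total_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  total_gib=$(kib_to_gib "$total_kib")
  if [ "$total_gib" -ge 16 ]; then
    echo "OK: about ${total_gib} GiB of memory"
  else
    echo "Warning: about ${total_gib} GiB of memory; 16 GB is the stated minimum"
  fi
fi
```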

Otherwise, we recommend running the Docker image at your institution's computing center or on a commercial cloud computing service such as Amazon Elastic Compute Cloud (EC2). A cloud compute instance gives you on-demand access to additional storage and processing power. EC2, in particular, has a very high-bandwidth internal network connection to AWS S3. If you are interested, please follow the instructions for Running the Docker image on an AWS EC2 cloud compute instance.

Running the Docker image locally

System requirements

Step 1. Installing Docker

We will be using Docker Engine, Docker's command-line tool.

Step 2. Running the image

Open your computer terminal, and navigate to the folder you use as your workspace for DESI.

If your DESI data is locally hosted at local_data_path, then enter this command:

docker run -it -p 8888:8888 -e DESI_RELEASE=edr \
  --volume "$(pwd):/home/synced" \
  --volume "local_data_path:/home/desidata:ro" \
  ghcr.io/desihub/desidocker:main

Otherwise, to access the DESI data hosted at AWS S3, enter this command instead:

docker run -it -p 8888:8888 -e DESI_RELEASE=edr \
  --volume "$(pwd):/home/synced" \
  --cap-add SYS_ADMIN --device /dev/fuse --security-opt apparmor:unconfined \
  ghcr.io/desihub/desidocker:main
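The two commands differ only in how the DESI data reaches the container: a read-only bind mount for local data, or the FUSE-related flags for S3 access. A minimal sketch of a wrapper that picks between them is shown below. The function name `run_desidocker` and the `DOCKER` override variable are hypothetical conveniences, not part of the image.

```shell
# Hypothetical wrapper around the two documented commands.
# $1 (optional): path to a local DESI data directory.
# With no argument, the S3 flags are used instead.
run_desidocker() {
  if [ -n "${1:-}" ]; then
    # Local data: mount it read-only inside the container.
    set -- --volume "$1:/home/desidata:ro"
  else
    # S3 data: pass the FUSE-related flags from the documented command.
    set -- --cap-add SYS_ADMIN --device /dev/fuse --security-opt apparmor:unconfined
  fi
  # DOCKER is a hypothetical override, e.g. DOCKER=echo for a dry run.
  ${DOCKER:-docker} run -it -p 8888:8888 -e DESI_RELEASE=edr \
    --volume "$(pwd):/home/synced" "$@" \
    ghcr.io/desihub/desidocker:main
}
```

Calling `DOCKER=echo run_desidocker /path/to/data` prints the arguments that would be passed to Docker without starting a container.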

Once the image starts running, locate the line beginning with http://127.0.0.1:8888/lab?token=... in the output, and open the address in your browser.
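If the address has scrolled out of view, it can be recovered from the container's log. The sketch below assumes the log contains the address in the form quoted above; `extract_jupyter_url` is a hypothetical helper that reads log text on standard input and prints the first matching address.

```shell
# Print the first Jupyter address of the documented form found on stdin.
# Assumption: the address appears as http://127.0.0.1:8888/lab?token=<token>.
extract_jupyter_url() {
  sed -n 's#.*\(http://127\.0\.0\.1:8888/lab?token=[^[:space:]]*\).*#\1#p' | head -n 1
}
```

For a running container, `docker logs CONTAINER 2>&1 | extract_jupyter_url` would print the address, where `CONTAINER` is the name or ID shown by `docker ps`.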

Running the Docker image on AWS EC2

Step 1. Creating an account

While you do not need an AWS account to access the DESI data locally, you do have to make one in order to use the AWS EC2 service. Follow the official instructions for First time users of AWS to get started. Once you’ve signed into your account, we recommend switching your region to us-west-2 (Oregon), as that is the region of our S3 bucket. Then, you can navigate to Services » EC2 to set up a cloud compute instance.

Step 2. Creating a security group

To access the Jupyter web server provided by our Docker image, first we need to create a security group which allows HTTPS network access.

Navigate to Services » EC2 » Security groups, then click Create security group. Fill in the following fields:

  1. Basic details: Name the security group jupyter.
  2. Inbound rules: Add the following rules:

| Type | Protocol | Port range | Source type | Source | Description |
|---|---|---|---|---|---|
| Custom TCP | (TCP) | 8888 | My IP | (Your IP) | Open TCP port for Jupyter server |
| HTTPS | (TCP) | (443) | My IP | (Your IP) | Allow HTTPS for Jupyter server |
| SSH | (TCP) | (22) | My IP | (Your IP) | Allow SSH access to the instance |

  3. Outbound rules: Add the following rule (if it isn't already there):

| Type | Protocol | Port range | Destination type | Destination | Description |
|---|---|---|---|---|---|
| All traffic | (All) | (All) | Anywhere-IPv4 | (0.0.0.0/0) | Allow instance to access the whole internet |

Then click Create security group.
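The same inbound rules can also be created from a terminal with the AWS CLI. The sketch below assumes the CLI is installed and configured with credentials, and that the group is created in your default VPC (other VPCs require `--group-id` instead of `--group-name`). The helper names `to_cidr` and `create_jupyter_group` are hypothetical, and the function is only defined here, not run.

```shell
# Hypothetical helper: turn one IPv4 address into a single-host CIDR block.
to_cidr() {
  printf '%s/32\n' "$1"
}

# Sketch only: create the jupyter group and open ports 8888, 443 and 22
# to your current public IP. Assumes the default VPC.
create_jupyter_group() {
  my_ip=$(curl -s https://checkip.amazonaws.com)
  cidr=$(to_cidr "$my_ip")
  aws ec2 create-security-group --group-name jupyter \
    --description "Jupyter server access" || return 1
  for port in 8888 443 22; do
    aws ec2 authorize-security-group-ingress --group-name jupyter \
      --protocol tcp --port "$port" --cidr "$cidr" || return 1
  done
}
```

Running `create_jupyter_group` once should be enough. If your public IP changes later, the inbound rules need to be updated to match.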

Step 3. Launching an instance

Navigate to Services » EC2 » Instances, then click Launch instances. Fill in the following fields —

  1. Name and tags: Pick your own.
  2. Application and OS Images (Amazon Machine Image): We recommend selecting Amazon Linux, although Ubuntu and other Linux distributions should also work.
  3. Instance type: We recommend starting with t3.xlarge or t3.2xlarge, due to the memory-intensive nature of processing DESI data. Upgrade to a larger instance type if you need more processing power or memory.
  4. Key pair: Create your own and save the private key file.
  5. Network settings: Select the jupyter security group we created earlier.
  6. Configure storage: For free-tier accounts, we recommend the maximum available 30 GiB. There can be a lot of locally cached DESI data!

Then click Launch instance. After the instance has loaded, follow the official instructions to Connect to your instance.
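For reference, connecting over SSH from a terminal typically looks like the sketch below. It assumes an Amazon Linux instance, whose default user is ec2-user, and a private key file saved from the key pair step. The function name `connect_to_instance` and the `SSH` override variable are hypothetical.

```shell
# Hypothetical helper: connect to an Amazon Linux instance over SSH.
# $1: private key file, $2: the instance's public DNS name or IP address.
connect_to_instance() {
  # ssh refuses private keys that other users can read.
  chmod 400 "$1" || return 1
  # SSH is a hypothetical override, e.g. SSH=echo for a dry run.
  ${SSH:-ssh} -i "$1" "ec2-user@$2"
}
```

The public DNS name or IP address of the instance is shown on its details page in the EC2 console.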

Step 4. Installing Docker on the instance

Run the following lines to install Git and Docker on Amazon Linux, which uses the yum package management system.

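# Install Git and Docker (-y answers yum's confirmation prompts)
sudo yum update -y
sudo yum install -y git docker
# Start the Docker daemon automatically on boot
sudo systemctl enable docker.service
# Add ec2-user to the docker group so Docker can run without sudo
sudo usermod -a -G docker ec2-user
# Confirm that "docker" now appears in the group list
id ec2-user
# Open a new shell in which the group change is active
newgrp docker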

If you are using a different Linux distribution on your instance, refer to the official instructions to install Docker Engine for Linux instead.

Step 5. Running the image

Run this command to start Docker:

sudo systemctl start docker.service

Finally, run this shell command to download and run the image.

docker run -it -p 8888:8888 -e DESI_RELEASE=edr \
  -e PUBLIC_IP=$(curl -s https://checkip.amazonaws.com) \
  --volume "$(pwd):/home/synced" \
  --cap-add SYS_ADMIN --device /dev/fuse --security-opt apparmor:unconfined \
  ghcr.io/desihub/desidocker:main

Once the image starts running, locate the line beginning with http://...:8888/lab?token=... in the output, and open the address in your browser.

Customizations

Updating the Docker image

To update your Docker image, run

docker pull ghcr.io/desihub/desidocker:main

Maintaining this project

See maintainance.md.