cloudyr / googleComputeEngineR

An R interface to the Google Cloud Compute API, for launching virtual machines
https://cloudyr.github.io/googleComputeEngineR/

Update rstudio-gpu image #179

Open anirban-mukherjee opened 3 years ago

anirban-mukherjee commented 3 years ago

Describe the bug In gce_vm, the option 'template = "rstudio-gpu"' launches an instance using the deprecated rocker/ml-gpu image. While this image still works, all libraries and software on the instance are outdated and the image has been officially deprecated. Please see: https://hub.docker.com/r/rocker/ml-gpu.

The current image is here: https://hub.docker.com/r/rocker/ml. Would it be possible to update googleComputeEngineR to use the new image? As far as I can tell, rocker/ml comes with all the GPU libraries and should be a drop-in replacement.

To Reproduce Launch an instance using gce_vm and 'template = "rstudio-gpu"'.

Expected behavior An instance with recent R and TensorFlow libraries.
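
In the meantime, a possible workaround may be to point the template at the maintained image explicitly via the dynamic_image argument. Something along these lines (untested on my side, and assuming dynamic_image is honoured for the GPU template):

vm <- gce_vm(
  template      = "rstudio-gpu",
  name          = "rstudio-gpu-test",
  dynamic_image = "rocker/ml",   # override the default, deprecated rocker/ml-gpu
  username      = "user",
  password      = "password"
)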

MarkEdmondson1234 commented 3 years ago

Yes I will look at updating these, and see if there is a way to always use the latest version.

anirban-mukherjee commented 3 years ago

Great, thanks so much! AFAIK, if googleComputeEngineR uses the names listed at https://github.com/rocker-org/rocker-versioned2, it should always get the latest version. I think the deprecation of ml-gpu was a one-off and occurred at the same time as the move to a new build system and R 4.0.0.

MarkEdmondson1234 commented 3 years ago

Can you try it now with the github version?

anirban-mukherjee commented 3 years ago

Thanks for the quick response! The instance seems to launch fine. I see it in gce_list_instances and in the Google Cloud console, and I can connect to it over SSH from the console. gcer_docker_image looks correct and install-nvidia-driver is True. Both googleComputeEngineR and the console report the same external IP, and the firewall rules look fine.

But when I go to the external IP in a browser, I don't see a login screen. I do see the RStudio icon in the URL bar, but I cannot connect to RStudio Server (Safari says "failed to open page" and I get a blank screen -- no username/password boxes).

Any ideas on what might be going wrong? On my end, I set up GCE_AUTH_FILE, GCE_DEFAULT_PROJECT_ID, and GCE_DEFAULT_ZONE before calling gce_vm. Below is the command I used; it works fine with template = "rstudio".

vm <- gce_vm(
  template = "rstudio-gpu",
  name = "rstudio-server",
  disk_size_gb = 1000,
  predefined_type = "n1-highmem-16",
  username = "bleh",
  password = "bleh",
  acceleratorCount = 1,
  acceleratorType = "nvidia-tesla-p100"
)

Many thanks and sorry for the bother!

MarkEdmondson1234 commented 3 years ago

I'll have to review it, perhaps changing the image means some configuration has also changed.
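
In the meantime, the startup logs should show where the container launch gets stuck. Something like this should stream them (gce_get_instance just fetches the existing VM by name, and 'shell' tails the startup-script output):

vm <- gce_get_instance("rstudio-server")
gce_startup_logs(vm, "shell")   # follow the startup script / serial output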

anirban-mukherjee commented 3 years ago

sudo apt install (for anything) gives me:

E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?

pgrep -a apt gives me:

apt -y upgrade

I am not sure what is running apt -y upgrade in the startup script (I didn't run it), but either the command has failed and left apt locked, or it is still running. apt list --upgradable gives me:

Listing... Done
docker-ce/stretch 5:19.03.15~3-0~debian-stretch amd64 [upgradable from: 18.06.1~ce~3-0~debian]
google-cloud-packages-archive-keyring/google-cloud-packages-archive-keyring-stretch 1.2-391078977 all [upgradable from: 1.2-1]
google-cloud-sdk/cloud-sdk-stretch 353.0.0-0 all [upgradable from: 216.0.0-0]
google-compute-engine/google-compute-engine-stretch-stable 1:20210629.00-g1 all [upgradable from: 2.8.4-1]
google-compute-engine-oslogin/google-compute-engine-stretch-stable 1:20210728.00-g1+deb9 amd64 [upgradable from: 1.3.1-1+deb9]
kubectl/cloud-sdk-stretch,kubernetes-xenial 1.22.0-00 amd64 [upgradable from: 1.11.3-00]
nvidia-container-runtime/stretch 3.5.0-1 amd64 [upgradable from: 2.0.0+docker18.06.1-1]
nvidia-docker2/stretch 2.6.0-1 all [upgradable from: 2.0.3+docker18.06.1-1]
python-google-compute-engine/google-compute-engine-stretch-stable 1:20191210.00-g1 all [upgradable from: 2.8.4-1]
python3-google-compute-engine/google-compute-engine-stretch-stable 1:20191210.00-g1 all [upgradable from: 2.8.4-1]

That seems like a lot of updates, but it's been an hour and I would have expected it to complete by now. I will leave the instance running and let you know if I am able to access it (in which case it's just a matter of waiting for apt upgrade to finish) or if it's still inaccessible over HTTP.
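
For reference, the same checks can also be run from R over SSH (assuming SSH keys are set up for the instance), e.g.:

vm <- gce_get_instance("rstudio-server")
gce_ssh(vm, "pgrep -a apt")        # is the upgrade still running?
gce_ssh(vm, "sudo docker ps -a")   # has the RStudio container been created yet?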

MarkEdmondson1234 commented 3 years ago

It could very well take longer than an hour if the VM is a small one with limited CPU. The startup scripts may well run updates, and I guess they need to pull in more up-to-date installs.

anirban-mukherjee commented 3 years ago

I don't think that's it. I waited well over an hour and still no dice. In that period, CPU utilisation, network traffic, and disk IO were all basically zero -- there was no sign that the instance was doing anything. Also, this is an 8-core (16 vCPU) Skylake server with 60 GB of RAM, so it should not take that long.

MarkEdmondson1234 commented 3 years ago

Thanks for the info - I will keep an eye on it.

anirban-mukherjee commented 3 years ago

I poked around some more. I don't think the issue is with the rocker ml image. Looking at the description on the rocker webpage, the image appears to be kept up to date. I think the Docker Hub description is simply out of date (it says the image has CUDA 10.1, for example), which led me to believe the rocker image was outdated, but it doesn't make sense for those components to actually be outdated. I have deleted the inaccurate parts of my comments above.

I looked at the googleComputeEngineR code (gpus.r) and inside gce_vm_gpu I see:

  if(is.null(dots$image_family)){
    dots$image_family <- "tf-latest-cu92"
  }

Do you think that might be the problem? Running the three commands below shows that the creation date for tf-latest-cu92 is in 2018. tf2-2-6-cu110 is the latest image (the list is available at https://cloud.google.com/deep-learning-vm/docs/images), and the latest image can always be obtained via the tf2-ent-latest-gpu family.

gce_get_image_family("deeplearning-platform-release", "tf-latest-cu92")
gce_get_image_family("deeplearning-platform-release", "tf2-2-6-cu110")
gce_get_image_family("deeplearning-platform-release", "tf2-ent-latest-gpu")
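
If it helps, the returned image object also carries the build date, e.g. (creationTimestamp is the field name in the Compute Engine image resource, so treat this as a sketch):

img <- gce_get_image_family("deeplearning-platform-release", "tf-latest-cu92")
img$name               # the concrete image the family currently points at
img$creationTimestamp  # when that image was built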

I tried passing image_family to gce_vm_gpu with:

vm <- gce_vm_gpu(
  template = "rstudio-gpu",
  name = "rstudio-server-gpu",
  disk_size_gb = 1000,
  predefined_type = "n1-standard-16",
  username = "###",
  password = "###",
  image_family = "tf2-ent-latest-gpu",
  image_project = "deeplearning-platform-release",
  acceleratorType = "nvidia-tesla-p100"
)

but I got:

2021-08-19 00:26:47> Launching VM with GPU support. If using docker_cmd() functions make sure to include nvidia=TRUE parameter
2021-08-19 00:26:48> Creating template VM
2021-08-19 00:26:48> Launching VM with GPU support. If using docker_cmd() functions make sure to include nvidia=TRUE parameter
ℹ 2021-08-19 00:26:53 > Request Status Code:  400
Error: API returned: Invalid value for field 'resource.metadata': '{  "item": [{    "key": "startup-script",    "value": "#!/bin/bash\necho \"Docker RStudio GPU launch...'. Metadata has duplicate keys: install-nvidia-driver

Hence, I have not been able to test whether changing the image_family fixes things. Fingers crossed that this is the issue, though.

MarkEdmondson1234 commented 3 years ago

Yes that may be it, I'll update it and test myself too. Thanks for info.

MarkEdmondson1234 commented 3 years ago

I bumped up the image_family; could you try it?

anirban-mukherjee commented 3 years ago

Ok. Will let you know.

I think we also need the same change in gce_vm_template in templates.r. I caught this earlier when playing with the source code locally.

 # hack for gpu until nvidia-docker is supported on cos-cloud images
  if(grepl("gpu$", template)){
    # setup GPU specific options
    dots            <- set_gpu_template(dots)
    ss_file         <- get_template_file(template, "startupscripts")
    startup_script  <- readChar(ss_file, nchars = file.info(ss_file)$size)
    cloud_init_file <- NULL
    image_family    <- "tf2-ent-latest-gpu"
    image_project   <- "deeplearning-platform-release"
  } else {
    # creates cloud-config file that will call the startup script
    cloud_init_file <- read_cloud_init_file(template)
    startup_script  <- NULL
    image_project   <-  "cos-cloud"
  }

anirban-mukherjee commented 3 years ago

Unfortunately, I think making those two changes is not sufficient. The next problem is that the image from Google is Debian, while Rocker is now based on Ubuntu. In particular, the project home page says: "Images are now based on Ubuntu LTS releases rather than Debian and system libraries are tied to the Ubuntu version. Images will use the most recent LTS available at the time when the corresponding R version was released. Thus all 4.0.0 images are based on Ubuntu 20.04." This would bork the startup scripts for obvious reasons. I don't see Ubuntu 20.04 images in the Google repository; I do see Ubuntu 18.04 (in particular, tf2-ent-2-6-cpu-v20210818-ubuntu-1804), but that's asking for trouble downstream if the setup script targets Ubuntu 20.04 and one builds off Ubuntu 18.04.

MarkEdmondson1234 commented 3 years ago

The latest commit catches that

anirban-mukherjee commented 3 years ago

TL;DR: things somewhat work, but require a bit of elbow grease, patience, and a tolerance for noise.

Running this command

vm <- gce_vm (
  name = "rstudio-server-gpu",
  template = "rstudio-gpu",
  predefined_type = "n1-standard-16",
  disk_size_gb = 1000,
  acceleratorCount = 1,
  acceleratorType = "nvidia-tesla-p100",
  username = "###",
  password = "###"
)

gives me:

2021-08-19 07:34:36> Creating template VM
2021-08-19 07:34:36> Launching VM with GPU support. If using docker_cmd() functions make sure to include nvidia=TRUE parameter
ℹ 2021-08-19 07:34:39 > Request Status Code:  404
Error: API returned: The resource 'projects/deeplearning-platform-release/global/images/family/cos-stable' was not found

That does not work. Running this command (i.e., specifying image_project and image_family)

vm <- gce_vm (
  name = "rstudio-server-gpu",
  template = "rstudio-gpu",
  predefined_type = "n1-standard-16",
  image_project = "deeplearning-platform-release",
  image_family = "tf2-ent-latest-gpu",
  disk_size_gb = 1000,
  acceleratorCount = 1,
  acceleratorType = "nvidia-tesla-p100",
  username = "###",
  password = "###"
)

gets me an instance. The RStudio login window comes up in a reasonable time and I am able to log in. The instance is on Ubuntu 20.04 and not Debian. The specifications seem correct.

I did not see TensorFlow (I couldn't find it: pip does not list it, and I don't see any place where a virtualenv with it could be). I did see CUDA 10.1, which is outdated, but any attempt to update CUDA is just asking for trouble since this is a container and several files are read-only (they are injected by Docker). I tried and then realised that this was not possible.
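
For anyone reproducing this, the checks can be run from the RStudio session along these lines (tensorflow below is the R package, which may itself need installing first):

tensorflow::tf_config()                               # reports which Python / TensorFlow (if any) reticulate finds
system("nvidia-smi")                                  # driver and CUDA version visible to the container
system("pip list 2>/dev/null | grep -i tensorflow")   # confirms whether a TF wheel is installed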

CUDA 10.1 limits TensorFlow to 2.2. Apart from that, one has to be careful not to run apt update, as that tries to update things that cannot update and the entire system falls apart. So as long as one only pip-installs TF 2.2 and then installs the R libraries, I think this may be a feasible route. That said, I would not recommend it to anyone not familiar with the entire software stack. As for me, I am going to stick with the CPU computing route, because the CUDA implementation of GRU at that time did not support dropout (don't ask me why ... it's bizarre to me), which forces TF to use the CPU, and then what's the point of having a GPU at all!
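
A sketch of that route, using reticulate from the RStudio session (the version pin is the CUDA-10.1-compatible one; adjust as needed):

reticulate::py_install("tensorflow==2.2.0", pip = TRUE)
# or, equivalently, via the tensorflow R package:
# tensorflow::install_tensorflow(version = "2.2.0")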

I am going to close this issue again because I really don't think there is anything left to do from the perspective of googleComputeEngineR.

MarkEdmondson1234 commented 3 years ago

I will log into my own instance and investigate, and perhaps ask questions on the rocker/ml GitHub, so I will keep it open :)

anirban-mukherjee commented 3 years ago

Super, thanks! Do let me know if I can help test anything!

MarkEdmondson1234 commented 3 years ago

One useful thing would be to run it with options(googleAuthR.verbose=2) to see more logging.

anirban-mukherjee commented 3 years ago

> vm <- gce_vm (
+   name = "rstudio-server-gpu",
+   template = "rstudio-gpu",
+   predefined_type = "n1-standard-16",
+   image_project = "deeplearning-platform-release",
+   image_family = "tf2-ent-latest-gpu",
+   disk_size_gb = 1000,
+   acceleratorCount = 1,
+   acceleratorType = "nvidia-tesla-p100",
+   username = "###",
+   password = "###"
+ )
ℹ 2021-08-19 10:12:15 > Token exists.
ℹ 2021-08-19 10:12:15 > Request:  https://www.googleapis.com/compute/v1/projects/###/zones/###/instances/?filter=status%20eq%20TERMINATED
2021-08-19 10:12:15> Creating template VM
2021-08-19 10:12:15> Launching VM with GPU support. If using docker_cmd() functions make sure to include nvidia=TRUE parameter
2021-08-19 10:12:15> Run gce_startup_logs(your-instance, 'shell') to track startup script logs
ℹ 2021-08-19 10:12:15 > Token exists.
ℹ 2021-08-19 10:12:15 > Request:  https://www.googleapis.com/compute/v1/projects/###/global/networks/default/
ℹ 2021-08-19 10:12:16 > Token exists.
ℹ 2021-08-19 10:12:16 > Request:  https://www.googleapis.com/compute/v1/projects/deeplearning-platform-release/global/images/family/tf2-ent-latest-gpu/
ℹ 2021-08-19 10:12:16 > Token exists.
ℹ 2021-08-19 10:12:16 > Request:  https://www.googleapis.com/compute/v1/projects/###/zones/###/instances/
ℹ 2021-08-19 10:12:16 > Body JSON parsed to:  {"machineType":"zones/###/machineTypes/n1-standard-16","metadata":{"items":[{"key":"startup-script","value":"#!/bin/bash\necho \"Docker RStudio GPU launch script\"\n# not done via cloud-init as not on container-os image for now\n\nRSTUDIO_USER=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/rstudio_user -H \"Metadata-Flavor: Google\")\nRSTUDIO_PW=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/rstudio_pw -H \"Metadata-Flavor: Google\")\nGCER_DOCKER_IMAGE=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/gcer_docker_image -H \"Metadata-Flavor: Google\")\n\necho \"Docker image: $GCER_DOCKER_IMAGE\"\n\necho \"GPU settings\"\nls -la /dev | grep nvidia\nnvidia-smi\n\nnvidia-docker run -p 80:8787 \\\n           -e ROOT=TRUE \\\n           -e USER=$RSTUDIO_USER -e PASSWORD=$RSTUDIO_PW \\\n           -d \\\n           --name=rstudio-gpu \\\n           --restart=always \\\n           $GCER_DOCKER_IMAGE\n"},{"key":"template","value":"rstudio-gpu"},{"key":"google-logging-enabled","value":"true"},{"key":"rstudio_user","value":"###"},{"key":"rstudio_pw","value":"###"},{"key":"gcer_docker_image","value":"rocker/ml"},{"key":"install-nvidia-driver","value":"True"}]},"name":"rstudio-server-gpu","disks":[{"initializeParams":{"sourceImage":"projects/deeplearning-platform-release/global/images/tf2-ent-latest-gpu-v20210818","diskSizeGb":"1000"},"autoDelete":true,"boot":true,"type":"PERSISTENT","deviceName":"rstudio-server-gpu-boot-disk"}],"networkInterfaces":[{"network":"https://www.googleapis.com/compute/v1/projects/###/global/networks/default","accessConfigs":[{"type":"ONE_TO_ONE_NAT"}]}],"scheduling":{"onHostMaintenance":"TERMINATE","automaticRestart":true},"serviceAccounts":[{"email":"###","scopes":["https://www.googleapis.com/auth/cloud-platform"]}],"guestAccelerators":[{"acceleratorCount":1,"acceleratorType":"projects/###/zones/###/acceleratorTypes/nvidia-tesla-p100"}],"tags":{"items":["http-server","rstudio"]}}
2021-08-19 10:12:18> Starting operation...
ℹ 2021-08-19 10:12:18 > Token exists.
ℹ 2021-08-19 10:12:18 > Request:  https://www.googleapis.com/compute/v1/projects/###/zones/###/operations/operation-1629382337932-5c9ea2374abbd-871679b5-9f1ad25d/
2021-08-19 10:12:18> Operation running...
ℹ 2021-08-19 10:12:28 > Token exists.
ℹ 2021-08-19 10:12:28 > Request:  https://www.googleapis.com/compute/v1/projects/###/zones/###/operations/operation-1629382337932-5c9ea2374abbd-871679b5-9f1ad25d/
2021-08-19 10:12:29> Operation running...
ℹ 2021-08-19 10:12:39 > Token exists.
ℹ 2021-08-19 10:12:39 > Request:  https://www.googleapis.com/compute/v1/projects/###/zones/###/operations/operation-1629382337932-5c9ea2374abbd-871679b5-9f1ad25d/
2021-08-19 10:12:39> Operation running...
ℹ 2021-08-19 10:12:49 > Token exists.
ℹ 2021-08-19 10:12:49 > Request:  https://www.googleapis.com/compute/v1/projects/###/zones/###/operations/operation-1629382337932-5c9ea2374abbd-871679b5-9f1ad25d/
2021-08-19 10:12:59> Operation complete in 29 secs
ℹ 2021-08-19 10:12:59 > Token exists.
ℹ 2021-08-19 10:12:59 > Request:  https://www.googleapis.com/compute/v1/projects/###/zones/###/instances/rstudio-server-gpu/
2021-08-19 10:13:00> ## VM Template: 'rstudio-gpu' running at http://###
2021-08-19 10:13:00> On first boot, wait a few minutes for docker container to install before logging in.
==Google Compute Engine Instance==

Name:                rstudio-server-gpu
Created:             2021-08-19 07:12:18
Machine Type:        n1-standard-16
Status:              RUNNING
Zone:                ###
External IP:         ###
Disks: 
                    deviceName       type       mode boot autoDelete
1 rstudio-server-gpu-boot-disk PERSISTENT READ_WRITE TRUE       TRUE

Metadata:  
                     key             value
2               template       rstudio-gpu
3 google-logging-enabled              true
4           rstudio_user           ###
5             rstudio_pw ###
6      gcer_docker_image         rocker/ml
7  install-nvidia-driver              True
ℹ 2021-08-19 10:13:00 > Token exists.
ℹ 2021-08-19 10:13:00 > Request:  https://www.googleapis.com/compute/v1/projects/###/global/firewalls/?
2021-08-19 10:13:00> http firewall exists: allow-http
ℹ 2021-08-19 10:13:00 > Token exists.
ℹ 2021-08-19 10:13:00 > Request:  https://www.googleapis.com/compute/v1/projects/###/global/firewalls/allow-http/
2021-08-19 10:13:00> https firewall exists: allow-https
ℹ 2021-08-19 10:13:00 > Token exists.
ℹ 2021-08-19 10:13:00 > Request:  https://www.googleapis.com/compute/v1/projects/###/global/firewalls/allow-https/
2021-08-19 10:13:01> rstudio-server-gpu VM running

anirban-mukherjee commented 3 years ago

I worked on this issue from a different angle. I started from the Google deep learning VM, installed RStudio Server and R, and set up remote access permissions. Everything works as expected.

TBH, I am unsure what's going on. The deep learning VM is Debian and comes pre-installed with TF on conda and CUDA 11.x. I am not sure I understand how or why, when using rocker, we end up with an Ubuntu VM with CUDA 10.x and Python/TF from apt. What, then, is the role of image_project and image_family? It seems like not much. In that case, shouldn't we start with a plain vanilla Ubuntu image, since the kernel, CUDA, and TF all do not end up in the final VM anyway? In fact, the simplest setup is likely to start with the deep learning VM and then only set up RStudio, R, and permissions. The user then only has to point R to the conda Python and everything works as expected. Specifically, this is all one needs after that setup:

Sys.setenv(RETICULATE_PYTHON = "/opt/conda/bin/python3")
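
After that, a quick sanity check along these lines (assuming the reticulate and tensorflow R packages are installed, and run in a fresh session so the environment variable takes effect) confirms everything is wired up:

reticulate::py_config()                              # should point at /opt/conda/bin/python3
tensorflow::tf$config$list_physical_devices("GPU")   # should list the attached GPU (TF 2.x)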