Azure / azureml-sdk-for-r

Azure Machine Learning SDK for R
https://azure.github.io/azureml-sdk-for-r/

Replicating the Dockerfile produces errors #398

Closed hermandr closed 4 years ago

hermandr commented 4 years ago

Describe the bug Replicating the Docker image from the article "How to create custom docker base images for azure machine learning environments" produces errors.

FROM mcr.microsoft.com/azureml/base:openmpi3.1.2-ubuntu16.04
RUN conda install -c r -y \
  r-essentials=3.6.0 \
  r-reticulate \
  rpy2 \
  r-remotes \
  r-rodbc \
  r-e1071 \
  r-optparse && \
  conda clean -ay && pip install --no-cache-dir azureml-defaults
RUN apt-get update && apt-get install -y \
  tzdata \
  zlib1g-dev && \
  apt-get clean

ENV TAR="/bin/tar"

# Set default locale
ENV LANG C.UTF-8

# Set default timezone
ENV TZ UTC

RUN R -e "remotes::install_github('https://github.com/Azure/azureml-sdk-for-r')"
RUN R -e "azuremlsdk::install_azureml(version = '1.10.0', remove_existing_env = TRUE)"

No error during docker build.

Based on the accidents example in the vignettes, when I ran the R code on the cluster I encountered this error.

R code snippet where the error occurred

  message("Log metrics on azureml")

  log_metric_to_run("Accuracy",
                    calc_acc(actual = accident_tst$dead,
                             predicted = predict(accident_glm_mod, newdata = accident_tst))
  )
  log_metric_to_run("Method","GLM")
  log_metric_to_run("TrainPCT",train.pct)

Error output

Log metrics on azureml
Error in py_get_attr_impl(x, name, silent) : 
  AttributeError: module 'azureml' has no attribute 'core'
Calls: log_metric_to_run ... py_get_attr_or_item -> py_get_attr -> py_get_attr_impl
Execution halted
2020/10/20 04:15:31 logger.go:297: Failed to run the wrapper cmd with err: exit status 1
2020/10/20 04:15:31 logger.go:297: Attempt 1 of http call to http://10.0.0.4:16384/sendlogstoartifacts/status
2020/10/20 04:15:31 sysutils_linux.go:221: mpirun version string: {
mpirun (Open MPI) 3.1.2

Report bugs to http://www.open-mpi.org/community/help/
}

When I use the Dockerfile:

FROM lucazav/r-sdk-docker-img:v3
# Check that it can access azureml and azureml.core modules
RUN R -e "reticulate::py_module_available('azureml');  reticulate::py_module_available('azureml.core')"

No errors were encountered.
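
The two `py_module_available()` calls above print TRUE or FALSE, but a FALSE does not stop the build. A comparable probe can be sketched in Python with the standard library. `missing_modules` is a hypothetical helper, and the names passed to it are standard-library stand-ins rather than the real `azureml` modules. It only tests the interpreter that runs it, which is not necessarily the one reticulate binds to.

```python
import importlib.util


def missing_modules(names):
    """Return the dotted module names that this interpreter cannot locate."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised when a parent package of a dotted name is itself absent.
            found = False
        if not found:
            missing.append(name)
    return missing


# Stand-in names: a package, a real submodule, and a submodule that does not exist.
report = missing_modules(["json", "json.decoder", "json.no_such_submodule"])
print(report)
```

In a build step, a non-empty result could be turned into a non-zero exit status so that the image build fails early instead of at run time.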

It is important for my client that the image is taken from the enterprise private ACR. This seems to be an older issue appearing again.

To Reproduce Steps to reproduce the behavior:

  1. Copy the Dockerfile from How to create custom docker base images for azure machine learning environments
  2. Run the sample code from Vignettes on accidents
  3. The run fails with an error where it calls log_metric_to_run()
  4. Use "FROM lucazav/r-sdk-docker-img:v3" to take the pre-built image (built around March 2020)
  5. Run the sample code

Expected behavior The copy of the Dockerfile should work for log_metric()

Screenshots None

Additional context Vignette experiments-deep-dive.Rmd

diondrapeck commented 4 years ago

I'm tagging @lucazav since he authored this article.

lucazav commented 4 years ago

Hi @hermandr ,

that article was written a few months ago. It could be that the current R SDK version is not compatible with the old Python SDK version 1.10.0. According to this page, the latest Python SDK version is 1.16.0. Try replacing the last line of your Dockerfile with the following one:

RUN R -e "azuremlsdk::install_azureml(version = '1.16.0', remove_existing_env = TRUE)"

and then re-build your Docker image.

lucazav commented 4 years ago

@hermandr as the Dockerfile you are testing doesn't install any particular package, I suppose your code works fine with the default Docker image, doesn't it?

hermandr commented 4 years ago

> @hermandr as the Dockerfile you are testing doesn't install any particular package, I suppose your code works fine with the default Docker image, doesn't it?

log_metric does not work. Please try to create a v4 by re-building your image with the latest versions of azureml, and test whether you get the same result as I do. I believe that when I use "FROM ..." it just pulls the pre-built image from your v3 build. But when I build from the same Dockerfile, log_metric fails, probably because of changes in the azureml SDK since your build date.

Herman

hermandr commented 4 years ago

> Hi @hermandr ,
>
> that article was written a few months ago. It could be that the current R SDK version is not compatible with the old Python SDK version 1.10.0. According to this page, the latest Python SDK version is 1.16.0. Try replacing the last line of your Dockerfile with the following one:
>
> RUN R -e "azuremlsdk::install_azureml(version = '1.16.0', remove_existing_env = TRUE)"
>
> and then re-build your Docker image.

@lucazav Sorry, I missed this post. Let me try with SDK v1.16.0. I will post the result here.

Herman

hermandr commented 4 years ago

Hi @lucazav,

I can confirm that with v1.16.0 the error still occurs:

Log metrics on azureml
Error in py_get_attr_impl(x, name, silent) : 
  AttributeError: module 'azureml' has no attribute 'core'
Calls: log_metric_to_run ... py_get_attr_or_item -> py_get_attr -> py_get_attr_impl
Execution halted
2020/10/22 14:17:59 logger.go:297: Failed to run the wrapper cmd with err: exit status 1
2020/10/22 14:17:59 logger.go:297: Attempt 1 of http call to http://10.0.0.4:16384/sendlogstoartifacts/status
2020/10/22 14:17:59 sysutils_linux.go:221: mpirun version string: {
mpirun (Open MPI) 3.1.2

Report bugs to http://www.open-mpi.org/community/help/
}
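
The AttributeError in the log above has the form Python produces when a package object is asked for a submodule that is not there. A minimal sketch of that general behaviour follows, using a throwaway package named `demo_pkg` (a hypothetical name, not part of any SDK) created in a temporary directory. It illustrates the message only and does not establish why `core` was unavailable in this image.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def attribute_error_for_missing_submodule():
    """Build a package without a 'core' submodule and capture both failures."""
    with tempfile.TemporaryDirectory() as site_dir:
        # Hypothetical layout: demo_pkg/ exists, demo_pkg/core does not.
        pkg_dir = Path(site_dir) / "demo_pkg"
        pkg_dir.mkdir()
        (pkg_dir / "__init__.py").write_text("")

        sys.path.insert(0, site_dir)
        importlib.invalidate_caches()
        try:
            pkg = importlib.import_module("demo_pkg")  # top-level import succeeds
            try:
                pkg.core
                attr_message = None
            except AttributeError as err:
                attr_message = str(err)
            try:
                importlib.import_module("demo_pkg.core")
                import_failed = False
            except ModuleNotFoundError:
                import_failed = True
        finally:
            # Leave the interpreter as we found it.
            sys.path.remove(site_dir)
            sys.modules.pop("demo_pkg", None)
    return attr_message, import_failed


message, import_failed = attribute_error_for_missing_submodule()
print(message)
```

The top-level import succeeding while the submodule lookup fails is the same pattern as a check that reports the package as available but the submodule as unavailable.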

lucazav commented 4 years ago

Hi @hermandr,

if so, it would be a bug in the latest release. The R SDK PM told me there are a number of bugs to be fixed in the 1.10.0 version released on CRAN.

@diondrapeck could you investigate this bug, please? I don't think it depends on the custom Docker image, as the image simply installs the latest versions of both SDKs.

Thank you.

lucazav commented 4 years ago

@hermandr could you please install the latest versions of the SDKs on your Compute Instance using RStudio, like this:

remotes::install_github('https://github.com/Azure/azureml-sdk-for-r')
azuremlsdk::install_azureml(version = '1.16.0', remove_existing_env = TRUE)

and then try to run your code using the default Docker image? If it still fails, it's a confirmation that the bug is in the SDKs.

Thank you.

hermandr commented 4 years ago

  1. Created a new compute instance

    Virtual machine size
    STANDARD_D2_V3 (2 Cores, 8 GB RAM, 50 GB Disk)
    Processing Unit
    CPU - General purpose
  2. In the compute instance's RStudio, checked the default versions of the azureml SDK

  3. In RStudio, installed the latest versions of the azureml SDK for R and Python

    remotes::install_github('https://github.com/Azure/azureml-sdk-for-r')
    azuremlsdk::install_azureml(version = '1.16.0', remove_existing_env = TRUE)
  4. Checked the SDK versions in R and Python after installation

  5. Uploaded a minimal estimator script that runs only log_metric_to_run. Submit experiment code:

    library(azuremlsdk)

    setwd("~/cloudfiles/code/Users/oratsl")

    sp_auth <- service_principal_authentication(
      tenant_id = Sys.getenv("TENANT_ID"),
      service_principal_id = Sys.getenv("SERVICE_PRINCIPAL_ID"),
      service_principal_password = Sys.getenv("SERVICE_PRINCIPAL_PASSWORD")
    )

    ws <- get_workspace(
      "hermanml",
      subscription_id = "1dbf72ea-fdeb-46cb-a58f-b873e8f2ae4e",
      resource_group = "Machine-Learning",
      auth = sp_auth
    )

    # Find the compute target
    cluster_name <- "ml-compute"
    compute_target <- get_compute(ws, cluster_name = cluster_name)
    if (is.null(compute_target)) stop("Training cluster not found")

    exp <- experiment(ws, "minimal")

    est_minimal <- estimator(
      source_directory = "minimal",
      entry_script = "minimal_estimator.R",
      script_params = list("--note" = "hermantansg/r-sdk-docker-img:default",
                           "--instance" = "cloud"),
      compute_target = compute_target
    )

    run <- submit_experiment(exp, est_minimal)


Estimator code:

    #' Copyright(c) Microsoft Corporation.
    #' Licensed under the MIT license.

    # This is the code to be run on a node in compute cluster
    message("libs")
    library(azuremlsdk)
    library(optparse)

    library(dplyr)
    library(purrr)
    library(tidyr)

    # Debug
    library(reticulate)
    message("Check python config")
    tibble(p = list(py_discover_config())) %>%
      mutate(python = map_chr(p, "python"),
             libpython = map_chr(p, "libpython"),
             pythonhome = map_chr(p, "pythonhome"),
             virtualenv = map_chr(p, "virtualenv"),
             virtualenv_activate = map_chr(p, "virtualenv_activate"),
             version_string = map_chr(p, "version_string"),
             version = map_chr(p, "version"),
             architecture = map_chr(p, "architecture"),
             annaconda = map_lgl(p, "anaconda"),
             numpy = map(p, "numpy"),
             numpy = map_chr(numpy, "path"),
             python_versions = map(p, "python_versions"),
             python_versions = map_chr(python_versions, ~paste(.x, collapse = ":"))
      ) %>%
      select(-p) %>%
      gather(key = "py_config_parameter", value = "value") %>%
      unite(s, py_config_parameter, value, sep = ": ", remove = TRUE) %>%
      as.matrix() %>%
      write(., stderr())

    message("Check environments")
    conda_list()
    message("Check if azureml is accessible")
    py_module_available("azureml")
    message("Check if azureml.core is accessible")
    py_module_available("azureml.core")

    message("List of python modules and versions")
    system("pip list")

    ###################################

    message("optparse add options")
    options <- list(
      make_option(c("-n", "--note"), action = "store", dest = "note",
                  default = "No notes", help = "Note on submit"),
      make_option(c("-i", "--instance"), action = "store", dest = "instance",
                  default = "local",
                  help = "Location of compute instance local or azureml")
    )

    message("OptionParser")
    opt_parser <- OptionParser(option_list = options)
    opt <- parse_args(opt_parser)

    message("Submit note:", opt$note)

    # Log metrics to azuremlsdk only when on cloud
    if (opt$instance != "local") {
      message("Log metrics on azureml")
      log_metric_to_run("Method", "GLM")
    }

    message("End of run")

    message("Session Info")
    sessionInfo()


minimal-default.zip (https://github.com/Azure/azureml-sdk-for-r/files/5424990/minimal-default.zip)

  6. Run the submit code

  7. Wait for the estimator to complete and log results: no error, successful run

    70_driver_log (1).txt (https://github.com/Azure/azureml-sdk-for-r/files/5425009/70_driver_log.1.txt)

    Docker build log:
    20_image_build_log.txt (https://github.com/Azure/azureml-sdk-for-r/files/5425012/20_image_build_log.txt)

Summary:

  1. The submit code runs the latest versions of the azureml SDK for Python and R in the compute instance's RStudio
  2. The compute cluster running the default image runs successfully

hermandr commented 4 years ago

@lucazav this is your Dockerfile from v3, which I changed to use azuremlsdk v1.16:

FROM mcr.microsoft.com/azureml/base:openmpi3.1.2-ubuntu16.04
RUN conda install -c r -y \
  r-essentials=3.6.0 \
  r-reticulate \
  rpy2 \
  r-remotes \
  r-rodbc \
  r-e1071 \
  r-optparse && \
  conda clean -ay && pip install --no-cache-dir azureml-defaults
RUN apt-get update && apt-get install -y \
  tzdata \
  zlib1g-dev && \
  apt-get clean

ENV TAR="/bin/tar"

# Set default locale
ENV LANG C.UTF-8

# Set default timezone
ENV TZ UTC

RUN R -e "remotes::install_github('https://github.com/Azure/azureml-sdk-for-r')"
RUN R -e "azuremlsdk::install_azureml(version = '1.16.0', remove_existing_env = TRUE)"

From the 20_image_build_log.txt of the default Docker build, I reconstructed the Dockerfile:

FROM mcr.microsoft.com/azureml/base:openmpi3.1.2-ubuntu16.04@sha256:8bc7ffc7142fb2914e40e8d64fed7bb89f7d087b670c0cb3168d241a5e908e98
USER root
RUN mkdir -p $HOME/.cache
WORKDIR /
COPY azureml-environment-setup/99brokenproxy /etc/apt/apt.conf.d/
RUN if dpkg --compare-versions `conda --version | grep -oE '[^ ]+$'` lt 4.4.11; then conda install conda==4.4.11; fi
COPY azureml-environment-setup/mutated_conda_dependencies.yml azureml-environment-setup/mutated_conda_dependencies.yml
RUN ldconfig /usr/local/cuda/lib64/stubs && \
    conda env create -p /azureml-envs/azureml_da3e97fcb51801118b8e80207f3e01ad -f azureml-environment-setup/mutated_conda_dependencies.yml && \
    rm -rf "$HOME/.cache/pip" && \
    conda clean -aqy && \
    CONDA_ROOT_DIR=$(conda info --root) && \
    rm -rf "$CONDA_ROOT_DIR/pkgs" && \
    find "$CONDA_ROOT_DIR" -type d -name __pycache__ -exec rm -rf {} + && \
    ldconfig
ENV PATH /azureml-envs/azureml_da3e97fcb51801118b8e80207f3e01ad/bin:$PATH
ENV AZUREML_CONDA_ENVIRONMENT_PATH /azureml-envs/azureml_da3e97fcb51801118b8e80207f3e01ad
ENV LD_LIBRARY_PATH /azureml-envs/azureml_da3e97fcb51801118b8e80207f3e01ad/lib:$LD_LIBRARY_PATH
RUN conda install -p /azureml-envs/azureml_da3e97fcb51801118b8e80207f3e01ad -c r -y \
    r-essentials=3.6.0 \
    rpy2 \
    r-checkpoint && \
    pip install --no-cache-dir azureml-defaults
ENV TAR="/bin/tar"
RUN R -e "library(checkpoint); \
    snapshot_date <- tail(checkpoint::getValidSnapshots(), n = 1); \
    setSnapshot(snapshot_date); \
    install.packages(c('reticulate', 'remotes', 'e1071', 'optparse')); \
    library(remotes); \
    remotes::install_cran('azuremlsdk', upgrade = FALSE);"
COPY azureml-environment-setup/spark_cache.py azureml-environment-setup/log4j.properties /azureml-environment-setup/
RUN if [ $SPARK_HOME ]; then /bin/bash -c '$SPARK_HOME/bin/spark-submit  /azureml-environment-setup/spark_cache.py'; fi
ENV AZUREML_ENVIRONMENT_IMAGE True
CMD ["bash"]

Running docker build with this Dockerfile failed:

\r-sdk-docker-image>docker build -t azureuser/r-sdk-docker-img .
[+] Building 0.1s (24/25)
 => [internal] load build definition from Dockerfile                                                                                                                                                                                    0.0s
 => => transferring dockerfile: 32B                                                                                                                                                                                                     0.0s
 => [internal] load .dockerignore                                                                                                                                                                                                       0.0s
 => => transferring context: 2B                                                                                                                                                                                                         0.0s
 => [internal] load metadata for mcr.microsoft.com/azureml/base:openmpi3.1.2-ubuntu16.04@sha256:8bc7ffc7142fb2914e40e8d64fed7bb89f7d087b670c0cb3168d241a5e908e98                                                                        0.0s
 => [1/22] FROM mcr.microsoft.com/azureml/base:openmpi3.1.2-ubuntu16.04@sha256:8bc7ffc7142fb2914e40e8d64fed7bb89f7d087b670c0cb3168d241a5e908e98                                                                                         0.0s
 => [internal] load build context                                                                                                                                                                                                       0.0s
 => => transferring context: 2B                                                                                                                                                                                                         0.0s
 => CACHED [2/22] RUN mkdir -p $HOME/.cache                                                                                                                                                                                             0.0s
 => ERROR [3/22] COPY azureml-environment-setup/99brokenproxy /etc/apt/apt.conf.d/                                                                                                                                                      0.0s
 => CACHED [4/22] RUN if dpkg --compare-versions `conda --version | grep -oE '[^ ]+$'` lt 4.4.11; then conda install conda==4.4.11; fi                                                                                                  0.0s
 => ERROR [5/22] COPY azureml-environment-setup/mutated_conda_dependencies.yml azureml-environment-setup/mutated_conda_dependencies.yml                                                                                                 0.0s
 => CACHED [6/22] RUN ldconfig /usr/local/cuda/lib64/stubs &&     conda env create -p /azureml-envs/azureml_da3e97fcb51801118b8e80207f3e01ad -f azureml-environment-setup/mutated_conda_dependencies.yml &&  rm -rf "$HOME/.cache/pip"  0.0s
 => CACHED [7/22] RUN conda install -c r -y   r-essentials=3.6.0   rpy2   r-checkpoint   r-remotes   r-rodbc   r-e1071   r-reticulate   r-optparse &&   conda clean -ay &&   pip install --no-cache-dir azureml-defaults                0.0s
 => CACHED [8/22] RUN apt-get update && apt-get install -y   tzdata   zlib1g-dev &&   apt-get clean                                                                                                                                     0.0s
 => CACHED [9/22] RUN R -e "remotes::install_github('https://github.com/Azure/azureml-sdk-for-r')"                                                                                                                                      0.0s
 => CACHED [10/22] RUN R -e "azuremlsdk::install_azureml(version = '1.16.0', remove_existing_env = TRUE)"                                                                                                                               0.0s
 => CACHED [11/22] RUN R -e "library(checkpoint);     snapshot_date <- tail(checkpoint::getValidSnapshots(), n = 1);  setSnapshot(snapshot_date)"                                                                                       0.0s
 => CACHED [12/22] RUN R -e "remotes::install_github('https://github.com/Azure/azureml-sdk-for-r')"                                                                                                                                     0.0s
 => CACHED [13/22] RUN R -e "azuremlsdk::install_azureml(version = '1.16.0', remove_existing_env = TRUE)"                                                                                                                               0.0s
 => CACHED [14/22] RUN apt-get install -y pkg-config                                                                                                                                                                                    0.0s
 => CACHED [15/22] RUN R -e "install.packages('data.table', repos='http://cran.rstudio.com/')"                                                                                                                                          0.0s
 => CACHED [16/22] RUN R -e "install.packages('xgboost', version='0.82.0.1', repos='http://cran.rstudio.com/')"                                                                                                                         0.0s
 => CACHED [17/22] RUN R -e "install.packages('tidyverse', repos='http://cran.rstudio.com/')"                                                                                                                                           0.0s
 => CACHED [18/22] RUN R -e "install.packages('caret', repos='http://cran.rstudio.com/')"                                                                                                                                               0.0s
 => CACHED [19/22] RUN R -e "install.packages('ggfortify', repos='http://cran.rstudio.com/')"                                                                                                                                           0.0s
 => ERROR [20/22] COPY azureml-environment-setup/spark_cache.py azureml-environment-setup/log4j.properties /azureml-environment-setup/                                                                                                  0.0s
------
 > [3/22] COPY azureml-environment-setup/99brokenproxy /etc/apt/apt.conf.d/:
------
------
 > [5/22] COPY azureml-environment-setup/mutated_conda_dependencies.yml azureml-environment-setup/mutated_conda_dependencies.yml:
------
------
 > [20/22] COPY azureml-environment-setup/spark_cache.py azureml-environment-setup/log4j.properties /azureml-environment-setup/:
------
failed to solve with frontend dockerfile.v0: failed to build LLB: failed to compute cache key: "/azureml-environment-setup/log4j.properties" not found: not found

The main reason this failed is that the azureml-environment-setup folder is missing from the build context.

Somehow the default image is built with access to this folder, which contains some environment setup files.
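
A missing COPY source like this can be caught before running docker build. The following is a rough sketch, assuming a hypothetical helper `missing_copy_sources` and handling only simple shell-form COPY lines (no JSON-array form, wildcards, or line continuations). The COPY lines are taken from the reconstructed Dockerfile above, and the temporary directory stands in for a build context.

```python
import tempfile
from pathlib import Path


def missing_copy_sources(dockerfile_text, context_dir):
    """List COPY sources that do not exist under the build context directory."""
    context = Path(context_dir)
    missing = []
    for line in dockerfile_text.splitlines():
        tokens = line.split()
        if len(tokens) < 3 or tokens[0].upper() != "COPY":
            continue
        if any(tok.startswith("--from") for tok in tokens[1:]):
            continue  # copied from another build stage, not from the context
        paths = [tok for tok in tokens[1:] if not tok.startswith("--")]
        for source in paths[:-1]:  # the last path is the destination
            if not (context / source.lstrip("/")).exists():
                missing.append(source)
    return missing


# COPY lines taken from the reconstructed Dockerfile in this thread.
dockerfile = """\
FROM mcr.microsoft.com/azureml/base:openmpi3.1.2-ubuntu16.04
COPY azureml-environment-setup/99brokenproxy /etc/apt/apt.conf.d/
COPY azureml-environment-setup/spark_cache.py azureml-environment-setup/log4j.properties /azureml-environment-setup/
"""

with tempfile.TemporaryDirectory() as context_dir:
    # Simulate a context where only one of the three files is present.
    setup_dir = Path(context_dir) / "azureml-environment-setup"
    setup_dir.mkdir()
    (setup_dir / "spark_cache.py").write_text("")
    missing = missing_copy_sources(dockerfile, context_dir)

print(missing)
```

An empty result means every listed source was found in the context; anything else names the files that still need to be placed there.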

The two Dockerfiles above install the same azuremlsdk versions in R and Python, but they end up with different environments.

The default image has the conda environment:

Check python config
python:  /azureml-envs/azureml_da3e97fcb51801118b8e80207f3e01ad/bin/python3
libpython:  /azureml-envs/azureml_da3e97fcb51801118b8e80207f3e01ad/lib/libpython3.6m.so
pythonhome:  /azureml-envs/azureml_da3e97fcb51801118b8e80207f3e01ad:/azureml-envs/azureml_da3e97fcb51801118b8e80207f3e01ad
virtualenv:  
virtualenv_activate:  
version_string:  3.6.10 |Anaconda, Inc.| (default, Mar 23 2020, 23:13:11)  [GCC 7.3.0]
version:  3.6
architecture:  64bit
annaconda:  TRUE
numpy:  /azureml-envs/azureml_da3e97fcb51801118b8e80207f3e01ad/lib/python3.6/site-packages/numpy
python_versions:  /azureml-envs/azureml_da3e97fcb51801118b8e80207f3e01ad/bin/python3:/usr/bin/python3
Check environments
                                      name                                   python
1 azureml_da3e97fcb51801118b8e80207f3e01ad   /azureml-envs/azureml_da3e97fcb51801118b8e80207f3e01ad/bin/python
Check if azureml is accessible
[1] TRUE
Check if azureml.core is accessible
[1] TRUE

The custom image has the conda environment:

Check python config
python:  /opt/miniconda/bin/python3
libpython:  /opt/miniconda/lib/libpython3.7m.so
pythonhome:  /opt/miniconda:/opt/miniconda
virtualenv:  
virtualenv_activate:  
version_string:  3.7.7 (default, Mar 23 2020, 22:36:06)  [GCC 7.3.0]
version:  3.7
architecture:  64bit
annaconda:  FALSE
numpy:  /opt/miniconda/lib/python3.7/site-packages/numpy
python_versions:  /opt/miniconda/bin/python3:/usr/bin/python3
          name                                      python
1 r-reticulate /opt/miniconda/envs/r-reticulate/bin/python
Check if azureml is accessible
[1] TRUE
Check if azureml.core is accessible
[1] FALSE 
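
The differences between the two dumps can be listed mechanically. This is a small sketch with hypothetical helpers `parse_config_dump` and `differing_keys`, fed with a few of the `key:  value` lines shown above (the `annaconda` spelling is kept as it appears in the output).

```python
def parse_config_dump(text):
    """Turn 'key:  value' lines into a dict, skipping lines without a colon."""
    config = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            config[key.strip()] = value.strip()
    return config


def differing_keys(left, right):
    """Return {key: (left_value, right_value)} for keys whose values differ."""
    keys = sorted(set(left) | set(right))
    return {k: (left.get(k), right.get(k)) for k in keys if left.get(k) != right.get(k)}


# A subset of the lines reported for each image in this thread.
default_image = """\
python:  /azureml-envs/azureml_da3e97fcb51801118b8e80207f3e01ad/bin/python3
version:  3.6
architecture:  64bit
annaconda:  TRUE
"""
custom_image = """\
python:  /opt/miniconda/bin/python3
version:  3.7
architecture:  64bit
annaconda:  FALSE
"""

diff = differing_keys(parse_config_dump(default_image), parse_config_dump(custom_image))
for key, (default_value, custom_value) in diff.items():
    print(f"{key}: default={default_value} custom={custom_value}")
```

For this subset the interpreter path, Python version, and Anaconda flag differ, while the architecture is the same.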

hermandr commented 4 years ago

UPDATE: I created a code/azureml-environment-setup folder on the compute instance. To get the 3 files this Dockerfile requires, I copied them from the built image on the compute cluster to the outputs folder in the training script:

system("cp /azureml-environment-setup/* ./outputs")
system("cp /etc/apt/apt.conf.d/* ./outputs")

I ran docker build using the Dockerfile for the default compute cluster.

It works!

diondrapeck commented 4 years ago

Thanks for sharing your solution, @hermandr!