Azure / doAzureParallel

An R package that allows users to submit parallel workloads in Azure
MIT License

Jobs don't start after registering cluster with getCluster() #330

Open angusrtaylor opened 5 years ago

angusrtaylor commented 5 years ago

Before submitting a bug please check the following:

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /data/mlserver/9.3.0/runtime/R/lib/libRblas.so
LAPACK: /data/mlserver/9.3.0/runtime/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] doAzureParallel_0.7.2 iterators_1.0.9       foreach_1.4.5         RevoUtilsMath_10.0.1
[5] RevoUtils_10.0.7      RevoMods_11.0.0       MicrosoftML_9.3.0     RevoScaleR_9.3.0
[9] lattice_0.20-35       rpart_4.1-11

loaded via a namespace (and not attached):
 [1] codetools_0.2-15       CompatibilityAPI_1.1.0 digest_0.6.17          rAzureBatch_0.6.2
 [5] mime_0.5               bitops_1.0-6           grid_3.4.3             R6_2.2.2
 [9] jsonlite_1.5           httr_1.3.1             curl_3.2               rjson_0.2.20
[13] tools_3.4.3            RCurl_1.95-4.11        yaml_2.2.0             compiler_3.4.3
[17] mrupdate_1.0.1

Description

I have an existing cluster created using the montecarlo_pricing_simulation.R script. In a fresh R session, I use getCluster as follows:

cluster <- getCluster("montecarlo", verbose = TRUE)

which outputs:

nodes:
  idle: 2
  creating: 0
  starting: 0
  waitingforstarttask: 0
  starttaskfailed: 0
  preempted: 0
  running: 0
  other: 0
Your cluster has been registered.
Dedicated Node Count: 0
Low Priority Node Count: 2

However, when I submit a job to Batch, it hangs with the following message:

Id: job20181126153436
chunkSize: 13
enableCloudCombine: TRUE
errorHandling: stop
wait: TRUE
autoDeleteJob: TRUE

The cluster nodes on the portal remain idle. Eventually, I get the following error:

Error in curl::curl_fetch_memory(url, handle = handle) : SSL read: error:00000000:lib(0):func(0):reason(0), errno 104

This happens to me with my own code also. I cannot successfully run jobs on an existing cluster that has been retrieved with getCluster().
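For reference, the failing flow is roughly the following sketch (the credentials file path and the trivial foreach body are placeholders, not the actual Monte Carlo workload):

```r
library(doAzureParallel)

# Authenticate against Azure Batch/Storage (file path is a placeholder)
setCredentials("credentials.json")

# Retrieve the already-provisioned pool instead of creating it
cluster <- getCluster("montecarlo", verbose = TRUE)
registerDoAzureParallel(cluster)

# Any job submitted now hangs; the nodes stay idle on the portal
results <- foreach(i = 1:10) %dopar% sqrt(i)
```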


angusrtaylor commented 5 years ago

A workaround is just to run makeCluster(). You then get the following message and can run jobs on the cluster:

The specified cluster 'montecarlo' already exists.
Cluster 'montecarlo' will be used.
Your cluster has been registered.
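In code, the workaround amounts to registering the pool via makeCluster() against the same cluster configuration (the config file name is a placeholder); since the pool already exists, it is reused rather than re-provisioned:

```r
library(doAzureParallel)
setCredentials("credentials.json")

# makeCluster() detects the existing pool and reuses it instead of creating it
cluster <- makeCluster("montecarlo_cluster.json")
registerDoAzureParallel(cluster)
```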

brnleehng commented 5 years ago

Hi @angusrtaylor

Thanks, Brian

angusrtaylor commented 5 years ago

@brnleehng yes I'm using the monte carlo sample cluster configuration with 2 low priority nodes. However, the same occurs with another cluster I am using with 5 dedicated nodes.

Thanks Angus

brnleehng commented 5 years ago

What region are you currently in?

I'm also having issues with nodes reporting as idle, both low-priority and dedicated. I will be investigating the Batch node logs.

Thanks, Brian

angusrtaylor commented 5 years ago

I'm using westeurope. I'll try a different region and let you know if the same issue occurs. Thanks

angusrtaylor commented 5 years ago

@brnleehng FYI, this issue is still occurring. I've experienced it in every region I've tried, including westeurope and southcentralus.

angusrtaylor commented 5 years ago

My workaround (using makeCluster) is also causing problems. I get the following warning:

The specified cluster 'rbscl' already exists.
Cluster 'rbscl' will be used.
Your cluster has been registered.
Dedicated Node Count: 0
Low Priority Node Count: 0
Warning message:
In self$client$extractAzureResponse(response, content) :
  Conflict (HTTP 409).

zerweck commented 5 years ago

Could this have something to do with the Docker container settings? I diffed the verbose HTTP logs between registering an existing cluster via getCluster and via makeCluster, and found two things:

  1. When running getCluster in a session where makeCluster has already been run successfully, it also works without problems for me. However, after deleting the cluster object and restarting the session, I can only use makeCluster.
  2. In the case of a non-working cluster object after running getCluster, all requests still work up to a certain point. When the following lines are printed:
    ============================
    Id: job20190628223825
    chunkSize: 1
    enableCloudCombine: TRUE
    errorHandling: pass
    wait: FALSE
    autoDeleteJob: TRUE
    ============================

    The next request is the PUT for jobxxx-metadata.rds; this is the last one to work. The POST to /jobs/jobxxx/tasks?api-version=2018-12-01.8.0 HTTP/1.1 after it breaks. The only differences between the requests are in the Authorization: SharedKey HTTP header, plus one strange difference in the JSON payload: the containerSettings imageName is empty when running getCluster, but filled when running makeCluster.
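For context (a sketch from memory, worth double-checking against the doAzureParallel docs, not a confirmed fix): in a doAzureParallel 0.7.x cluster config, the container image is set via the top-level containerImage field, which makeCluster reads from the config file but which getCluster may fail to recover from the existing pool. The values below are illustrative placeholders:

```json
{
  "name": "montecarlo",
  "vmSize": "Standard_F2",
  "maxTasksPerNode": 1,
  "poolSize": {
    "dedicatedNodes": { "min": 0, "max": 0 },
    "lowPriorityNodes": { "min": 2, "max": 2 },
    "autoscaleFormula": "QUEUE"
  },
  "containerImage": "rocker/tidyverse:latest"
}
```

If that hypothesis holds, it would explain why a fresh session plus getCluster produces tasks with an empty imageName while makeCluster, which has the config file in hand, does not.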