Open angusrtaylor opened 5 years ago
A workaround is just to run makeCluster(). You then get the following message and can run jobs on the cluster:
The specified cluster 'montecarlo' already exists. Cluster 'montecarlo' will be used. Your cluster has been registered.
Hi @angusrtaylor
Thanks, Brian
@brnleehng yes I'm using the monte carlo sample cluster configuration with 2 low priority nodes. However, the same occurs with another cluster I am using with 5 dedicated nodes.
Thanks Angus
What region are you currently in?
I'm also having issues with nodes reporting as idle, on both low priority and dedicated nodes. I will investigate the batch node logs.
Thanks, Brian
I'm using westeurope. I'll try a different region and let you know if the same issue occurs. Thanks
@brnleehng FYI this issue is still occurring. I've experienced it in every region I've tried, including westeurope and southcentralus.
My workaround (using makeCluster) is also causing problems. I get the warning:
The specified cluster 'rbscl' already exists. Cluster 'rbscl' will be used.
Your cluster has been registered.
Dedicated Node Count: 0
Low Priority Node Count: 0
Warning message:
In self$client$extractAzureResponse(response, content) : Conflict (HTTP 409).
Could this have something to do with the Docker container settings? I made a diff of the HTTP verbose log between registering an existing cluster via getCluster and via makeCluster. I found out two things:
First, if I run getCluster in a session where makeCluster has already been run successfully, it also works without problems for me. However, after deleting the cluster object and restarting the session, I can only use makeCluster.
Second, with getCluster, all requests still work up to a certain point, when the following lines are printed:
============================
Id: job20190628223825
chunkSize: 1
enableCloudCombine: TRUE
errorHandling: pass
wait: FALSE
autoDeleteJob: TRUE
============================
The next request is the PUT for jobxxx-metadata.rds. This one is the last to work. The POST to /jobs/jobxxx/tasks?api-version=2018-12-01.8.0 HTTP/1.1 after this breaks. The only differences between the requests are in the Authorization: SharedKey value in the HTTP header and one strange difference in the JSON payload: the containerSettings imageName is empty when running getCluster, but filled when running makeCluster.
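A minimal sketch of the check described above, assuming the JSON body of the task request has already been parsed into a nested R list (for example with jsonlite::fromJSON). The helper name image_name_missing, the task id, and the image name "example/image:tag" are made up for illustration; only the containerSettings and imageName field names come from the logged payload.

```r
# Returns TRUE when the task payload carries no usable container image name.
# `task` is assumed to be the parsed JSON body of the POST to /jobs/<id>/tasks.
image_name_missing <- function(task) {
  image <- task$containerSettings$imageName
  is.null(image) || length(image) == 0 || !nzchar(image[[1]])
}

# Hypothetical payloads mirroring the two cases seen in the verbose log.
from_get_cluster <- list(
  id = "task1",
  containerSettings = list(imageName = "")
)
from_make_cluster <- list(
  id = "task1",
  containerSettings = list(imageName = "example/image:tag")  # placeholder image
)

image_name_missing(from_get_cluster)   # TRUE: empty imageName, as with getCluster
image_name_missing(from_make_cluster)  # FALSE: imageName filled, as with makeCluster
```

A check like this only helps when comparing captured payloads; it does not explain why getCluster leaves the field empty.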
Before submitting a bug please check the following:
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /data/mlserver/9.3.0/runtime/R/lib/libRblas.so
LAPACK: /data/mlserver/9.3.0/runtime/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] doAzureParallel_0.7.2 iterators_1.0.9 foreach_1.4.5 RevoUtilsMath_10.0.1
[5] RevoUtils_10.0.7 RevoMods_11.0.0 MicrosoftML_9.3.0 RevoScaleR_9.3.0
[9] lattice_0.20-35 rpart_4.1-11

loaded via a namespace (and not attached):
[1] codetools_0.2-15 CompatibilityAPI_1.1.0 digest_0.6.17 rAzureBatch_0.6.2
[5] mime_0.5 bitops_1.0-6 grid_3.4.3 R6_2.2.2
[9] jsonlite_1.5 httr_1.3.1 curl_3.2 rjson_0.2.20
[13] tools_3.4.3 RCurl_1.95-4.11 yaml_2.2.0 compiler_3.4.3
[17] mrupdate_1.0.1
Description
I have an existing cluster created using the montecarlo_pricing_simulation.R script. In a fresh R session, I use getCluster as follows:
cluster <- getCluster("montecarlo", verbose = TRUE)
which outputs:
nodes:
idle: 2
creating: 0
starting: 0
waitingforstarttask: 0
starttaskfailed: 0
preempted: 0
running: 0
other: 0
Your cluster has been registered.
Dedicated Node Count: 0
Low Priority Node Count: 2
However, when I submit the job on batch, it hangs with the following message:
Id: job20181126153436
chunkSize: 13
enableCloudCombine: TRUE
errorHandling: stop
wait: TRUE
autoDeleteJob: TRUE
The cluster nodes on the portal remain idle. Eventually, I get the following error:
Error in curl::curl_fetch_memory(url, handle = handle) : SSL read: error:00000000:lib(0):func(0):reason(0), errno 104
This happens to me with my own code also. I cannot successfully run jobs on an existing cluster that has been retrieved with getCluster().
Instruction to repro the problem if applicable
Create a cluster
Restart R session
Load cluster with getCluster
Try and submit a job
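
The steps above can be sketched roughly as follows. This is an illustration only, not the original script: the function name repro_getcluster_hang, the credentials file name "credentials.json", and the trivial loop body are assumptions. It needs doAzureParallel, valid Azure credentials, and a cluster that already exists (step 1), and should be called in a freshly restarted R session (step 2).

```r
# Sketch of repro steps 3 and 4; nothing runs until the function is called.
# "credentials.json" and the default cluster name are placeholders.
repro_getcluster_hang <- function(cluster_name = "montecarlo",
                                  credentials_file = "credentials.json") {
  if (!requireNamespace("doAzureParallel", quietly = TRUE)) {
    stop("doAzureParallel is not installed")
  }
  # Bring the foreach operator into scope without attaching the package.
  `%dopar%` <- foreach::`%dopar%`

  doAzureParallel::setCredentials(credentials_file)

  # Step 3: attach to the existing cluster instead of creating it.
  cluster <- doAzureParallel::getCluster(cluster_name, verbose = TRUE)
  doAzureParallel::registerDoAzureParallel(cluster)

  # Step 4: submit a trivial job; in the failing case this call hangs
  # while the nodes stay idle in the portal.
  foreach::foreach(i = 1:2) %dopar% {
    i * 2
  }
}
```

Calling repro_getcluster_hang("montecarlo") in a new session should reproduce the hang if the problem is present; replacing the getCluster call with makeCluster on the original cluster configuration file corresponds to the workaround mentioned in the comments.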