MislavSag opened this issue 5 years ago
I'm sure Brian will probably reply with a better answer, but why not create your own dockerfile and build your image from scratch? That way you'll have full control of what's in the image, rather than trying to stack them through the doAzureParallel cluster boot process (which sounds a bit iffy to me). There's documentation available here to do this.
It sounds to me that the above method you describe probably wouldn't work, but I'm only basing that on a hunch, rather than any real deep technical understanding of how doAzureParallel works!
I don't have experience in writing Dockerfiles, but if it is possible to get Selenium and R working together in a Dockerfile and run it in Batch, I will surely start learning.
Dockerfiles are actually pretty straightforward - if you're already coding with R, Dockerfiles are trivial in comparison :)
I'm not sure what OS you use, but I'm assuming Linux. I've also never used Selenium so quickly Googled and found this: https://tecadmin.net/setup-selenium-with-firefox-on-ubuntu/
Given that a Dockerfile takes standard Ubuntu commands (e.g. apt-get install), maybe something like the below will work for you:
# Load rocker/tidyverse:3.4.1
FROM rocker/tidyverse:3.4.1
# Install any dependencies required for the R packages
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
firefox \
libxml2-dev \
libcurl4-openssl-dev \
libssl-dev \
libxi6 \
libgconf-2-4 \
default-jdk
# Add all your other dependencies here
# Install the R Packages from CRAN
RUN Rscript -e 'install.packages(c())'
You'd have to substitute in your different dependencies of course but maybe something like that will do the trick?
Also have a Google and see if anyone else has already created a Docker image containing what you need. Unless it's something really niche, I bet someone has already created it. Or at least will have a dockerfile that has pretty much everything you need, and you can add in the last bits yourself.
Is this what you're after, maybe? https://rpubs.com/johndharrison/RSelenium-Docker
simon-tarr, thanks for the help. I didn't mention it, but I am on Windows 7.
As you pointed out, there are Docker images for Selenium: https://hub.docker.com/u/selenium/ and I usually run Selenium through Docker, that is, I pull and run the Selenium image in Docker and then start it inside R:
library(RSelenium)
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100", port = 4444L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.google.com/")
If I run the above code in Batch it won't work, since there are no Selenium images there and they are not started.
Is it enough to just add the Selenium image to the Dockerfile:
FROM rocker/tidyverse:3.4.1
FROM selenium/node-firefox
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
firefox \
libxml2-dev \
libcurl4-openssl-dev \
libssl-dev \
libxi6 \
libgconf-2-4 \
default-jdk
RUN Rscript -e 'install.packages(c())'
I've just realised what you meant in your first message - sorry I misread/misunderstood. Initially thought you asked about loading two different docker images when booting an Azure pool (i.e. in the doAzureParallel configuration file). I now see that you've added FROM selenium/node-firefox into the dockerfile (which admittedly makes a lot more sense!).
I'd try building the image from the dockerfile above but remove libxi6, libgconf-2-4 and default-jdk... I should imagine they'd be included in selenium/node-firefox.
The instructions on creating containers can be found in that link that I posted in that first reply. Give it a try and let me know how you get on.
I will give it a try today to check if it is working.
Thanks again.
simon-tarr, if I add the RSelenium package to the Dockerfile, I get several dependency errors:
....
ERROR: dependency Rcpp is not available for package semver
* removing /usr/local/lib/R/site-library/semver
ERROR: dependency Rcpp is not available for package xml2
* removing /usr/local/lib/R/site-library/xml2
...
ERROR: dependencies xml2, semver are not available for package binman
* removing /usr/local/lib/R/site-library/binman
ERROR: dependencies binman, semver are not available for package wdman
* removing /usr/local/lib/R/site-library/wdman
ERROR: dependencies wdman, binman are not available for package RSelenium
* removing /usr/local/lib/R/site-library/RSelenium
....
Should I also add Rcpp to the installed packages? How should I know which dependencies to include?
The error message above says which packages/dependencies you are missing. It looks like you're missing:
Rcpp, xml2, semver, binman and wdman. If these are R packages, they can be installed via the following within your dockerfile:
RUN Rscript -e 'install.packages(c("Rcpp", "xml2", "binman", "wdman", "semver"))'
You might find that certain libraries are required for these packages (e.g. the R package xml2 needs the library libxml2-dev to be installed but as this is already contained in the dockerfile, this particular R package will install fine, assuming it's the only library it requires), but hopefully the base docker image will have all of them already. If not you'll have to look at the error logs and see which ones are missing and add a line to your dockerfile (closing the line off with a backslash if it's not the last line, as per the example dockerfile above).
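If you'd rather see the full list of R-side dependencies up front instead of working through the error log, something like this run in a local R session might help (just a suggestion, I've not needed it myself; it uses base R's tools package):
# list every package RSelenium pulls in, recursively
deps <- tools::package_dependencies("RSelenium", db = available.packages(), recursive = TRUE)
deps$RSelenium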
I think the problem is in:
ERROR: compilation failed for package Rcpp
* removing /usr/local/lib/R/site-library/Rcpp
It can't install Rcpp.
Why not use rocker/tidyverse? It should contain packages like Rcpp, shouldn't it?
I'm not sure which docker images will contain what packages; you'd need to search to find out. According to the dockerfile for rocker/tidyverse it doesn't look like Rcpp is installed within this image: https://hub.docker.com/r/rocker/tidyverse/~/dockerfile/
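If you have the image running locally you could also just ask R directly; a quick check from inside the container (just a sketch) would be:
# from an R session inside the running container: which of these are already installed?
c("Rcpp", "xml2", "RSelenium") %in% rownames(installed.packages())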
I have moved forward since yesterday. I have successfully installed Rcpp and RSelenium. I'm trying to install Java now, since it returns an error for the java.check() function.
Here is the final Dockerfile that works in Docker locally; I hope it will work on Batch:
## Start with the official rocker image (lightweight Debian)
FROM rocker/r-base:latest
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
libxml2-dev \
libcurl4-openssl-dev \
libssl-dev \
phantomjs \
gnupg2
## Install Java
RUN echo "deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main" \
| tee /etc/apt/sources.list.d/webupd8team-java.list \
&& echo "deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main" \
| tee -a /etc/apt/sources.list.d/webupd8team-java.list \
&& apt-key adv --keyserver keyserver.ubuntu.com --recv-keys EEA14886 \
&& echo "oracle-java8-installer shared/accepted-oracle-license-v1-1 select true" \
| /usr/bin/debconf-set-selections \
&& apt-get update \
&& apt-get install -y oracle-java8-installer \
&& update-alternatives --display java \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean \
&& R CMD javareconf
## make sure Java can be found in rApache and other daemons not looking in R ldpaths
RUN echo "/usr/lib/jvm/java-8-oracle/jre/lib/amd64/server/" > /etc/ld.so.conf.d/rJava.conf
RUN /sbin/ldconfig
# Install the R Packages from CRAN
RUN Rscript -e 'install.packages(c("Rcpp", "RSelenium"))'
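For reference, the quick check I run locally inside the container (nothing fancy, and assuming the geckodriver download works from inside the container) is roughly:
library(RSelenium)
# start the headless browser and load a page as a smoke test
rD <- rsDriver(
  browser = "firefox",
  extraCapabilities = list("moz:firefoxOptions" = list(args = list("--headless")))
)
rD$client$navigate("https://www.google.com/")
print(rD$client$getTitle())
# shut everything down again
rD$client$close()
rD$server$stop()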
Looks good! If it creates an image locally then it should run on Azure :)
Let me know how it turns out.
I would like to ask one more question. It is not related to the Docker container, but to performance. Let's say I have 4 nested for loops like this:
for (i in 1:39) {
#some code
for (j in 1:3) {
# some code
for (k in 1:25) {
#some code
for (l in 1:5000) {
# some code
}
}
}
}
Is it best to put the foreach loop in the innermost for loop? I should say the numbers are somewhat arbitrary: sometimes the third loop can be 1:10 or the last one 1:1000. Is the innermost loop really the best place for it? Also, is it possible to use the clusterEvalQ function?
I'll be honest - this is a little beyond me - it melts my brain thinking of more than two nested loops. I personally find it really difficult to read nested loops and only use them if I really have no other alternative...is there no way of vectorising your code at any stage?
EDIT - I realise that sometimes it's unavoidable to use for loops (and nested loops) but if you don't already know about vectorisation, here are some helpful links:
http://www.noamross.net/blog/2014/4/16/vectorization-in-r--why.html
https://www.dummies.com/programming/r/how-to-vectorize-your-functions-in-r/
You're probably not going to see vast performance gains but it would make reading your code easier in some situations.
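One more thought, purely as a sketch (and it assumes each combination of indices is independent of the others): rather than putting a foreach in the innermost loop, you could build a grid of the outer-loop combinations and run a single foreach over that, which gives the scheduler lots of independent tasks:
library(foreach)
# one row per combination of the three outer loops (39 * 3 * 25 = 2925 tasks)
grid <- expand.grid(i = 1:39, j = 1:3, k = 1:25)
results <- foreach(n = seq_len(nrow(grid)), .combine = rbind) %dopar% {
  p <- grid[n, ]
  # some code using p$i, p$j and p$k; handle the innermost 1:5000 loop (or vectorise it) in here
}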
simon-tarr, thanks for the links, I managed to simplify the loop. I have one, hopefully last, question. Is it allowed to use functions from the doParallel package in doAzureParallel?
For example, I would need to start Selenium in every instance on every VM. It doesn't make sense to do that for every step in the loop; a better way would be to start it on every node and then run the foreach loop. I know how to do that using the clusterEvalQ function, as in this code: https://stackoverflow.com/questions/38950958/run-rselenium-in-parallel
but there is no cl (cluster) object in doAzureParallel?
Hi @MislavSag
You should be able to use the doParallel package in doAzureParallel. We use it in the merge task (the task that combines all of your loop results together).
You will need to make sure that two tasks are not assigned to the same VM, or else both tasks will be fighting over the same resources.
Thanks! Brian
My idea is to start the web driver on every VM, and then use driver sessions across the VMs and the nodes on every VM. I can't do that inside the foreach loop since it distributes processes across nodes. In a normal R session it would go something like:
rD <- RSelenium::rsDriver(
browser = "firefox",
extraCapabilities = list(
"moz:firefoxOptions" = list(
args = list('--headless')
)
)
)
clusterExport(cl, varlist = c("rD"))
foreach ....
but I am not sure how to achieve this using Batch, since I have more than one VM.
I will try one more time :)
On my local machine I can implement Selenium testing using headless Firefox in the following way:
# load packages
library(RSelenium)
library(doParallel)  # also loads parallel and foreach
# write my test function
test_fun <- function() {
# Some test function
}
# set cluster
cl <- parallel::makeCluster(detectCores() - 5)
registerDoParallel(cl)
# start headless firefox on a specific port. I don't know how to start this at the Azure VM level?
# I should start it on every VM, not on every node of the VM
rD <- RSelenium::rsDriver(
browser = "firefox",
extraCapabilities = list(
"moz:firefoxOptions" = list(
args = list('--headless')
)
)
)
# export to nodes and start driver on every node
clusterExport(cl, "rD")
clusterEvalQ(cl, {
library(RSelenium)
library(RCurl)
library(httr)
driver <- rD$client
driver$open()
Sys.sleep(2)
driver$navigate("http://www.google.com")
})
# finally, do Selenium test in parallel
foreach_loop <- foreach(i = 1:nrow(df),
.packages = c("RSelenium"),
# .combine = 'rbind',
.export = c("df")) %dopar% {
test_fun()
}
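(and when the loop finishes locally I tidy up, e.g.:)
# stop the local workers and the Selenium server once finished
parallel::stopCluster(cl)
rD$server$stop()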
My problem is that I don't know how to implement this code in Azure Batch. I know how to do the foreach loop, but I'm not sure how to start headless Firefox on each VM and then start sessions on each node (using clusterEvalQ)?
Hi @MislavSag
There's no clusterEvalQ equivalent in doAzureParallel. The doAzureParallel package is more equivalent to doParallel than to the 'parallel' R package.
Unfortunately, I'm not that familiar with the RSelenium package. Do you need to start the Selenium server for all the tasks on each VM, or do you need to start the Selenium driver on each VM? There is a start task command line in the pool cluster config. This command will be run on the host, not in the docker image.
My thoughts are that you can start another docker image with the Selenium driver in the start task command line. Make sure that docker image persists even after the start task ends, and make sure the ports are open for both docker images.
https://github.com/Azure/doAzureParallel/blob/master/docs/01-getting-started.md#cluster-settings
Thanks, Brian
Hi @brnleehng,
I thought it would be best to run 2 docker containers first. But I didn't know how to start and use a port from one container in another container, so I chose another way. I built a Docker image that installs Firefox, Java and RSelenium, everything that is needed to run Selenium inside R. I tried running a container from this image in Docker on my local machine and it worked; I could start Selenium and run some tests.
I would like to do that in parallel in Batch, which means I should execute the following command on every VM:
rD <- RSelenium::rsDriver(
browser = "firefox",
extraCapabilities = list(
"moz:firefoxOptions" = list(
args = list('--headless')
)
)
)
This command would open port 4567L and make it available for RSelenium drivers. I can't send this command through the foreach function since it is not possible to bind the same port (4567L) more than once on one machine.
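To illustrate (a made-up example, not something I actually run): if two workers on the same machine both executed the default call, the second one would fail because the port is already taken:
# hypothetical: two drivers on one machine, both on the default port 4567
rD1 <- RSelenium::rsDriver(browser = "firefox", port = 4567L)
rD2 <- RSelenium::rsDriver(browser = "firefox", port = 4567L)  # fails, port 4567 is already bound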
Hi Mislav,
If you have the dockerfile up on docker hub, can you share it with me?
There is a way to parallelize your tests with RSelenium and doAzureParallel. However, doAzureParallel does not support this scenario out of the box.
Here's an example. You can install the RSelenium and doParallel packages:
# Create a cluster with 5 VMs with 4 workers
cl <- doAzureParallel::makeCluster("cluster.json")
doAzureParallel::registerDoAzureParallel(cl)
# finally, do Selenium test in parallel
foreach_loop <- foreach(i = 1:nrow(df),
.packages = c("RSelenium", "doParallel", "parallel", "RCurl", "httr"),
# .combine = 'rbind',
.export = c("df")) %dopar% {
cl <- parallel::makeCluster(4)
registerDoParallel(cl)
rD <- RSelenium::rsDriver(
browser = "firefox",
extraCapabilities = list(
"moz:firefoxOptions" = list(
args = list('--headless')
)
)
)
# export to nodes and start driver on every node
clusterExport(cl, "rD")
clusterEvalQ(cl, {
library(RSelenium)
library(RCurl)
library(httr)
driver <- rD$client
driver$open()
Sys.sleep(2)
driver$navigate("http://www.google.com")
})
}
Thanks, Brian
Hi Brian,
You can find my dockerfile here: https://hub.docker.com/r/theanswer0207/firefox-headless-r/ It's the first dockerfile I have written, but I tried it and it worked. I have used it in my cluster.json file.
Thanks for help.
Before I try it, I would like to check that I got everything right.
My cluster.json file should look like this:
{
"name": "scraping",
"vmSize": "Standard_A1",
"maxTasksPerNode": 5,
"poolSize": {
"dedicatedNodes": {
"min": 4,
"max": 4
},
"lowPriorityNodes": {
"min": 0,
"max": 0
},
"autoscaleFormula": "QUEUE"
},
"containerImage": "theanswer0207/firefox-headless-r:latest",
"rPackages": {
"cran": [],
"github": [],
"bioconductor": []
},
"commandLine": [],
"subnetId": ""
}
Then start the cluster with:
# Create the cluster defined in cluster.json above (4 dedicated VMs, 5 tasks per node)
cl <- doAzureParallel::makeCluster("cluster.json")
doAzureParallel::registerDoAzureParallel(cl)
Should I skip this part?:
setVerbose(TRUE)
setAutoDeleteJob(FALSE)
generateCredentialsConfig("credentials.json")
setCredentials("credentials.json")
generateClusterConfig("cluster.json")
cluster <- makeCluster("cluster.json")
registerDoAzureParallel(cluster)
getDoParWorkers()
opt <- list(wait = FALSE)
Then I see you have started RSelenium on every VM inside the foreach loop. But this should be called only once, at the beginning of the loop. If I add one more line of code to your foreach loop, it will start the driver every time:
# finally, do Selenium test in parallel
foreach_loop <- foreach(i = 1:nrow(df),
.packages = c("RSelenium", "doParallel", "parallel", "RCurl", "httr"),
# .combine = 'rbind',
.export = c("df")) %dopar% {
cl <- parallel::makeCluster(4)
registerDoParallel(cl)
rD <- RSelenium::rsDriver(
browser = "firefox",
extraCapabilities = list(
"moz:firefoxOptions" = list(
args = list('--headless')
)
)
)
# export to nodes and start driver on every node
clusterExport(cl, "rD")
clusterEvalQ(cl, {
library(RSelenium)
library(RCurl)
library(httr)
driver <- rD$client
driver$open()
Sys.sleep(2)
driver$navigate("http://www.google.com")
my_vector[i] <- driver$findElement("xpath", "somePath")
})
}
How can I avoid creating cl and starting the driver on every iteration?
Sorry, updated the example. You are running a task on each VM. Azure Batch schedules each task on a VM first, round-robin style.
You can keep this section
setVerbose(TRUE)
setAutoDeleteJob(FALSE)
generateCredentialsConfig("credentials.json")
setCredentials("credentials.json")
generateClusterConfig("cluster.json")
cluster <- makeCluster("cluster.json")
registerDoAzureParallel(cluster)
getDoParWorkers()
opt <- list(wait = FALSE)
By using this method, you need the same number of tasks as VMs. In this case, it will be 2 tasks. The caveat is you will need to split your dataframe before running the foreach loop and download the right piece inside each task.
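For example (just a sketch, assuming a dataframe called df and 2 VMs), you could split the data up front, write one CSV per task and host the files somewhere the nodes can download them from:
# split df into one chunk per VM/task and write each chunk to its own CSV
number_of_vms <- 2
chunk_id <- cut(seq_len(nrow(df)), breaks = number_of_vms, labels = FALSE)
chunks <- split(df, chunk_id)
for (i in seq_along(chunks)) {
  write.csv(chunks[[i]], paste0(i, ".csv"), row.names = FALSE)
}
# then upload 1.csv, 2.csv, ... somewhere the nodes can reach (e.g. GitHub raw or blob storage)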
The property dedicatedNodes means the number of VMs you want in your pool. The VM size Standard_A1 is a 1-core machine.
{
"name": "scraping",
"vmSize": "Standard_F4",
"maxTasksPerNode": 1,
"poolSize": {
"dedicatedNodes": {
"min": 2,
"max": 2
},
"lowPriorityNodes": {
"min": 0,
"max": 0
},
"autoscaleFormula": "QUEUE"
},
"containerImage": "theanswer0207/firefox-headless-r:latest",
"rPackages": {
"cran": [],
"github": [],
"bioconductor": []
},
"commandLine": [],
"subnetId": ""
}
# Create a cluster of 2 Standard_F4 VMs (4 cores each)
cl <- doAzureParallel::makeCluster("cluster.json")
doAzureParallel::registerDoAzureParallel(cl)
# finally, do Selenium test in parallel
foreach_loop <- foreach(i = 1:number_of_vms,
.packages = c("RSelenium", "doParallel", "parallel", "RCurl", "httr"),
# .combine = 'rbind',
.export = c("df")) %dopar% {
download.file(paste0("https://raw.githubusercontent.com/Test/Sample/", i, ".csv"), destfile = paste0(i, ".csv"))
data <- read.csv(paste0(i, ".csv"))
# Standard_F4 machines have 4 cores. Use all of them, same as on your current workstation
cl <- parallel::makeCluster(4)
registerDoParallel(cl)
rD <- RSelenium::rsDriver(
browser = "firefox",
extraCapabilities = list(
"moz:firefoxOptions" = list(
args = list('--headless')
)
)
)
# export to nodes and start driver on every node
clusterExport(cl, "rD")
clusterEvalQ(cl, {
library(RSelenium)
library(RCurl)
library(httr)
driver <- rD$client
driver$open()
Sys.sleep(2)
driver$navigate("http://www.google.com")
})
}
Thanks, Brian
I am running something right now on Azure batch. After that, I will try your code immediately!
I have one question: why does this part:
# Standard_F4 machines have 4 cores. Use all of them, same as on your current workstation
cl <- parallel::makeCluster(4)
registerDoParallel(cl)
rD <- RSelenium::rsDriver(
browser = "firefox",
extraCapabilities = list(
"moz:firefoxOptions" = list(
args = list('--headless')
)
)
)
# export to nodes and start driver on every node
clusterExport(cl, "rD")
have to be inside the foreach loop? Why can't I start the driver on every VM before the foreach loop and then start the foreach loop?
I have returned to this problem today. No success. The main problem is this function:
rD <- RSelenium::rsDriver(
browser = "firefox",
extraCapabilities = list(
"moz:firefoxOptions" = list(
args = list('--headless')
)
)
)
This function starts a Selenium server and browser. It should be started on each node (VM) separately. For example, on my local machine I start it once on the machine and then all cores can connect to that driver. If I just put this command inside the foreach loop, it would try to start the driver several times on the same machine, which doesn't make sense.
Hi @MislavSag
I don't think doAzureParallel works out of the box here, because the container immediately gets removed once it's used (on every task).
The workaround is the example shown above.
Basically, we are running doParallel, which also starts the Selenium server and browser on each VM with a single task. The caveat is you need to run the foreach over the number of VMs available (for example, foreach(i = 1:number_of_vms)) and you have to manage how the data is spread.
Thanks, Brian
If I use
"maxTasksPerNode": 2,
"poolSize": {
"dedicatedNodes": {
"min": 2,
"max": 2
},
is the number of VMs equal to 2 or 4 (2 * 2)?
You will have two VMs. In the above, 'min' refers to the minimum number of VMs you want in your cluster, 'max' refers to the maximum. When min = max, it means that your cluster won't autoscale and, in this example, you'd have 2 VMs.
More information on autoscaling can be found within the documentation: https://github.com/Azure/doAzureParallel/blob/master/docs/32-autoscale.md
I am new to the Azure Batch service and this package. I was following the instructions on the introduction page and successfully implemented Azure batching for a simple foreach loop.
I saw that in the configuration file there is a parameter "containerImage" with default "rocker/tidyverse:3.4.1". I am not sure whether it is possible to add two or more images to "containerImage" and use both. More concretely, is it possible to put the "selenium/standalone-firefox" image in the containerImage parameter and pull it together with "rocker/tidyverse:3.4.1"? If the answer is yes, is it possible to run Selenium inside an R script in the usual way using the RSelenium package?