Azure / doAzureParallel

An R package that allows users to submit parallel workloads in Azure
MIT License

Specify Operating System in cluster.json #43

Closed simon-tarr closed 7 years ago

simon-tarr commented 7 years ago

Hello, is it possible to specify which OS your pool should use? By default it appears as though the script spins up the microsoft-ads / linux-data-science-vm / linuxdsvm image.

I have a package that can only run in a Windows environment - I'm hoping that I can specify this in the config script. Thanks in advance for any assistance.

EDIT - Some extra information. Since posting the above I have read that doAzureParallel uses "Data Science Virtual Machine (DSVM)...This package uses the Linux Edition of the DSVM which comes preinstalled with Microsoft R Server Developer edition" [https://github.com/Azure/doAzureParallel/blob/master/docs/00-azure-introduction.md]. What does this mean in practice?

When I attempt to provision a pool with the aforementioned Windows-only package from Github, the nodes never start - they turn orange within the pool information window ("Start task failed"). I have put this down to the fact that the pool is trying to load a Windows-only package in a Linux environment. If I remove the Github string, the pool provisions as expected but tasks requiring the package never run (for obvious reasons).
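For context, a GitHub package is requested through the cluster configuration file rather than in code. A minimal sketch of such a cluster.json is below; the exact schema has changed across doAzureParallel versions, so treat the field names, pool name, and VM size as illustrative (the mrke/NicheMapR reference comes from later in this thread):

```json
{
  "name": "nichemapr-pool",
  "vmSize": "Standard_D2_v2",
  "poolSize": {
    "dedicatedNodes": { "min": 2, "max": 2 }
  },
  "rPackages": {
    "cran": [],
    "github": ["mrke/NicheMapR"]
  }
}
```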

paselem commented 7 years ago

Hello Simon.

Unfortunately this package takes a hard dependency on the Linux OS and would be quite difficult to migrate over to Windows. We do not have any plans at this time to support our package on Windows, although we understand the issue you are running into.

Can you provide some information around which package you are using? I wonder if there would be a replacement you could use instead?

Thanks, -Pablo

simon-tarr commented 7 years ago

Hi Pablo,

Thanks for the response. I was hoping that wasn't going to be the case! I'm a PhD researcher and I'm trying to run a model called NicheMapR, for which there is no replacement - the functions within the R package call on code that has only been compiled for Windows and MacOS. I tried in vain to get a working solution on my institution's supercomputer but, as it's UNIX, the package can't be installed.

I then spent significant time researching Azure as it's easy to deploy Windows nodes. I found doAzureParallel during this research. To now find out that a Microsoft-developed package and cloud solution can't run in a Windows environment is such a shame (and a little ironic!). Unfortunately the package developer also told me there are no plans to port it to UNIX in the immediate future so I am very much stuck.

I suppose a possible solution could be to create a single virtual machine with 64 cores and RDC to that? I think this will work but I will of course be limited in the amount of processing I'm able to undertake.

Long term, is deploying a Windows instance something that you're considering?

If you are interested, the package I am trying to install can be found at mrke/nichemapr/

All the best Simon

paselem commented 7 years ago

Hey Simon,

That feedback helps a lot. I agree with you that it's a little bit ironic that we are focused on Linux OSes, but for most of our use cases it does have some advantages. This is actually the first time I've heard of someone having a Windows-only R package (which is good to know about!). Are you aware of any other packages that are Windows-only?

As far as your workload, can you share a bit more details regarding the processing and data requirements? Are you doing simulations, modelling? How many data points or iterations are you trying to do?

Have you considered Microsoft R Server in your research? It may be worth looking into. They have developed a pretty nice feature set that can allow you to scale with a mixture of hardware and software solutions. Without knowing too much about your workload MRS Deploy could be an interesting solution.

You asked a question in your original thread that I did not address: "This package uses the Linux Edition of the DSVM which comes preinstalled with Microsoft R Server Developer edition" [https://github.com/Azure/doAzureParallel/blob/master/docs/00-azure-introduction.md]. What does this mean in practice?

I'm not sure if you are familiar with the DSVM, but it's a great pre-created virtual machine image that comes packed with a ton of data-science software, including R & RStudio, Python-Anaconda, SQL, TensorFlow, a bunch of editors and data sets, and lots more. It's a great way to quickly bring up a VM that has pretty much everything you'd need to do your work.

We have been using that as our base image in doAzureParallel, but for a variety of reasons we are moving away from it and looking to support containers instead. I'm bringing this up because, as part of that work, we could look at how much effort it would be to support a Windows-based container alongside our planned Linux one (with the caveat that there are other higher-priority items in our backlog at the moment). This may hopefully unblock scenarios like yours in the future.

Thanks, -Pablo

simon-tarr commented 7 years ago

Hi Pablo,

It's the first instance of a Windows/MacOS only package that I've come across, too. Typical that it would be the one I need the most. I'm unaware of other packages that may be Windows-only although I suspect they exist if they're calling on proprietary/legacy code that's only compiled for Windows.

With regards to the data and processing requirements: at the moment they are relatively small. I need to run two sub-models using the above package, which take approx. 0.6s per iteration to run (the two iterations take place for the centre of each grid cell across a landscape at a user-specified resolution, e.g. 1km^2, 10km^2 etc.). At present the spatial extent I am computing across is small (just the Caribbean, at around 220,000km^2, so 440,000 runs in total) but, moving forward, I would like to extend it across many more species and the entire globe at 1km^2 (which is many tens of millions of iterations). Fortunately the problem is embarrassingly parallel - I just need a lot of cores. The doAzureParallel package appeared ideal because I'd already written my code to run in parallel locally (to compute distributions across small Caribbean islands), so it would be trivial to push it to the cloud to scale things up.
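An embarrassingly parallel per-cell workload like this is typically a plain foreach loop, which is exactly what doAzureParallel builds on: the same %dopar% loop runs locally or in the cloud depending on which backend is registered. A minimal local sketch, where run_cell is a hypothetical stand-in for the NicheMapR sub-models:

```r
library(foreach)
library(doParallel)

# Register a local parallel backend. With doAzureParallel you would instead
# call registerDoAzureParallel(cluster) and the loop below runs unchanged
# across the Azure pool.
cl <- makeCluster(2)
registerDoParallel(cl)

# Hypothetical stand-in for the per-grid-cell model runs.
run_cell <- function(cell_id) {
  sqrt(cell_id)  # placeholder computation (~0.6s per real iteration)
}

# One iteration per grid cell; results are combined into a vector.
results <- foreach(cell = 1:8, .combine = c) %dopar% run_cell(cell)

stopCluster(cl)
```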

I don't know much about MRS Deploy or the DSVM, but thank you for some background information and the links - I will look into these as potential solutions this week.

I'm very much in the dark with regards to the technical work that underpins a Windows-based container but, in my use scenario, it would be very simple - Windows (any version) and R, with the ability to install packages via GitHub (as is already a feature for Linux doAzureParallel). That's it, really.

All the best Simon

simon-tarr commented 7 years ago

Hi Pablo,

A quick update on this thread. I've been having a poke around the Azure Portal and I've noticed that there's already a Windows-based DSVM. According to the brief documentation, it says:

The 'Data Science Virtual Machine (DSVM)' is a 'Windows Server 2016 with Containers' VM & includes popular tools for data exploration, analysis, modeling & development.

Highlights:

Pre-configured and tested with Nvidia drivers, CUDA Toolkit 8.0, & NVIDIA cuDNN library for GPU workloads available if using NC class VM SKUs.

I was wondering how this Windows-based DSVM differs from the Linux one that doAzureParallel requests? Is it a considerable engineering challenge to call this Windows DSVM instead of the Linux one within the package?

Many thanks Simon

paselem commented 7 years ago

Hi Simon, You are absolutely correct in identifying the Windows DSVM, and it should be sufficient to host our work. The issue is that all of our additional setup on the VM to make sure our package works correctly assumes a Linux OS, so our commands look like "/bin/bash <something Linux-specific>" rather than "cmd.exe /c <something Windows-specific>".

As I mentioned before we are actively investigating supporting containers at which point it may make sense to have a Windows supported container where we can make sure everything works. This is roughly the 3rd highest priority item on our list, so I am hoping that we can start implementation near the end of this month if not sooner. At that point we will need to rethink how we prepare a node for doAzureParallel and can take a look at costing out a Windows-supported container.

Do you have any specific timelines you're targeting?

Thanks, -Pablo

dustindall commented 7 years ago

Hi Pablo,

This might not be the thread for this, but can you update us on your priority list?

I'm always looking forward to new developments with R and Azure.

Thanks, Dustin

paselem commented 7 years ago

Hi @dustindall, this is probably the wrong place to discuss that. I have brought it up with the team though. We are thinking about how to track this on GitHub and make users aware of future features and timelines. We're thinking of using projects or the GitHub wiki... We'll have something up in the next week or so. Stay tuned!

Thanks, -Pablo

simon-tarr commented 7 years ago

Hi Pablo,

Many thanks for your reply. I didn't mean to hassle you - sorry! I'm not a software developer so I can't really envision the process or work required to get something as complex as this out of the door!

With regards to timelines - the sooner the better but I do have some flexibility. If I can provision a 64-core VM, install R, and RDC to that, then it will tide me through my current project which is on track to take me up to around Christmas. After this project is out of the way, I would like to conduct some global analyses and this is really where I will benefit from the power of Azure. I wouldn't be able to carry out the global analyses (at the resolution I would like) without the cloud. I can decrease the resolution to decrease the number of model iterations but I'd like to avoid that if I can wait some months for a possible solution.

Many thanks Simon

paselem commented 7 years ago

Hello Simon,

Not a hassle. I fully understand why you're set on looking for a solution to this - it's pretty much why we started this project in the first place. I am happy to have people challenge our assumptions and priorities since it keeps us focused on solving what people actually need. As I mentioned above, we will take a look at what it will take to support a Windows-based container solution as we figure out the overall container feature.

Thanks for all the feedback, and please keep it coming. -Pablo

simon-tarr commented 7 years ago

Excellent, thank you. Looking forward to an update in due course.

paselem commented 7 years ago

Hello @simon-tarr

Looks like there has been an update to the package which now supports Linux! https://github.com/mrke/NicheMapR/pull/1. Might be worth giving it a try to see if it works.

Thanks, -Pablo

simon-tarr commented 7 years ago

Well how about that! Thanks for the heads up too - I'll run some testing this week on Azure and see what happens. Cheers, Simon.

simon-tarr commented 7 years ago

Unfortunately I can't get the package to work with doAzureParallel. I get a "start task failed" error when I check my pool at portal.azure.com. I have set setVerbose(TRUE) and re-run my cluster.json script, hoping to find the signature of an error to help troubleshoot the problem, but I'm getting nowhere. Perhaps more development is required within NicheMapR's new Linux support for this to work as expected.

paselem commented 7 years ago

That's unfortunate. To see what is going on with the start task, were you able to get the errors from the logs? If so, can you share them? Perhaps there is something in there that can help identify the root cause.

Thanks, -Pablo

simon-tarr commented 7 years ago

Hi Pablo, would the errors be shown in the console when setVerbose(TRUE) is set? I couldn't see an obvious error message in the console when running doAzureParallel::makeCluster(), but then I don't really know what I'm looking for!

Cheers, Simon.

paselem commented 7 years ago

No, I don't think it will show up with verbose printing. We are working on a way to improve this experience because we know it's a pain point. Right now, you may have to go to the Azure portal to look at the logs:

  1. Log into the portal at portal.azure.com
  2. Find your batch account and open it
  3. Click on Pools in the left navigation menu
  4. Open your pool
  5. Select 'Nodes' in the left navigation menu
  6. Select any node (its state should be 'starttaskfailed')
  7. Select 'Files' in the left navigation menu
  8. Look for startup/stdout.txt and startup/stderr.txt and take a look at them.

Thanks, -Pablo

simon-tarr commented 7 years ago

Brilliant, thank you. I appreciate the help and I realise this is outside your remit to offer support on this, so thank you. I'll report back when I have a bit more information. Cheers, Simon.

simon-tarr commented 7 years ago

I have contacted the developer who has worked to bring Linux support to NicheMapR. He says:

Including binary libs is not very common so there are bound to be some issues with this, thanks for pointing it out. It seems the main question is whether MICROCLIMATE.so is actually in the folder /usr/lib64/microsoft-r/3.3/lib64/R/library/NicheMapR/libs/linux/. If not, the question is why it is not there. The standard R install() process normally copies everything in the inst/ folder to the project directory, so NicheMapR/inst/libs/linux/ ends up in NicheMapR/libs/linux/ - but this is missing in the error. For some reason your provisioning process is not doing a normal install. A build flag like --no-inst or --no-libs could be responsible, but some other flags could do this too. What is the build process for the NicheMapR package in your provisioning?

This was in response to the error message I found from the logs:

gfortran   -fpic  -g -O2  -c gads.f -o gads.o
gcc -std=gnu99 -shared -L/usr/lib64/microsoft-r/3.3/lib64/R/lib -o NicheMapR.so gads.o -lgfortran -lm -L/usr/lib64/microsoft-r/3.3/lib64/R/lib -lR
mv NicheMapR.so gads.so
Error : .onLoad failed in loadNamespace() for 'NicheMapR', details:
  call: dyn.load(micro_lib)
  error: unable to load shared object '/usr/lib64/microsoft-r/3.3/lib64/R/library/NicheMapR/libs/linux/MICROCLIMATE.so':
  libgfortran.so.4: cannot open shared object file: No such file or directory
Error: loading failed
Execution halted

I'm not sure how packages are built within DSVMs - if you have any information regarding this, it would be very helpful. Thanks again, Simon.

paselem commented 7 years ago

Hello Simon,

I unfortunately can't repro this since I don't have the package installed on my local environment. I took a quick look at the library directory and obviously don't have NicheMapR since I'm not installing it.

One thing you can do is to SSH into the node yourself and poke around. Following the same set of instructions I outlined above, you can navigate to a node in the portal and connect to it over SSH.

Once connected, you can cd into the library directory to poke around

cd /usr/lib64/microsoft-r/3.3/lib64/R/library/

or go straight to where the package should be based on the author's feedback:

cd /usr/lib64/microsoft-r/3.3/lib64/R/library/NicheMapR/libs/linux/ 
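If the file is there, ldd can confirm which shared-library dependency is failing to resolve. A diagnostic sketch (the path assumes the layout from the error log above; run it on the node itself):

```shell
# List the shared libraries MICROCLIMATE.so depends on; anything the loader
# cannot resolve is reported as "not found" (e.g. libgfortran.so.4 above).
ldd /usr/lib64/microsoft-r/3.3/lib64/R/library/NicheMapR/libs/linux/MICROCLIMATE.so
```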

@brnleehng - Are you aware of the package using --no-inst or --no-libs when installing packages?

@simon-tarr - Can you share your cluster config or snippet of code where you are installing the package?

Thanks, -Pablo

paselem commented 7 years ago

@simon-tarr - I am going to close this since it looks like there is a NicheMapR for Linux. At the moment there are no plans to support Windows as the base OS.