Azure / doAzureParallel

An R package that allows users to submit parallel workloads in Azure
MIT License

Warning message: In quit... system call failed: Cannot allocate memory #322

Open ctlamb opened 5 years ago

ctlamb commented 5 years ago

Some of my nodes are failing with this error:

Warning message: In quit(save = "yes", status = workerErrorStatus, runLast = FALSE) : system call failed: Cannot allocate memory

Does this mean I need a VM with more memory?

brnleehng commented 5 years ago

Hi @ctlamb

Yes, this means you will need a VM size with more memory. I suggest measuring the memory usage of each task so you have a benchmark for choosing an Azure VM size.
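For example, a minimal sketch of benchmarking a task's steps, assuming the CRAN package peakRAM (an assumption here, though its output columns match the readout posted later in this thread):

```r
# peakRAM() evaluates each expression and returns a data frame with
# Elapsed_Time_sec, Total_RAM_Used_MiB and Peak_RAM_Used_MiB per call.
library(peakRAM)

mem <- peakRAM(
  x <- rnorm(1e7),   # stand-ins for the real steps of your task
  y <- sort(x)
)
print(mem)
```

A VM size whose memory comfortably exceeds the largest Peak_RAM_Used_MiB (plus overhead for R itself and the container) is a reasonable starting point.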

Thanks Brian

ctlamb commented 5 years ago

Excellent, will do. In the meantime, I tried a machine with slightly more memory ("vmSize" = "Standard_E4_v3"), but I am running into the following error after I run foreach (this error doesn't occur with "vmSize" = "Standard_DS12_v2"):

```
Error: No automatic parser available for 7b/.
```

brnleehng commented 5 years ago

What region are you in? It's possible that Standard_E4_v3 is not available in your region. Is this happening during makeCluster?

ctlamb commented 5 years ago

I'm in West US. The error is thrown in foreach.

It looks like my tasks only use a max of 8 GB of RAM, so the 28 GB of RAM on the Standard_DS12_v2 should have been plenty. Not sure what's going on here.

Memory usage readout


| # | Function_Call | Elapsed_Time_sec | Total_RAM_Used_MiB | Peak_RAM_Used_MiB |
|---|---------------|------------------|--------------------|-------------------|
| 1 | `doAzureParallel::setCredentials(credentials)` | 0.005 | 0.0 | 0.0 |
| 2 | `mod<-mod.files$FilePath[bp$model[i]]` | 0.000 | 0.0 | 0.0 |
| 3 | `tile<-r.files$FilePath[bp$tile[i]]` | 0.000 | 0.0 | 0.0 |
| 4 | `doAzureParallel::getStorageFile(container="occmodels",blobPath=paste0(mod),downloadPath=paste0(mod),overwrite=TRUE)` | 49.721 | 190.6 | 190.6 |
| 5 | `brt<-readRDS(paste0(mod))` | 10.233 | 665.8 | 665.8 |
| 6 | `doAzureParallel::getStorageFile(container="rastertiles",blobPath=paste0(tile),downloadPath=paste0(tile),overwrite=TRUE)` | 496.358 | 1996.0 | 1996.0 |
| 7 | `unzip(paste0(tile),exdir=here::here(),junkpaths=TRUE,overwrite=TRUE)` | 27.612 | 0.0 | 0.0 |
| 8 | `raster_data<-list.files(here::here(),pattern=".tif$",full.names=TRUE)` | 0.150 | 0.0 | 0.0 |
| 9 | `STACK<-raster::stack(raster_data)` | 2.337 | 0.3 | 6.0 |
| 10 | `STACK[["CutBlock_Occurrence"]]<-ratify(STACK[["CutBlock_Occurrence"]])` | 5.092 | 0.0 | 1161.7 |
| 11 | `STACK[["Fire_Occ"]]<-ratify(STACK[["Fire_Occ"]])` | 5.012 | 0.0 | 1161.7 |
| 12 | `STACK[["CRDP_LC"]]<-ratify(STACK[["CRDP_LC"]])` | 5.132 | 0.0 | 1161.7 |
| 13 | `STACK[["MODIS_LC"]]<-ratify(STACK[["MODIS_LC"]])` | 4.990 | 0.0 | 1161.7 |
| 14 | `pred<-dismo::predict(STACK,brt,n.trees=brt$gbm.call$best.trees,type="response")` | 22156.271 | 387.8 | 8056.5 |
| 15 | `return(pred)` | 0.000 | 0.0 | 0.0 |

brnleehng commented 5 years ago

Are you setting maxTasksPerNode greater than 1 in your cluster configuration?

ctlamb commented 5 years ago

No, it's set to 1:

```r
clusterConfig <- list(
  "name" = "LambRaster",
  "vmSize" = "Standard_DS12_v2",
  "maxTasksPerNode" = 1,
  "poolSize" = list(
    "dedicatedNodes" = list("min" = 1, "max" = 200),
    "lowPriorityNodes" = list("min" = 0, "max" = 0),
    "autoscaleFormula" = "QUEUE"
  ),
  "containerImage" = "rocker/geospatial:latest",
  "rPackages" = list(
    "cran" = c("doParallel", "here", "dismo", "gbm", "snow"),
    "github" = c("Azure/doAzureParallel"),
    "bioconductor" = c()
  ),
  "commandLine" = list()
)
```
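For context, a minimal sketch of how a config list like this is typically consumed (following the doAzureParallel README; the credentials file name is an assumption):

```r
library(doAzureParallel)

setCredentials("credentials.json")     # hypothetical credentials file
cluster <- makeCluster(clusterConfig)  # makeCluster also accepts a path to a cluster.json
registerDoAzureParallel(cluster)
getDoParWorkers()                      # sanity check: worker count of the registered backend
```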

ctlamb commented 5 years ago

Is there a better/preferred package I could use to measure the memory usage?

ctlamb commented 5 years ago

Now I'm getting `Error: No automatic parser available for 7b/.` even when I use the D12 machine. Ugh, it's always hard to troubleshoot one issue (memory) when another pops up. Any thoughts? I could start a new thread if it's easier.

brnleehng commented 5 years ago

I don't have a preferred package for measuring memory usage. Where exactly is this error occurring? Is it happening when foreach retrieves the results?

If you have a cluster configuration file and a reproducible sample, I will work on identifying the issue.
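A skeleton of what such a reproducible sample might look like (a sketch only; the JSON file names are assumptions):

```r
library(doAzureParallel)

setCredentials("credentials.json")
cluster <- makeCluster("cluster.json")  # e.g. the cluster config posted above
registerDoAzureParallel(cluster)

# A tiny job that exercises submission and result retrieval; if the
# "No automatic parser available for 7b/." error reproduces here, it is
# independent of the heavy raster workload.
results <- foreach(i = 1:10) %dopar% {
  Sys.sleep(1)
  i^2
}

stopCluster(cluster)
```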

simon-tarr commented 5 years ago

This is the same as issue #315. I've spent many an hour pulling my hair out over this issue and I've no idea what's causing it. I've provided a lot of qualitative information in #315 but haven't had time to build a fully reproducible example at the scale which I think is generating the error.

@ctlamb is your workflow using resource files uploaded to Azure storage? My workflow is and I haven't been able to determine whether the 7b error still occurs when not using resource files. I'd like to attempt to rule out whether resource files could be contributing in some way.

ctlamb commented 5 years ago

Yes, I am uploading and downloading data to Azure storage in my workflow. I do wonder if this was an internet issue? My internet speed was recently upgraded and I haven't got the 7b error since, but that's only based on 5-10 tries so far. Will update if anything changes.

simon-tarr commented 5 years ago

> Yes, I am uploading and downloading data to Azure storage in my workflow. I do wonder if this was an internet issue? My internet speed was recently upgraded and I haven't got the 7b error since, but that's only based on 5-10 tries so far. Will update if anything changes.

Thanks for the extra information. My latest post at #315 documents the return of the dreaded 7b error.

I considered your idea here as well. However, my university network is a gigabit connection and it's rock solid. My home internet is a 100Mb fibre connection which is also super reliable (for the most part).

I wonder if there's a limit to the number of connections Batch/httr can accept from a single IP address? I'm currently running two pools on my laptop (home network) and three on my uni workstation, and they've been stable all day. If I try to run any more pools than this on either machine, the 7b error returns almost instantly. It's very strange...

brnleehng commented 5 years ago

Are all of your workflows in interactive mode (waiting for the job to finish)?

Thanks, Brian

simon-tarr commented 5 years ago

Mine is, yes.

simon-tarr commented 5 years ago

Any news on the status of this error? It's still happening to me with frustrating regularity.

Thanks!