cloudera / poisson_sampling

Running fitRandomForest with a small input data sample results in an exception (M/R terminates) #4

Closed andrewmilkowski closed 11 years ago

andrewmilkowski commented 11 years ago

The second issue is that if the input data sample is reduced (the example below uses only 20 rows from the overall training set)

transactions <- read.table(file="../downloads/train.csv",
                       # nrows=1000,
                       nrows=20,

running fitRandomForest terminates with the following exception:

Loading required package: randomForest
randomForest 4.6-7
Type rfNews() to see new features/changes/bug fixes.
Loading required package: rmr2
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: methods
Loading required package: bitops
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
Dotted pair list of 12
 $ : language (function() { load("./rmr-local-envaaeb61a5a326") ...
 $ : language rmr2:::map.loop(map = map, keyval.reader = input.reader(), keyval.writer = if (is.null(reduce)) { output.writer() ...
 $ : language as.keyval(map(keys(kv), values(kv)))
 $ : language is.keyval(x)
 $ : language map(keys(kv), values(kv))
 $ : language c.keyval(lapply(1:num.models, generate.sample))
 $ : language f.single(args[[1]])
 $ : language lapply(kvs, recycle.keyval)
 $ : language FUN(X[[1L]], ...)
 $ : language keyval(rmr.recycle(k, v), rmr.recycle(v, k))
 $ : language rmr.recycle(k, v)
 $ : language rmr.str(lx)
lx
 int 1
Dotted pair list of 12
 $ : language (function() { load("./rmr-local-envaaeb61a5a326") ...
 $ : language rmr2:::map.loop(map = map, keyval.reader = input.reader(), keyval.writer = if (is.null(reduce)) { output.writer() ...
 $ : language as.keyval(map(keys(kv), values(kv)))
 $ : language is.keyval(x)
 $ : language map(keys(kv), values(kv))
 $ : language c.keyval(lapply(1:num.models, generate.sample))
 $ : language f.single(args[[1]])
 $ : language lapply(kvs, recycle.keyval)
 $ : language FUN(X[[1L]], ...)
 $ : language keyval(rmr.recycle(k, v), rmr.recycle(v, k))
 $ : language rmr.recycle(k, v)
 $ : language rmr.str(ly)
ly
 int 0
Error in rmr.recycle(k, v) : Can't recycle 0-length argument
Calls: ... c.keyval -> f.single -> lapply -> FUN -> keyval -> rmr.recycle
Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:390)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
    at org.apache.hadoop.mapred.Child.main(Child.java:260)

laserson commented 11 years ago

It's not clear to me where the read.table call is coming in, as fitRandomForest.R only consumes data from Hadoop. Perhaps some map tasks are somehow calling keyval with no data?

laserson commented 11 years ago

I would also cross-post on the rmr repo, as it appears the error is generated in an rmr function.

andrewmilkowski commented 11 years ago

Will do; I believe you are right in this particular test case scenario.

piccolbo commented 11 years ago

An rmr.str(v) at the beginning of the map function would clarify the issue. It seems Uri's interpretation is correct, but it raises the question of why that happens.

andrewmilkowski commented 11 years ago

Antonio,

let me move this comment and further discussion to the rmr2 issue tracker (https://github.com/RevolutionAnalytics/rmr2/issues/69), so as to isolate the issue to the correct component, for now...

I have added the proposed debug statement at the beginning of the mapper function:

# MAP function
poisson.subsample <- function(k, input) {
  rmr.str(input)
  # this function is used to generate a sample from the current block of data

The following is the output in the stderr logs:

v), values(kv))
 $ : language rmr.str(input)
input
'data.frame': 10 obs. of 74 variables:
 $ SalePrice : num 26500 9500 19000 11500 65000 24000 38500 13500 21500 36000
 $ ModelID.x : Factor w/ 9 levels "21442","2232",..: 8 1 7 3 6 8 4 9 5 2
 $ datasource : Factor w/ 1 level "121": 1 1 1 1 1 1 1 1 1 1
 $ auctioneerID : Factor w/ 1 level "3": 1 1 1 1 1 1 1 1 1 1
 $ YearMade : num 2004 2003 1999 1991 1000 ...
 $ MachineHoursCurrentMeter: num 508 0 2450 8005 20700 ...
 $ UsageBand : Factor w/ 3 levels "High","Low","Medium": 2 NA 3 3 3 3 1 2 2 NA
 $ saledate : Factor w/ 10 levels "2005-10-20","2005-11-17",..: 7 9 3 2 5 6 10 4 8 1
 $ fiModelDesc.x : Factor w/ 9 levels "310E","310G",..: 2 5 1 6 7 2 8 3 4 9
 $ fiBaseModel.x : Factor w/ 8 levels "310","334","430",..: 1 4 1 5 6 1 7 2 3 8
 $ fiSecondaryDesc.x : Factor w/ 6 levels "B","E","G","HAG",..: 3 5 2 6 1 3 NA NA 4 NA
 $ fiModelSeries.x : Factor w/ 2 levels "-6E","LC": NA NA NA NA NA NA 1 NA NA 2
 $ fiModelDescriptor.x : int NA NA NA NA NA NA NA NA NA 6
 $ ProductSize : Factor w/ 4 levels "Large","Large / Medium",..: NA 3 NA NA 1 NA 4 3 3 2
 $ fiProductClassDesc.x : Factor w/ 6 levels "Backhoe Loader - 14.0 to 15.0 Ft Standard Digging Depth",..: 1 5 1 1 6 1 2 4 4 3
 $ state : Factor w/ 8 levels "Arizona","Arkansas",..: 1 8 2 4 3 6 7 3 7 5
 $ ProductGroup.x : Factor w/ 3 levels "BL","TEX","WL": 1 2 1 1 3 1 2 2 2 2
 $ ProductGroupDesc.x : Factor w/ 3 levels "Backhoe Loaders",..: 1 2 1 1 3 1 2 2 2 2
 $ Drive_System : Factor w/ 2 levels "Four Wheel Drive",..: 1 NA 2 2 NA 1 NA NA NA NA
 $ Enclosure : Factor w/ 3 levels "EROPS","EROPS w AC",..: 3 1 3 1 2 3 2 1 1 1
 $ Forks : logi NA NA NA NA NA NA ...
 $ Pad_Type : Factor w/ 1 level "Street": NA NA NA NA NA 1 NA NA NA NA
 $ Ride_Control : Factor w/ 1 level "No": 1 NA 1 1 NA 1 NA NA NA NA
 $ Stick : Factor w/ 2 levels "Extended","Standard": 1 NA 2 2 NA 2 NA NA NA NA
 $ Transmission : Factor w/ 2 levels "Powershuttle",..: 1 NA 2 2 NA 2 NA NA NA NA
 $ Turbocharged : logi NA NA NA NA NA NA ...
 $ Blade_Extension : logi NA NA NA NA NA NA ...
 $ Blade_Width : logi NA NA NA NA NA NA ...
 $ Enclosure_Type : logi NA NA NA NA NA NA ...
 $ Engine_Horsepower : logi NA NA NA NA NA NA ...
 $ Hydraulics : Factor w/ 2 levels "2 Valve","Auxiliary": NA 2 NA NA 1 NA 1 2 2 2
 $ Pushblock : logi NA NA NA NA NA NA ...
 $ Ripper : logi NA NA NA NA NA NA ...
 $ Scarifier : logi NA NA NA NA NA NA ...
 $ Tip_Control : logi NA NA NA NA NA NA ...
 $ Tire_Size : logi NA NA NA NA NA NA ...
 $ Coupler : Factor w/ 1 level "Manual": NA NA NA NA NA NA NA NA 1 NA
 $ Coupler_System : logi NA NA NA NA NA NA ...
 $ Grouser_Tracks : logi NA NA NA NA NA NA ...
 $ Hydraulics_Flow : logi NA NA NA NA NA NA ...
 $ Track_Type : Factor w/ 2 levels "Rubber","Steel": NA 2 NA NA NA NA NA 1 1 2
 $ Undercarriage_Pad_Width : int NA 16 NA NA NA NA NA NA NA NA
 $ Stick_Length : num NA NA NA NA NA NA NA NA NA 132
 $ Thumb : logi NA NA NA NA NA NA ...
 $ Pattern_Changer : logi NA NA NA NA NA NA ...
 $ Grouser_Type : Factor w/ 1 level "Double": NA 1 NA NA NA NA NA 1 1 1
 $ Backhoe_Mounting : logi NA NA NA NA NA NA ...
 $ Blade_Type : logi NA NA NA NA NA NA ...
 $ Travel_Controls : logi NA NA NA NA NA NA ...
 $ Differential_Type : Factor w/ 1 level "Standard": NA NA NA NA 1 NA NA NA NA NA
 $ Steering_Controls : Factor w/ 1 level "Conventional": NA NA NA NA 1 NA NA NA NA NA
 $ saledatenumeric : num 14231 14637 13468 13104 13734 ...
 $ ageAtSale : num 1539 2311 2603 5161 367746 ...
 $ saleYear : num 2008 2010 2006 2005 2007 ...
 $ saleMonth : Factor w/ 7 levels "August","December",..: 2 3 6 6 1 1 5 4 1 7
 $ saleDay : Factor w/ 10 levels "09","14","16",..: 5 10 3 4 1 8 6 2 9 7
 $ saleWeekday : Factor w/ 1 level "Thursday": 1 1 1 1 1 1 1 1 1 1
 $ MedianModelPrice : int 25250 9500 19000 11500 65000 25250 38500 13500 21500 36000
 $ ModelCount : num 2 1 1 1 1 2 1 1 1 1
 $ ModelID.y : Factor w/ 9 levels "16705","21442",..: 8 2 7 4 6 8 1 9 5 3
 $ fiModelDesc.y : Factor w/ 9 levels "310E","310G",..: 2 5 1 6 7 2 9 3 4 8
 $ fiBaseModel.y : Factor w/ 8 levels "310","334","430",..: 1 4 1 5 6 1 8 2 3 7
 $ fiSecondaryDesc.y : Factor w/ 6 levels "B","E","G","LC",..: 3 5 2 6 1 3 NA NA NA 4
 $ fiModelSeries.y : int NA NA NA NA NA NA -6 NA NA 6
 $ fiModelDescriptor.y : Factor w/ 1 level "LK": NA NA NA NA NA NA NA NA NA 1
 $ fiProductClassDesc.y : Factor w/ 6 levels "Backhoe Loader - 14.0 to 15.0 Ft Standard Digging Depth",..: 1 3 1 1 6 1 5 2 2 4
 $ ProductGroup.y : Factor w/ 3 levels "BL","TEX","WL": 1 2 1 1 3 1 3 2 2 2
 $ ProductGroupDesc.y : Factor w/ 3 levels "Backhoe Loaders",..: 1 2 1 1 3 1 3 2 2 2
 $ MfgYear : num 2004 2003 1999 1991 1987 ...
 $ fiManufacturerID : Factor w/ 6 levels "103","121","25",..: 6 4 6 3 5 6 1 2 2 1
 $ fiManufacturerDesc : Factor w/ 6 levels "Bobcat","Case",..: 5 4 5 2 3 5 6 1 1 6
 $ PrimarySizeBasis : Factor w/ 3 levels "Horsepower","Standard Digging Depth - Ft",..: 2 3 2 2 1 2 1 3 3 3
 $ PrimaryLower : int 14 4 14 14 350 14 225 3 3 40
 $ PrimaryUpper : int 15 5 15 15 500 15 250 4 4 50
Dotted pair list of 12
 $ : language (function() { load("./rmr-local-env9432cc02004") ...
 $ : language rmr2:::map.loop(map = map, keyval.reader = input.reader(), keyval.writer = if (is.null(reduce)) { output.writer() ...
 $ : language as.keyval(map(keys(kv), values(kv)))
 $ : language is.keyval(x)
 $ : language map(keys(kv), values(kv))
 $ : language c.keyval(lapply(1:num.models, generate.sample))
 $ : language f.single(args[[1]])
 $ : language lapply(kvs, recycle.keyval)
 $ : language FUN(X[[4L]], ...)
 $ : language keyval(rmr.recycle(k, v), rmr.recycle(v, k))
 $ : language rmr.recycle(k, v)
 $ : language rmr.str(lx)
lx
 int 1
Dotted pair list of 12
 $ : language (function() { load("./rmr-local-env9432cc02004") ...
 $ : language rmr2:::map.loop(map = map, keyval.reader = input.reader(), keyval.writer = if (is.null(reduce)) { output.writer() ...
 $ : language as.keyval(map(keys(kv), values(kv)))
 $ : language is.keyval(x)
 $ : language map(keys(kv), values(kv))
 $ : language c.keyval(lapply(1:num.models, generate.sample))
 $ : language f.single(args[[1]])
 $ : language lapply(kvs, recycle.keyval)
 $ : language FUN(X[[4L]], ...)
 $ : language keyval(rmr.recycle(k, v), rmr.recycle(v, k))
 $ : language rmr.recycle(k, v)
 $ : language rmr.str(ly)
ly
 int 0
Error in rmr.recycle(k, v) : Can't recycle 0-length argument
Calls: ... c.keyval -> f.single -> lapply -> FUN -> keyval -> rmr.recycle
Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:390)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
    at org.apache.hadoop.mapred.Child.main(Child.java:260)
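One possible reading of the trace above, offered as a sketch rather than a confirmed diagnosis: the mapper received only 10 rows, and the failure occurs inside lapply(1:num.models, generate.sample) at FUN(X[[4L]], ...), which would be consistent with one of the Poisson samples coming back empty. The base R snippet below assumes that the sampler draws one Poisson count per row and repeats rows accordingly. The rate frac.per.model = 0.1, the value num.models = 50, the helper name sample.rows, and the toy input data frame are all illustrative assumptions, not values taken from this thread.

```r
# Sketch only: base R, no Hadoop or rmr2 required.
set.seed(42)  # arbitrary seed, for repeatability

# Hypothetical stand-in for one mapper's block: 10 rows, as in the log above.
input <- data.frame(SalePrice = seq(10000, 55000, by = 5000))

frac.per.model <- 0.1  # assumed Poisson rate per row
num.models <- 50       # assumed number of models

# Assumed shape of the sampler: one Poisson draw per row, rows repeated by draw.
sample.rows <- function(i) {
  draws <- rpois(n = nrow(input), lambda = frac.per.model)
  indices <- rep(seq_len(nrow(input)), draws)
  input[indices, , drop = FALSE]
}

# Number of rows in each of the num.models samples.
sizes <- vapply(seq_len(num.models),
                function(i) nrow(sample.rows(i)),
                integer(1))

# Probability that a single sample is empty: exp(-n * lambda).
p.empty <- exp(-nrow(input) * frac.per.model)

cat("empty samples:", sum(sizes == 0), "of", num.models, "\n")
cat("theoretical P(empty):", round(p.empty, 3), "\n")
```

Under these assumptions a single sample is empty with probability exp(-1), roughly 0.37, so with a 10-row block some empty samples are to be expected. With 1000 rows per block the same probability is exp(-100), which may explain why the larger input did not trigger the error.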

andrewmilkowski commented 11 years ago

@laserson

Sorry I confused you a bit. The lines

transactions <- read.table(file="../downloads/train.csv",
                       # nrows=1000,
                       nrows=20,

come from joinData.R; that is how I reduced the number of samples passed to fitRandomForest.R.

The problem is internal to rmr2, as seen in the trace above: Error in rmr.recycle(k, v) : Can't recycle 0-length argument
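While the rmr2 side is being investigated, one workaround that could be tried in the mapper is to skip empty samples before they are emitted. The sketch below is base R only and uses hypothetical helper names (draw.indices, nonempty.samples) and an assumed rate of 0.1. In the real mapper each surviving element would presumably be wrapped with keyval and combined with c.keyval, which is not exercised here.

```r
# Sketch only: base R, hypothetical helper names, no rmr2 calls.

# Row indices for one Poisson sample (assumed sampling scheme).
draw.indices <- function(n, lambda) {
  rep(seq_len(n), rpois(n = n, lambda = lambda))
}

# Build num.models samples from a block, then drop the empty ones.
nonempty.samples <- function(input, num.models, lambda) {
  samples <- lapply(seq_len(num.models), function(i) {
    rows <- input[draw.indices(nrow(input), lambda), , drop = FALSE]
    list(key = i, val = rows)
  })
  Filter(function(s) nrow(s$val) > 0, samples)
}

set.seed(7)  # arbitrary seed, for repeatability
block <- data.frame(x = 1:10)  # toy 10-row block
kept <- nonempty.samples(block, num.models = 50, lambda = 0.1)
cat("kept", length(kept), "of 50 samples\n")
```

Dropping an empty sample means that model simply receives no rows from this block, which matches what an empty draw would contribute anyway.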

laserson commented 11 years ago

Issue moved to rmr repo.