RevolutionAnalytics / rmr2

A package that allows R developers to use Hadoop MapReduce

Cross-reference: issues in a project that uses rmr2 as its base M/R library, applying a Poisson approximation to train a random forest #68

Open andrewmilkowski opened 10 years ago

andrewmilkowski commented 10 years ago

https://github.com/cloudera/poisson_sampling/issues?state=open

but I believe the following might be of relevance:

"reducer task stuck at 67%"

https://github.com/cloudera/poisson_sampling/issues/3

The latest rmr2 was used, installed via:

git clone git://github.com/RevolutionAnalytics/rmr2.git
R CMD INSTALL --byte-compile rmr2/pkg/

thanks and regards!

piccolbo commented 10 years ago

Hi, I would like to refer you to our debugging guide. In particular, can you reproduce the problem on the local backend, and can you inspect the standard error of a failing reduce attempt? Please let us know what you find out.
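For reference, a minimal sketch of what that local-backend reproduction could look like (the local path is hypothetical, and poisson.subsample, fit.trees and bulldozer.input.format are assumed to come from the Cloudera example this issue refers to):

    library(rmr2)

    # Switch rmr2 to the local backend: the whole job then runs in-process,
    # so anything the map or reduce functions write to stderr shows up
    # directly in the R console.
    rmr.options(backend = "local")

    # On the local backend the input is a plain local file, not an HDFS path.
    fit <- mapreduce(input        = "training.csv",   # hypothetical local copy of the data
                     input.format = bulldozer.input.format,
                     map          = poisson.subsample,
                     reduce       = fit.trees)

    # Pull the reduce output back into the R session for inspection.
    forests <- from.dfs(fit)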

Antonio

andrewmilkowski commented 10 years ago

Hi, thanks for the feedback

will follow: https://github.com/RevolutionAnalytics/RHadoop/wiki/Debugging-rmr-programs

but if something comes to mind, do let me know! We are seriously exploring rmr2 for PM...

The job stays at 68% (it is ultra slow!). Is it because I am running a single-machine test on a pseudo-distributed cluster? Would a test on a real cluster be more appropriate?

For now, just reporting that there is nothing unusual in stderr (it shows everything as it should be; the libraries loaded nicely):

Loading required package: randomForest
randomForest 4.6-7
Type rfNews() to see new features/changes/bug fixes.
Loading required package: rmr2
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: methods
Loading required package: bitops
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2

and syslog (from the task tracker) gets stuck doing:

2013-09-30 11:06:33,444 INFO org.apache.hadoop.mapred.ReduceTask: Merging 3 files, 1038419189 bytes from disk
2013-09-30 11:06:33,444 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
2013-09-30 11:06:33,446 INFO org.apache.hadoop.mapred.Merger: Merging 3 sorted segments
2013-09-30 11:06:33,470 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 1038419177 bytes
2013-09-30 11:06:33,500 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2013-09-30 11:06:33,501 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed exec [/usr/bin/Rscript, ./rmr-streaming-reduce13f010a0e9ab]
2013-09-30 11:06:33,564 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2013-09-30 11:07:45,650 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=10/0/0 in:0=10/72 [rec/s] out:0=0/72 [rec/s]

while checking top shows the R task at 97-99% CPU usage:

5730 mapred 20 0 968m 782m 4416 R 96.3 20.4 5:23.51 R

The job stays at 68% (it is ultra slow!):

[amilkowski@localhost src]$ tail -f fitRandomForest.Rout
13/09/30 11:05:49 INFO streaming.StreamJob: map 77% reduce 22%
13/09/30 11:05:55 INFO streaming.StreamJob: map 79% reduce 22%
13/09/30 11:05:58 INFO streaming.StreamJob: map 82% reduce 22%
13/09/30 11:06:01 INFO streaming.StreamJob: map 85% reduce 22%
13/09/30 11:06:04 INFO streaming.StreamJob: map 87% reduce 22%
13/09/30 11:06:07 INFO streaming.StreamJob: map 90% reduce 22%
13/09/30 11:06:13 INFO streaming.StreamJob: map 92% reduce 22%
13/09/30 11:06:16 INFO streaming.StreamJob: map 95% reduce 22%
13/09/30 11:06:19 INFO streaming.StreamJob: map 97% reduce 22%
13/09/30 11:06:22 INFO streaming.StreamJob: map 100% reduce 22%
13/09/30 11:06:37 INFO streaming.StreamJob: map 100% reduce 67%
13/09/30 11:11:59 INFO streaming.StreamJob: map 100% reduce 68%

andrewmilkowski commented 10 years ago

Running the test... as I write this email, the reduce merge sort has moved to 69%:

13/09/30 11:16:08 INFO streaming.StreamJob: map 100% reduce 69%

and..

13/09/30 11:24:30 INFO streaming.StreamJob: map 100% reduce 70%
13/09/30 11:30:09 INFO streaming.StreamJob: map 100% reduce 71%

The time delta in the upper right corner shows progression...

these are the culprits...

2013-09-30 11:19:55,745 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=136/2
2013-09-30 11:24:24,724 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=180/3
2013-09-30 11:30:02,760 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=218/4
2013-09-30 11:36:02,296 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=268/5

I wonder if this is a general R/streaming performance issue (there is only one reducer...).

OK... I need to perform a deeper local-backend test as suggested. M/R restarted the reducer (very strange), back to 0%, and then it crawls again:

13/09/30 11:40:55 INFO streaming.StreamJob: map 100% reduce 72%
13/09/30 11:46:10 INFO streaming.StreamJob: map 100% reduce 73%
13/09/30 12:12:53 INFO streaming.StreamJob: map 100% reduce 0%
13/09/30 12:13:06 INFO streaming.StreamJob: map 100% reduce 22%
13/09/30 12:13:12 INFO streaming.StreamJob: map 100% reduce 67%

The restart is due to a timeout; per the tasktracker logs, M/R is retrying as designed because the R reduce function is taking too long:

2013-09-30 12:12:52,907 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201309301101_0001_r_000000_0: Task attempt_201309301101_0001_r_000000_0 failed to report status for 1599 seconds. Killing!
2013-09-30 12:12:52,957 INFO org.apache.hadoop.mapred.TaskTracker: Process Thread Dump: lost task
33 active threads
Thread 110 (IPC Client (47) connection to localhost/127.0.0.1:8021 from mapred):
  State: TIMED_WAITING
  Blocked count: 1
  Waited count: 1
  Stack:
    java.lang.Object.wait(Native Method)
    org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:723)
    org.apache.hadoop.ipc.Client$Connection.run(Client.java:766)
Thread 98 (Readahead Thread #3):
  State: WAITING
  Blocked count: 0
  Waited count: 124
  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@425113f0

piccolbo commented 10 years ago

Since the reduce function is almost a one-liner, I would try to make sure the arguments are the size you expect them to be. Just before the randomForest call, add a

rmr.str(v)

This will print some information to stderr: first some stack information to understand where it was called (this is useful in more complex situations), then a short version of the contents of v. You need to locate stderr in the logs or switch to the local backend (then stderr will appear in the console). On the local backend the input should be a file on the local file system, not on HDFS. One way or another, you need to add some of these debugging techniques to your toolset; I wouldn't be able to complete any but the simplest of programs without them. Another thing would be to test randomForest separately, independent of rmr2, to get a sense of how fast it is for different input sizes.
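For concreteness, a sketch of the instrumented reducer (based on the fit.trees function from the Cloudera example, also quoted further down in this thread; model.formula is defined elsewhere in that example):

    fit.trees <- function(k, v) {
      # rmr.str() dumps the call stack and a str()-style summary of v to
      # standard error, so the size of each reduce input shows up in the
      # task's stderr log (or in the console on the local backend).
      rmr.str(v)
      rf <- randomForest(formula = model.formula, data = v,
                         na.action = na.roughfix, ntree = 10, do.trace = FALSE)
      keyval(k, list(forest = rf))
    }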

Antonio


andrewmilkowski commented 10 years ago

Antonio, this makes perfect sense, thank you. I have already started on some serious debugging/profiling (random forest by itself is already memory hungry).

I have currently increased some of the parameters. My underpowered Mac lives in a limited CPU/memory space, but I like to test models like this: if they behave well on a small setup, they will do even better on a large cluster.

will let you know... and thanks

    mapreduce(input="/poisson/training.csv",
              backend.parameters=list(hadoop=list(D="mapred.task.timeout=3600000",
                                                  D="mapred.job.name=fitRandomForest",
                                                  D="mapred.reduce.tasks=1",
                                                  D="mapred.map.child.java.opts=-Xmx2024m",
                                                  #D="mapred.reduce.child.java.opts=-Xmx2024m",
                                                  #D="io.sort.mb=1500",
                                                  D="io.sort.mb=64",
                                                  D="io.sort.factor=150",
                                                  D="io.sort.spill.percent=0.99",
                                                  D="mapred.job.reduce.input.buffer.percent=0.7",
                                                  D="mapred.immem.merge.threshold=0",
                                                  D="mapred.job.shuffle.merge.percent=0.8"
                                                  #D="mapred.task.profile=true",
                                                  #D="mapred.task.profile.params=-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=hprof1.txt",
                                                  #D="mapred.task.profile.maps=0-2"
                                                  ),
                                      local=list()),
              input.format=bulldozer.input.format,
              map=poisson.subsample,
              reduce=fit.trees,
              output="/poisson/output")

andrewmilkowski commented 10 years ago

Ayy, I might be running into the same exact issue as this gentleman:

https://groups.google.com/forum/#!msg/rhadoop/KAo1hYM-l04/sPMiVYAxK_kJ

Also, the random forest requires one final reducer... so I ask myself where the optimization will come from for a large input set with a single reducer... that is the real problem...

piccolbo commented 10 years ago

I doubt it. The author of the program you started from knows what he is doing; you can't compare his program with this other program, which is incorrect in so many ways. The Cloudera program has multiple reducer calls, so it doesn't "require" a single reducer under any reasonable interpretation. Yours may be different, but then your program is sequential and there is no point in using Hadoop. Random forests may not be best suited for MapReduce because the resampled data sets are still a big fraction of the original. Take a look at the "Bag of Little Bootstraps" paper if your data set is really big.

Antonio


andrewmilkowski commented 10 years ago

Yep... makes total sense again. This was an important lesson: kill the test before the test kills me (or burns my Mac's motherboard).

So yes... true... it might not be suitable; I will try something here: rjungler

https://r-forge.r-project.org/projects/rjungler/

By the way, where do you see multiple reducers in the Cloudera program?

This function (the REDUCE function from the Cloudera example):

    fit.trees <- function(k, v) {
      # rmr rbinds the emitted values, so v is a dataframe
      # note that do.trace=T is used to produce output to stderr to keep
      # the reduce task from timing out
      rf <- randomForest(formula=model.formula, data=v, na.action=na.roughfix, ntree=10, do.trace=FALSE)
      # rf is a list so wrap it in another list to ensure that only
      # one object gets emitted. this is because keyval is vectorized
      keyval(k, list(forest=rf))
    }

gets invoked only once in the reducer call. And even though I tried to increase the number of reducers from 1 to 3 with D="mapred.reduce.tasks=3", internally only one R task (reducer) was created.

I could in fact invoke the Revo64 Revoscript Revolution MKL build of R, but this just buys me multi-core execution :) I am looking for a massively scalable solution across, let's say, 1000 machines all at once.

-Andrew

piccolbo commented 10 years ago

On Mon, Sep 30, 2013 at 11:18 AM, Andrew Milkowski notifications@github.com wrote:

> yep... makes total sense again, this was an important lesson, killing the test before test kills me (or burn my mac's motherboard)
>
> so yes... true.. might not be suitable, will try something here: rjungler
>
> https://r-forge.r-project.org/projects/rjungler/
>
> btw where do you see in cloudera program multiple reducers

Because of the keys returned by the map function. The other example you mentioned only has a NULL key.

> this function (the REDUCE function):
>
>     fit.trees <- function(k, v) {
>       # rmr rbinds the emitted values, so v is a dataframe
>       # note that do.trace=T is used to produce output to stderr to keep
>       # the reduce task from timing out
>       rf <- randomForest(formula=model.formula, data=v, na.action=na.roughfix, ntree=10, do.trace=FALSE)
>       # rf is a list so wrap it in another list to ensure that only
>       # one object gets emitted. this is because keyval is vectorized
>       keyval(k, list(forest=rf))
>     }
>
> gets invoked only once in the reducer call.

The reduce function is called once per distinct key value. How many processes are used to execute those calls is under Hadoop's control; see the next point. Ditto for how many nodes are used to execute those processes; that is also Hadoop-controlled.
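A toy illustration of this point (not the Cloudera code): a map function that spreads records over several keys gives Hadoop several independent reduce calls to schedule across however many reduce tasks are configured.

    library(rmr2)

    # keyval() is vectorized: a vector of keys paired with a vector of values
    # emits one record per element. Here each input value is assigned to one
    # of 10 groups at random.
    group.map <- function(k, v) keyval(sample(1:10, length(v), replace = TRUE), v)

    # Called once per distinct key, with v holding all the values for that key.
    count.reduce <- function(k, v) keyval(k, length(v))

    rmr.options(backend = "local")  # small enough to check on the local backend
    from.dfs(mapreduce(input = to.dfs(1:1000), map = group.map, reduce = count.reduce))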

> ... and even though I tried to increase the number of reducers from 1 to 3 with
> D="mapred.reduce.tasks=3", internally only one R task (reducer) was created

Either there is a bug I couldn't see such that all the keys are the same, or this is strange; I think the configuration parameter you are modifying is the correct one.

> I could in fact invoke Revo64 Revoscript Revolution MKL build R, but this just buys me multi core execution :) I am looking for a massively scalable solution across, let's say, 1000 machines all at once

Good point.

Antonio


andrewmilkowski commented 10 years ago

Antonio, to be precise, setting (D="mapred.reduce.tasks=10")

caused 2 reducer tasks to be created (2 R processes). This is still way too small a number; it won't scale...

24497 mapred 20 0 667m 481m 4284 R 97.8 12.6 4:58.21 R
24492 mapred 20 0 667m 481m 4284 R 97.1 12.6 4:43.82 R

This is hence looking less and less like an rmr2 issue and more like a combination of the theoretical underpinnings of random forests and the entropy/structure of the input data; pretty much what you have said.

Thanks again, and I will be following rmr2 closely.

-A

piccolbo commented 10 years ago

Well, Hadoop may not be configured with enough reduce slots. In the mapred-site.xml configuration file, check mapred.tasktracker.reduce.tasks.maximum. I suspect you need a restart after changing that parameter for it to take effect. In general, look into the Hadoop documentation for "number of reduce tasks" or some such.

We would love to make these issues totally transparent for rmr2 users, but it's not an easy task. Hadoop needs to be configured right, there are many knobs, and configuring Hadoop for a variety of loads is difficult. I don't think it's something we (the RHadoop team) can realistically take on, nor can we provide Hadoop sysadmin support for every installation of rmr2, but if people ask in this forum I or somebody else will try to help. Nonetheless, there would probably be more in-depth know-how on a Hadoop forum.

Antonio


andrewmilkowski commented 10 years ago

Antonio, this is very likely an environment issue. I have two distros on my machine (one of them hadoop-2.0.0-cdh4.4.0 with the YARN scheduler, which I know rmr2 does not support yet). I should have known better and will check my configuration. Also, thanks for the help on the other problem regarding the small input data; please don't spend any more time on this. I will return with feedback, and all of this can be closed. Thanks for the great library!

andrewmilkowski commented 10 years ago

Yes, definitely, please close this ticket... fundamental user error.

Specifying the following in mapred-site.xml:


    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
      <description>The maximum number of map tasks that will be run simultaneously by a task tracker.</description>
    </property>

    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>10</value>
      <description>The maximum number of reduce tasks that will be run simultaneously by a task tracker.</description>
    </property>

creates (up to) 10 reducers, and in my sample run I specified 3 reducers (hence 3 R processes), although for a test run on a single machine this is just not appropriate.

Swap: 3260408k total, 628824k used, 2631584k free, 226596k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12662 mapred 20 0 667m 480m 4284 R 93.8 12.6 2:59.71 R
12702 mapred 20 0 662m 475m 4284 R 49.2 12.4 2:09.04 R
12694 mapred 20 0 819m 632m 4392 R 48.9 16.5 2:04.52 R