Closed: yibochen closed this issue 12 years ago.
Thanks for your report; a fix is in 2.0.1. Let me point out, though, that deliberately writing non-scalable code as you are doing is bad practice, and it is the main reason backend.parameters is deprecated. It could disappear as early as 2.1. I've never seen a good use for it except experimenting with formats.
On Fri, Oct 19, 2012 at 8:23 AM, yibo notifications@github.com wrote:
When I used the rmr package, I found it useful to specify the number of mappers and reducers through the 'backend.parameters' argument. But in rmr2, when I set this argument, I do not get the desired result. I tried to read the code and think it may be caused by a change in the 'paste.options' function. For example, I set backend.parameters=list(hadoop=list(D='mapred.reduce.tasks=10', D='mapred.map.tasks=10')). In rmr, the streaming command options look like
rmr:::paste.options(list(D='mapred.map.tasks=10', D='mapred.reduce.tasks=10'))
[1] "-D mapred.map.tasks=10 -D mapred.reduce.tasks=10"
while in rmr2 I get
rmr2:::paste.options(list(D='mapred.map.tasks=10', D='mapred.reduce.tasks=10'))
[1] " - mapred.map.tasks=10 - mapred.reduce.tasks=10 "
So in rmr2, how can I specify the number of mappers and reducers? Thank you.
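(For reference, a minimal sketch of the usage being discussed, assuming rmr2 >= 2.0.1 where paste.options emits proper -D flags again; the input path and the toy map/reduce are placeholders, and these Hadoop properties are hints, not hard guarantees:)

```r
library(rmr2)

# Toy job asking streaming for 10 map and 10 reduce tasks via generic
# -D options; Hadoop may still adjust these numbers.
out <- mapreduce(
  input  = "/tmp/some-input",                      # placeholder HDFS path
  map    = function(k, v)                          # assumes numeric input values
    keyval(v %% 10, rep(1, length(v))),
  reduce = function(k, vv) keyval(k, sum(vv)),
  backend.parameters = list(
    hadoop = list(D = "mapred.map.tasks=10",
                  D = "mapred.reduce.tasks=10")))
```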
Does Hadoop really do that great a job guessing the number of map and reduce tasks? I've found it helpful to at least be able to specify the size of the chunks that get distributed (e.g. smaller for compute-intensive tasks).
I am interested in anything that has legs beyond a single deployment and a single algorithm. I don't think Hadoop does a great job at guessing: it is mostly based on data size and doesn't use any information about the task, not even after running a few tasks (when an estimate of time per unit of data becomes available), at least the last time I checked. I honestly don't see how smaller chunks would help for compute-intensive tasks, though. I think one needs chunks that take at least a few minutes each, otherwise the overhead of task maintenance dominates, and enough of them to guarantee good utilization. That can be the number of cores, or twice the number of cores, on a cluster that's not doing anything else; it's more complicated when there are multiple users. Or it can be as big as 100X the number of cores for I/O-bound jobs such as web crawling, but I think that approach has now been supplanted by non-blocking I/O. So twice the number of cores, but at least x MB per task, is a good heuristic, but how do we implement it?
Antonio
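(Not part of rmr2, just a sketch to make that heuristic concrete; the core count, the 64 MB floor and the input-size argument are all assumptions:)

```r
# Heuristic sketch: aim for ~2 tasks per core, but never let a task's
# share of the input drop below a minimum size. All numbers are made up.
suggest.task.count <- function(input.size.bytes, cores, min.mb = 64) {
  by.cores <- 2 * cores                                    # keep all cores busy
  by.size  <- ceiling(input.size.bytes / (min.mb * 2^20))  # floor on task size
  max(1, min(by.cores, by.size))
}

# e.g. 10 GB of input on a 16-core cluster
suggest.task.count(10 * 2^30, cores = 16)   # capped at 32 by the core rule
```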
So it seems like you're saying it's helpful to be able to tweak the number of tasks, but that the user shouldn't be doing it? Would this be something that should be an rmr.option? That way you're not specifying it for the task, but for the project/cluster configuration.
Yes, I think I may have expressed that position in the past, but I didn't get much support from users, who wanted to specify the number of tasks on a job-by-job basis. So we did nothing about it.
Antonio
I remember now: the reason was that I maintain this needs to be fixed at the Hadoop configuration level. Some people were running with a maximum of two reducer slots and wanted to change that, but what they really needed was a different cluster administrator.
So in your experience/opinion, if the cluster is set up correctly at the Hadoop level, there's no need for RHadoop to mess with anything. Are there situations in which it would be necessary to have a general Hadoop config different from what you'd want to use to run RHadoop queries? Anything related to rmr I/O, memory, or compute costs? It doesn't seem unreasonable to think you might want to tweak settings for certain kinds of jobs if you know their costs.
Yes, I can imagine some very memory-intensive situations where one would want to go below the defaults.
A
I recently recalled reading this relevant SIGMOD paper: http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
They use Pig to run the jobs, but explicitly control the number of reduce tasks in order to control the number of models estimated from a single pass through the data. It certainly seems like that kind of control might be desirable.
They should set the cardinality of keys instead. They are not acting at the right level of abstraction.
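(To make that point concrete: instead of forcing mapred.reduce.tasks, fix the number of distinct keys. A hedged sketch with rmr2; train.model() is hypothetical, and K, the input path and the random grouping rule are made up for illustration:)

```r
library(rmr2)

K <- 10   # desired number of models == number of distinct keys

models <- mapreduce(
  input = "/tmp/training-data",                  # placeholder HDFS path
  map = function(k, v)
    # spread the records in this chunk over K groups at random
    # (v assumed to be a vector of records; any hash would do as well)
    keyval(sample(K, length(v), replace = TRUE), v),
  reduce = function(k, vv)
    # one model per key, independent of how many reduce tasks run
    keyval(k, list(train.model(vv))))            # train.model() is hypothetical
```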
I'm not entirely clear on how reduce and combine are distributed across tasks/threads. Some scalable models, e.g. stochastic gradient descent, are still estimated sequentially. Are you guaranteed that all values for the same key will be processed by the same reduce thread?
Jamie Olson
More than that: all values for the same key go to the same reduce call.
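(That guarantee is easy to check with the local backend; a minimal sketch, with an arbitrary grouping and a mean standing in for a sequentially fitted model:)

```r
library(rmr2)
rmr.options(backend = "local")   # run in-process, no Hadoop needed

# 100 records spread over 3 keys
input <- to.dfs(keyval(sample(3, 100, replace = TRUE), rnorm(100)))

out <- mapreduce(
  input  = input,
  map    = function(k, v) keyval(k, v),
  # each reduce call sees *all* values for one key, so a sequential
  # computation (here just a mean) over the whole group is safe
  reduce = function(k, vv) keyval(k, mean(vv)))

from.dfs(out)   # exactly one summary per key
```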