Open · wolfxl opened this issue 9 years ago
On Wed, Dec 24, 2014 at 12:51 PM, wolfxl notifications@github.com wrote:
I have a question about an additional parameter. The map() function requires two parameters, keys and vals. Suppose that in the map function, in addition to keys and vals, I need another dataset, say dataX, which can be located in HDFS or on the local machine. How can I pass dataX to the datanodes?
The structure of my current code looks like this. I found that the datanodes can load dataX from local disk this way. Is this the recommended way to pass an additional parameter?
Yes
How do the datanodes know about dataX? I feel like they are not in the same namespace.
We felt that the normal scoping rules should be upheld to the extent possible (read-only, which is what you need here). The variables in scope are serialized and broadcast to the nodes using an efficient broadcasting mechanism offered by Hadoop MapReduce. It should be superior to reading a separate file from HDFS (other than the main program input).
As for data in HDFS, how can I deal with that? Many thanks!
```r
dataX = read.csv(XXX)   # side dataset, read on the local machine

igraphFrame_map = function(k, lines2) {
  # ... uses dataX here ...
  return(keyval(mykey, 1))
}

igraphFrame_reduce = function(k, v) {
  # ... emits keyval(k', v') ...
}

igraphFrame = function(input, output) {
  mapreduce(input = input, output = output,
            map = igraphFrame_map, reduce = igraphFrame_reduce)
}
```
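To make this concrete, here is a minimal, self-contained sketch of the pattern Antonio describes: a variable defined in the calling environment (dataX) is captured, serialized, and broadcast to the tasks, so the map function can simply refer to it. The data, column names, and weighting logic below are made up purely for illustration.

```r
library(rmr2)

# Side dataset defined on the local machine; any variable in scope when
# mapreduce() is called is serialized and shipped to the tasks.
dataX <- data.frame(id = 1:3, weight = c(0.1, 0.5, 0.4))

# Toy input written to HDFS: keys are ids, values are measurements.
input <- to.dfs(keyval(sample(1:3, 20, replace = TRUE), rnorm(20)))

example_map <- function(k, v) {
  # dataX is visible here even though it was defined outside mapreduce():
  # rmr2 captures it from the enclosing environment and broadcasts it.
  w <- dataX$weight[match(k, dataX$id)]
  keyval(k, v * w)
}

example_reduce <- function(k, vv) {
  # Sum the weighted values per key.
  keyval(k, sum(vv))
}

out <- mapreduce(input = input,
                 map = example_map,
                 reduce = example_reduce)

from.dfs(out)
```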
Many thanks, Antonio, that's very helpful! So now I understand that if I have a dataset located on local disk, I can send it directly to the different datanodes in the way mentioned above. But what if the data is in HDFS and I need to use it as an additional parameter in mapreduce? Do I have to read it from HDFS to local, e.g. dataX = from.dfs("directory on hdfs"), and then send it back out to the datanodes? I feel like reading it to local will take a lot of time.
If not, how do I refer to the data (in HDFS) in the mapper or reducer functions (as additional parameters beyond the default key and val)?
Thanks so much!
I think your main alternative is a join (called equijoin in rmr2).
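For reference, a rough sketch of that equijoin approach with both datasets keyed by a shared join key. The data and key choice are hypothetical, and the exact argument list may vary between rmr2 versions, so check ?equijoin before relying on it.

```r
library(rmr2)

# Both datasets live in HDFS as keyval objects keyed by the join key
# (here a made-up id); in practice these would be your existing HDFS data.
main.data <- to.dfs(keyval(1:10, data.frame(id = 1:10, x = rnorm(10))))
side.data <- to.dfs(keyval(1:10, data.frame(id = 1:10, label = letters[1:10])))

# equijoin matches records from the two inputs that share the same key,
# so the side data never has to be pulled back to the local machine.
joined <- equijoin(left.input  = main.data,
                   right.input = side.data)

from.dfs(joined)
```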