RevolutionAnalytics / RHadoop

https://github.com/RevolutionAnalytics/RHadoop/wiki

possible to create a figure (or write to a file) in the reducer? #191

Closed · ywen2000 closed this issue 11 years ago

ywen2000 commented 11 years ago

The input to my reducer is (id, data). I would like to create a PDF figure for each id (that is, in each reduce call) on HDFS. I tried using hdfs.file to create a file, but it crashed. Any good ideas?

Also, is it possible to rename the Hadoop output file to something other than part-00000 from within the mapper or reducer?

Another thing I noticed: if a reducer is specified, any print statement in the mapper crashes the job. Is there an explanation for this?

One more side question: does the function keyval() call return() internally?

Thanks!

piccolbo commented 11 years ago


> The input to my reducer is (id, data). I would like to create a PDF figure for each id (that is, in each reduce call) on HDFS. I tried using hdfs.file to create a file, but it crashed. Any good ideas?

I don't know if it is a good idea, but the way I would go about it is to write to a local file, read it back in with readBin, and then return the raw vector as the value. If there were a binary equivalent of a text connection, you could try to avoid the temporary file.
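A minimal sketch of that idea in a reduce function. This is an illustration built on the suggestion above, not code from the thread: plot(data) stands in for whatever plotting code applies; keyval is from rmr2, the rest is base R.

```r
library(rmr2)

reduce.fn <- function(id, data) {
  # render the figure to a local temporary file on the task node
  tmp <- tempfile(fileext = ".pdf")
  pdf(tmp)
  plot(data)   # stand-in for the actual plotting code
  dev.off()
  # read the bytes back and emit them as the value for this id
  bytes <- readBin(tmp, what = "raw", n = file.info(tmp)$size)
  file.remove(tmp)
  keyval(id, list(bytes))
}
```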

> Also, is it possible to rename the Hadoop output file to something other than part-00000 from within the mapper or reducer?

There may be a very convoluted way, but not as a supported feature.

> Another thing I noticed: if a reducer is specified, any print statement in the mapper crashes the job. Is there an explanation for this?

Yes, it's in the manual. Use stderr instead, or rmr.str.
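For example, a map function that logs safely (an illustrative sketch; rmr.str is a real rmr2 helper that writes an object's structure to standard error):

```r
map.fn <- function(k, v) {
  # stdout carries the job's key-value stream, so diagnostics go to stderr
  cat("map call reached\n", file = stderr())
  rmr.str(v)    # dumps the structure of v to stderr
  keyval(k, v)
}
```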

Antonio


piccolbo commented 11 years ago

> ...and then return the raw vector as the value.

I meant list(raw vector) as the value.


ywen2000 commented 11 years ago

Thanks. Where can I download the manual?

I need some suggestions on how to debug the code. I saw someone mention running Hadoop in standalone mode. Do you know how to do that? What is the difference between standalone mode and the normal mode of Hadoop?

piccolbo commented 11 years ago


> Thanks. Where can I download the manual?

It's installed with the package. Are you familiar with R help?

> I need some suggestions on how to debug the code.

See the debugging guide in the wiki.

> I saw someone mention running Hadoop in standalone mode. Do you know how to do that?

I would consult the manual of your Hadoop distribution; rmr2 has no say in that.

> What is the difference between standalone mode and the normal mode of Hadoop?

This is a general Hadoop question, and you'll be better served by asking it in the appropriate forum. It's covered in every Hadoop tutorial I've ever run into.


ywen2000 commented 11 years ago

Thanks. Back to the original question: how would I even write to the local file system in the reducer? It seems that it tried to write to HDFS and then crashed. What command should I use?

I appreciate your help.


piccolbo commented 11 years ago


> Thanks. Back to the original question: how would I even write to the local file system in the reducer?

Try the normal way, writing to the current directory, e.g. with png(filename).

> It seems that it tried to write to HDFS and then crashed. What command should I use?

I would be cautious about generalizing from a crash we don't understand yet. Writing to a temporary local file is a pretty safe thing to do; writing in parallel to a distributed file is a completely different thing. If you were writing in Java you would use something like MultipleOutputFormat, but this feature is not available in rmr2. rhdfs was never meant for parallel writes.

> I appreciate your help.

Happy to help.

Antonio


ywen2000 commented 11 years ago

I found it was a permission problem. My account on the local system is "hduser", but when I ran the MapReduce job everything ran as "mapred", so no wonder it could not access hduser's part of the local file system. I was able to write to a particular folder after a "chmod a+w", although that doesn't look like a good solution. Anyway, I can at least write to the local file system now.

I guess the next step is uploading it to HDFS, as you suggested. I will try and report back. Thanks for your tips.


piccolbo commented 11 years ago


> I guess the next step is uploading it to HDFS, as you suggested.

Just to clarify, I did not suggest writing to HDFS directly. Just return the data from the reducer. If you need the images in separate files, then I am not sure.

Antonio


ywen2000 commented 11 years ago

I want each reduce call to produce a file on HDFS. I thought I could do the following in each reduce call: create the file on the local file system, then hdfs.put(local_file_path, hdfs_path). However, I got the following error:

Error in .hdfsCopy(src = src, dest = dest, srcFS = srcFS, dstFS = dstFS, :
  attempt to apply non-function
Calls: ... is.keyval -> reduce -> hdfs.put -> hdfs.copy -> .hdfsCopy
Execution halted

I tried the same hdfs.put command on the command line with the same arguments, and it worked. It looks like hdfs.put and the other rhdfs commands cannot be used in a reducer to create or write files on HDFS. Does that mean parallel writes are not supported in rmr2?


piccolbo commented 11 years ago

Since rmr2 functions accept functions as arguments, your argument would imply that we need to support anything any user puts in those functions, including all CRAN packages. I am sorry, but that is not going to happen. You are supposed to return your data from the reduce function, and Hadoop will write it out in parallel. If you want to do it your way, you have the right to try, but you are on your own: it should work, but with horrible performance, and the reduce function still needs to return a vector, data frame, matrix, or list.

I suggested a different approach, but you seem to have ignored it, so I am not sure I can be of any help. The reduce function should return keyval(image id, list(image data)). Don't use rhdfs in a reduce. Please try this and let me know.
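A sketch of that approach end to end, under the same assumptions as the earlier snippet: the input path is hypothetical, plot(data) stands in for the real plotting code, and mapreduce/keyval are the real rmr2 calls.

```r
library(rmr2)

plots <- mapreduce(
  input  = "/user/hduser/input",   # hypothetical HDFS input path
  reduce = function(id, data) {
    tmp <- tempfile(fileext = ".pdf")
    pdf(tmp)
    plot(data)                     # stand-in plotting code
    dev.off()
    # one serialized image per id; Hadoop writes the job output in parallel
    keyval(id, list(readBin(tmp, "raw", file.info(tmp)$size)))
  })
```

Splitting the serialized images into individual files is then a post-job step, outside the reducer.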

Antonio


ywen2000 commented 11 years ago

Thank you for pointing me in the right direction. I also felt that using rhdfs in the reducer was not a good idea; I just wanted to confirm that with you.

I think your suggestion should work, and I will give an update later. I have a follow-up question: if I have multiple return values for the same key, say keyval(id, list(data1, data2, data3)), how should I post-process them after the MapReduce job? The data could be of various types; for example, data1 could be a data frame, data2 an image, and so on. The MapReduce job emits only one output. Do you suggest loading the file into RAM and processing it there, or using it as the input to a second MapReduce job? I guess it all depends on the file size?

Thanks!


piccolbo commented 11 years ago

Size and what you need to do with the data should be the main considerations. The specific data types should be irrelevant to your decision.
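A sketch of both options, assuming plots is the job result from the earlier sketch; from.dfs, keys, and values are real rmr2 functions, while the file-naming loop is purely illustrative:

```r
library(rmr2)

# Small output: pull everything into the current R session.
out    <- from.dfs(plots)
ids    <- keys(out)
images <- values(out)   # list of raw vectors, one per id
for (i in seq_along(ids))
  writeBin(images[[i]], sprintf("figure-%s.pdf", ids[[i]]))

# Large output: use the result as the input of a second job instead,
# e.g. mapreduce(input = plots, map = ..., reduce = ...).
```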

Antonio
