RevolutionAnalytics / RHadoop

https://github.com/RevolutionAnalytics/RHadoop/wiki

huge combiner or reducer failures #100

Closed piccolbo closed 12 years ago

piccolbo commented 12 years ago

It seems that when a combine and a reduce are too big, jobs don't fail: a few tasks are killed, the error logs are not very instructive, and NAs are intermixed with the results where they shouldn't be. It could be a timeout or an out-of-memory condition, but a timeout normally causes the job to fail; I think it is more likely R running out of memory without exiting with any message or error code. This is happening in 1.2.2, so with no rmr C code to speak of (that is only used from 1.3 on). At a minimum, I would like to see task attempts fail or succeed with correct results, not this in-between. Second, is there any way to control the number of combiners and reducers to avoid this issue? There is for reducers, but it is left to the user for now (see the sketch below). Another approach would be to sidestep the use of lists completely in the case of structured data and keep everything as data frames, which are much more compact (a data frame path from input to output). That would bring memory usage down, but it doesn't solve the problem of R not failing when it should.
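A minimal sketch of how the reducer count can be passed through to Hadoop streaming so each reduce task sees a smaller slice of the keys. This assumes the later rmr2-style `backend.parameters` argument and a made-up job; the 1.2.x argument names and input path are hypothetical and may differ:

```r
library(rmr2)  # assumption: rmr2-style API; 1.2.x may expose this differently

# Hypothetical word-bucket job: raise the number of reduce tasks so that
# each reducer's input stays small enough to fit in R's memory.
out <- mapreduce(
  input   = "/tmp/big-input",                       # hypothetical HDFS path
  map     = function(k, v) keyval(v %% 100, 1),
  reduce  = function(k, vv) keyval(k, sum(vv)),
  combine = TRUE,                                   # reuse the reduce as a combiner
  # Pass the streaming property straight to Hadoop: more reduce tasks,
  # less data per task attempt.
  backend.parameters = list(
    hadoop = list(D = "mapred.reduce.tasks=64")
  )
)
```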

piccolbo commented 12 years ago

Now I see this a little differently. It seems that all tasks succeeded after a variable number of attempts. My best guess is that they failed on an out-of-memory issue, but when re-attempted on a more lightly loaded cluster they succeeded. The odd results seem to come from an integer overflow (caused by my own programming error, compounded by R not having long ints).
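For the record, this is how an integer overflow shows up as NAs in plain R, with no rmr involved; the values are just an illustration:

```r
.Machine$integer.max               # 2147483647: the largest value an R integer can hold
2147483647L + 1L                   # NA, with an "integer overflow" warning
sum(c(2000000000L, 2000000000L))   # NA as well: sum() over integers overflows the same way
sum(as.numeric(c(2000000000L, 2000000000L)))  # 4e+09: doubles avoid the overflow
```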