RevolutionAnalytics / RHadoop

https://github.com/RevolutionAnalytics/RHadoop/wiki

Mahout compatibility, DistributedRowMatrix format #39

Open piccolbo opened 12 years ago

piccolbo commented 12 years ago

Moving the discussion from #7 to narrow the focus.

dlyubimov commented 12 years ago

OK, I took a brief look and it seems vector-to-string serialization is already supported in Mahout, so apparently somebody has already thought of supporting streaming of these things. So we can use the SequenceFileAsText input format. Good.

So the key can be anything, but most commonly it is either Text or a numeric Writable. That shouldn't be a problem for the majority of pipelines.

The format of the vector is

'{' 1*(index ':' value ',' ) '}'

using RFC-822 EBNF notation. Note that the last element may have a comma after it, and elements may be sparse and unordered.

So I guess we can deserialize it on the R side.
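Something along these lines should work for that (an untested sketch; parse.mahout.vector is just an illustrative name, and it assumes strings like "{0:1.5,3:2.0,}" with 0-based indices, an optional trailing comma and possibly sparse, unordered elements):

parse.mahout.vector = function(s) {
  fields = strsplit(gsub("[{}]", "", s), ",")[[1]]     # drop the braces, split the elements
  fields = fields[nzchar(fields)]                      # a trailing comma leaves an empty field
  iv = do.call(rbind, strsplit(fields, ":"))           # one row per index:value pair
  v = numeric(max(as.integer(iv[, 1])) + 1)            # dense result, missing entries stay 0
  v[as.integer(iv[, 1]) + 1] = as.numeric(iv[, 2])     # shift Mahout's 0-based indices to R's 1-based
  v
}

# parse.mahout.vector("{0:1.5,3:2.0,}")   # c(1.5, 0, 0, 2.0)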

Writing Mahout vectors, I guess, is not as easy: we need some sort of record writer with dependencies on the Mahout libraries.

So Q #1: what do I need to write to support custom input parsing into a vector? (one of those text input format functions, I guess?)

Q #2: how do I instruct mapreduce() to load additional Java classes into the job classpath?

piccolbo commented 12 years ago

Great. As far as #1 goes, you need to write a function that takes one line of text and returns one key-value pair, as in

function(line) { parts = strsplit(line, ":"); k = ...; v = ...; keyval(k, v) }

You can look at raw.text.input.format in dev for the simplest one. The csv.text.* ones are higher order, so they may be confusing for this task. Other useful examples are the json.text.* functions. We need to do the writer as well at some point, but let's do one thing at a time.

As for #2, I don't have more than a workaround to suggest. The tuning.parameters argument can accept backend-specific options. The idea was to use it only for performance, but we can stretch it for a moment. What you would need to pass is something like list(hadoop = list(libjars = file.jar)). That will be passed to hadoop streaming as -libjars file.jar. Please check against the streaming manual that this is the right syntax. This is not the recommended use of the tuning.parameters argument and it will not work for the local backend, so I would call it a workaround for now and decide how it fits the API later. It's pretty clear to me that we need to support this one way or another.
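To make that concrete, here is an untested sketch that reuses the parse.mahout.vector helper sketched above. It assumes SequenceFileAsText hands streaming key<TAB>vector lines, the name of the argument that takes the line parser is a guess (check it against raw.text.input.format in dev), and the path and jar name are placeholders:

drm.text.input.format = function(line) {
  parts = strsplit(line, "\t")[[1]]                    # assumed key<TAB>{i:v,...} layout
  keyval(parts[1], parse.mahout.vector(parts[2]))
}

mapreduce(
  input = "/mahout/drm",                               # placeholder HDFS path
  textinputformat = drm.text.input.format,             # guessed argument name, see above
  map = function(k, v) keyval(k, v),                   # identity map, just a smoke test
  tuning.parameters = list(hadoop = list(libjars = "mahout-core.jar")))  # becomes -libjars mahout-core.jar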

piccolbo commented 12 years ago

In the binary-io branch you can find semi-working support for reading sequence files through the typedbytes IO specs. Names are pretty tentative and there are some important changes in the mapreduce interface as far as IO is concerned. My hope is that if you do a mapreduce(input = , input.specs = make.input.specs("typedbytes"), ...) it should just work. If you give it a spin, let me know how it goes; I could use some help in getting these new features to a more complete state. Please keep in mind that it's a feature branch under heavy development.
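Spelled out, that call would look something like this (untested; the path is a placeholder and the identity map is only there to make it runnable):

out = mapreduce(
  input = "/mahout/drm",                               # placeholder path to the Mahout output
  input.specs = make.input.specs("typedbytes"),        # as described above, dev/binary-io interface
  map = function(k, v) keyval(k, v))                   # identity map as a smoke test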

piccolbo commented 12 years ago

The branch is now merged into dev and in much better shape.

dlyubimov commented 12 years ago

In the binary-io branch you can find semi-working support for reading sequence files

What do you mean by the binary-io branch? I can't see one. I guess you've mentioned it's now merged -- has it been deleted?

What environment do you use to debug? The Revolution IDE?

I was thinking a little bit about it, and it seems I'd like to consider three separate tasks w.r.t. Mahout:

1) writing/reading DRM files iteratively or in chunks in the front end
2) just having a frontend invocation facade for a bunch of Mahout methods (such as the MR versions of ALS-WR and stochastic SVD), i.e. essentially an MR pipeline driver for some of this stuff
3) reading DRM files inside MR tasks, and writing DRM files (collections of vectors) as multiple outputs when needed

Task 1 is probably what the rhdfs package should be able to help with.

Task 3 comes closer to developing methods in R based on the Mahout DRM format rather than using Mahout per se, and at this point it is probably both the most expensive and the lowest priority on my pragmatic task list (i.e. it's just a capability for the sake of the capability itself w.r.t. my current tasks).

With that in mind, perhaps it is worth adding a separate R package that handles those classes of issues and depends on the rhdfs and rmr packages?

piccolbo commented 12 years ago

Yes, there is no reason to keep a feature branch around after merging it. The IDE question maybe belongs to a different issue.

1) In rmr we've adopted a dichotomy: data is either big, and then you deal with it with mapreduce, or it is small, and then you can do a from.dfs on it and continue in memory. It is possible that there is something in between, where you cannot fit the data in RAM but don't need mapreduce to process it, which sounds related to your chunks idea. That would be the 100GB to 1TB kind of size. My long-term view is to still use mapreduce on those datasets but with different backends, like an SMP backend. I think it's too early to know how that will play out. And yes, I would look at the low-level section of rhdfs.

2) I think this could be interesting and useful but out of scope for this project.

3) Your interpretation of why we have format flexibility in rmr is not correct. rmr has its own native format and there is no reason not to use it unless you are working in a mixed environment where some tasks and some people use Mahout and some use rmr. By allowing rmr to read Mahout-generated data, we enable this mixed environment. The goal with rmr is to separate format concerns and algorithmic concerns as much as possible, so "developing based on format X" should not be necessary and is not even encouraged (would you like to have one version of kmeans for JSON data and one for sequence data? I wouldn't). It seems your best bet is a separate package with rhdfs as a dependency for the low-level reading. As far as rmr goes, I am not sure why it should be a dependency if you are not interested in implementing mapreduce algorithms on this data. If you need the typedbytes reader, I have to warn you that I suspect it will not work outside rmr, because it is a raw typedbytes reader, whereas Mahout, if I get it right, writes sequence files with typedbytes keys and values. rmr relies on streaming to understand the sequence format and pass down only raw typedbytes, and the R function takes it from there.
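For 1), an untested sketch of the low-level rhdfs route, assuming the DRM has first been dumped to the text form discussed earlier (one key plus vector string per line) and that the path is a placeholder:

library(rhdfs)
hdfs.init()

reader = hdfs.line.reader("/mahout/drm-as-text/part-00000")
repeat {
  lines = reader$read()                                # next chunk of lines, empty when exhausted
  if (length(lines) == 0) break
  # parse each line with something like the parse.mahout.vector sketch above
  # and process the chunk without holding the whole matrix in RAM
}
reader$close()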

dlyubimov commented 12 years ago

The IDE question maybe belongs to a different issue.

I am just asking for practical advice on which tool is best to use given the project layout and build strategy. I have not developed R packages before.

would you like to have one version of kmeans for JSON data and one for sequence data? I wouldn't

I don't think I am saying that. The MR pattern relies on the adapter pattern as far as input and output formats are concerned, but plugging in different formats is always a possibility. The question is what it takes to develop such adapters that would write/read DRM when configured.
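For the text side of such a write adapter, the inverse of the parser sketched earlier is simple; an untested sketch (it does not address the Mahout-dependent record writer needed to emit real sequence files of VectorWritable):

to.mahout.vector = function(v) {
  nz = which(v != 0)                                   # keep nonzero entries only
  paste0("{", paste(nz - 1, v[nz], sep = ":", collapse = ","), "}")   # back to the 0-based "{i:v,...}" form
}

# to.mahout.vector(c(1.5, 0, 0, 2.0))   # "{0:1.5,3:2}"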

I think this could be interesting and useful but out of scope for this project

I think these are a bunch of coherent concerns that come up in practice and that deserve their own package (whether it is part of this project or not is another issue). I am just saying it probably isn't worth it to enable various aspects of Mahout integration and then disperse them through different packages. Now the question is whether the RHadoop packages can provide enough value to justify a dependency on them, or whether the RHadoop framework is too devoted to its own way of serializing R objects for that kind of adaptation to make sense.