Open mjwillson opened 10 years ago
I have just started to build some map/reduce jobs with Parkour. I like everything I see about the library, but I have a good knowledge of the Java APIs, and I agree that this really helps. It is also helpful to be pretty well versed in Clojure: seqs, reducers, threading macros, etc. What is missing, in my mind, for either the Java API or Parkour is a sort of cookbook. It would be great to get something like this started, and in the next couple of months I would certainly have some possible recipes worked out. I'm not really sure how to get a cookbook project going, though, along the lines of https://github.com/clojure-cookbook/clojure-cookbook. Publishing is not really my area of expertise.
Thanks, Chris
@mjwillson -- It's not the best example, but there is still one test namespace which tests the Hadoop Job API with the lower-level Parkour-Hadoop integration API: https://github.com/damballa/parkour/blob/master/test/parkour/word_count_test.clj
Documentation for people new to Hadoop is something I struggle with. It's difficult for me to see the gaps because I've worked with Hadoop for so long (and frequently at such a low level). I've thus tried to punt and largely avoid explaining Hadoop fundamentals, but that has certainly made Parkour less accessible to Clojure programmers without existing Hadoop experience.
I'm not sure a full-on Cookbook is the answer, but I'd be happy to collaborate on some "if you're new to Hadoop" documentation.
Thanks for the pointer, yeah that helps.
Agreed that I wouldn't expect a project like this to teach Hadoop from scratch. Perhaps where the gaps lie is in spelling out, in explicit for-dummies terms, how this API translates to and from more canonical, boilerplatey usage of the Hadoop Java API. A few carefully selected examples might be enough; a whole cookbook would be nice but perhaps not essential. The other thing which could help: where particular awkward quirks in the way the Hadoop fundamentals work are relevant to the design decisions here, give a bit of background for newbies on the Hadoop side of things, not just on what Parkour does on top to make it more pleasant.
I'll try and revisit and make some more constructive suggestions about where the gaps lie when I'm a bit further along anyway.
So I have one particular use case that I'm having difficulty translating into Parkour. I've been poring over the docs and experimenting, but I'm as yet unable to see the exact path forward. I have done this in Java M/R, so I'll describe it from that perspective.
The job is a map-only job where I override run() to control the flow of records from the split all at once. Inside run() I:
1. Write the split's data out to a temp file in the task attempt's directory.
2. Execute a binary that consumes the temp file as input and writes an output file.
3. Read the binary's output file, process it, and emit records to the job context's output.
This is a very common use case in my domain (bioinformatics), as many of the algorithms we want to process data with take the form of command-line tools. Re-implementing those tools as functions that could be used in a mapper is not feasible: there are too many of them, and the algorithms they use are research topics in their own right.
I'm having difficulty mapping this into Parkour concepts. For example, I can't see how to return a reducible after performing the above steps. It seems like I need to somehow connect a lazy seq to a dsink.
If you have the time to point me in the right direction, it would be greatly appreciated.
Thanks, Chris
@chriscnc -- Your Parkour task functions are nearly-literally identical to the Mapper/Reducer class `run()` methods. The main difference is that you yield results via the function's return value instead of imperatively pushing them. In your task function you should be able to follow exactly the same process as in a Java job, just with that structural change. Heck, if you wanted to do a 1:1 translation you could even grab `mr/*context*`, emit records by Java interop, then return `nil` from the function.
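For example, something roughly like this (just a sketch, not tested -- `my-tool` and its flags are placeholders, and you'd want to shape the task input and pick temp-file locations to fit your actual job):

```clojure
(require '[clojure.java.io :as io]
         '[clojure.java.shell :as sh]
         '[parkour.mapreduce :as mr])
(import '[org.apache.hadoop.io NullWritable Text])

(defn external-tool-m
  "Imperative 1:1 translation of the Java run() pattern.  Assumes the
  task input `coll` has already been shaped to plain text lines."
  [coll]
  (let [in-file  (java.io.File/createTempFile "tool-in" ".txt")
        out-file (java.io.File/createTempFile "tool-out" ".txt")]
    ;; 1) spill the split's records to a local temp file
    (with-open [w (io/writer in-file)]
      (reduce (fn [_ line] (.write w (str line "\n"))) nil coll))
    ;; 2) run the external binary over it
    (sh/sh "my-tool" "--in" (str in-file) "--out" (str out-file))
    ;; 3) push the tool's output straight to the Hadoop task context
    (with-open [r (io/reader out-file)]
      (doseq [line (line-seq r)]
        (.write mr/*context* (NullWritable/get) (Text. line))))
    ;; results were emitted imperatively, so return nil
    nil))
```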
If you want more functional code that yields results via the return value, but still want to lazily read your external tool's output, you have two basic options. One is a lazy seq, e.g. via `line-seq`. You can make this clean by leveraging the fact that Parkour wraps each task execution in a resource scope, and thus ensure that files are closed etc. upon task exit: https://github.com/pjstadig/scopes
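A sketch of that lazy, return-value version (same placeholders as above; I'm assuming scopes exposes something like `scoped!` to register a Closeable with the enclosing resource scope -- check its README for the exact call):

```clojure
(require '[clojure.java.io :as io]
         '[clojure.java.shell :as sh]
         '[pjstadig.scopes :as scopes])

(defn external-tool-lazy-m
  "Yield the external tool's output as a lazy seq; the reader is closed
  when Parkour's per-task resource scope exits."
  [coll]
  (let [in-file  (java.io.File/createTempFile "tool-in" ".txt")
        out-file (java.io.File/createTempFile "tool-out" ".txt")]
    ;; same spill-to-temp-file and shell-out steps as the sketch above
    (with-open [w (io/writer in-file)]
      (reduce (fn [_ line] (.write w (str line "\n"))) nil coll))
    (sh/sh "my-tool" "--in" (str in-file) "--out" (str out-file))
    ;; lazily read the output; the registered reader is closed at task exit
    (->> (scopes/scoped! (io/reader out-file))
         line-seq
         ;; shape the tuples to whatever your output dsink expects
         (map (fn [line] [line nil])))))
```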
Hello
Just a small (and selfish, feel free to ignore!) doc request / bit of feedback.
At the moment the library seems very much geared towards people who've already done Hadoop the hard way, understand the pain points, and want a higher-level DSL which abstracts over them a bit more.
Which is a perfectly valid aim, but I feel like it could also be a bit more accessible to those just starting out with Hadoop, with a little more motivating "big picture" documentation along the lines of: here's how you do things directly with Hadoop, here's why that's painful, here's how the higher-level constructs in this library help, and here's how they translate to and from the lower-level stuff which you can read about elsewhere.
The docs already do a good job of outlining this in places, although I'm thinking of the parkour.graph stuff in particular -- here there's (what looks to a newbie like) a fair amount of magic introduced, and it's not quite clear how the chains of parkour.graph calls in the examples translate into MapReduce jobs.
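For concreteness, it's chains like this I mean (paraphrasing the README's word count from memory, so the exact names may be slightly off -- treat it as pseudocode for the shape of a graph chain, with my guesses at the rough Hadoop equivalents in the comments):

```clojure
(require '[clojure.string :as str]
         '[clojure.core.reducers :as r]
         '[parkour (graph :as pg)]
         '[parkour.io (seqf :as seqf)])
(import '[org.apache.hadoop.io Text LongWritable])

(defn word-count-m [coll]                    ; map task function
  (->> coll
       (r/mapcat #(str/split % #"\s+"))
       (r/map #(-> [% 1]))))

(defn word-count-r [coll]                    ; reduce task function
  (r/map (fn [[word counts]] [word (reduce + counts)]) coll))

(defn word-count [conf lines]
  (-> (pg/input lines)                       ; ~ input format + paths
      (pg/map #'word-count-m)                ; ~ job.setMapperClass(...)
      (pg/partition [Text LongWritable])     ; ~ map-output key/value classes + partitioner
      (pg/reduce #'word-count-r)             ; ~ job.setReducerClass(...)
      (pg/output (seqf/dsink [Text LongWritable]))  ; ~ output format + path
      (pg/execute conf "word-count")))       ; ~ job submission
```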
Is there a less magical, more direct way to set up a single MapReduce job with a given mapper and/or reducer using this library? I'd like to walk before I run and do things very explicitly, for the sake of understanding the lower level and thereby the higher-level APIs and the pain points they're addressing.
I realise that I could do this directly with the Hadoop Java API, and maybe that's the only way to true enlightenment. But I very much like some of the features of this library, like the REPL workflow, idiomatic Clojure-ness, and ease of testing, and it'd be nice to benefit from these while gradually easing into more abstractions on top of the native Hadoop concepts. Which I'm sure I can -- I just can't see the forest for the trees at the moment.
I'll see how I get on anyway -- keep up the good work! Cheers -Matt