I like this idea a lot. I'm currently doing this sort of thing inside a shell
script, but the problem is that if some transient error occurs, the whole thing
starts over.
Original comment by petenewc...@gmail.com
on 3 Aug 2011 at 1:27
If your dataflow is properly designed it won't need to start everything over,
only resume incrementally from the part that failed to calculate.
To make an analogy with the Unix 'make' utility: if there is a syntax error in
one source file, it does not require you to recompile the whole project, only
the parts that depend on it.
Original comment by kroko...@gmail.com
on 3 Aug 2011 at 2:04
Reading the original post more carefully, for this kind of calculation (fixed
point, or trying to reach some error threshold) something like this might be
useful. I am a bit cautious about introducing imperative semantics into
HAMAKE, as it would require a major review of the processing model.
Original comment by kroko...@gmail.com
on 3 Aug 2011 at 2:07
Exactly, that's what this feature would allow me to do. Currently I can't
describe the complete dataflow to hamake because the number of steps in it is
dynamic.
Original comment by petenewc...@gmail.com
on 3 Aug 2011 at 2:08
A more concrete statement of this use case: a repetitive computation starts
with a set of data. This goes through several map/reduce stages. The last job
scans the resulting dataset and makes a judgement: "I'm not done". At this
point, somehow, the first job starts with the output of the next-to-last
computation. It does not restart with the original data.
Original comment by lance.no...@gmail.com
on 3 Aug 2011 at 3:37
I was doing something similar via generations. In fact, this was the reason the
feature was introduced. I've used a shell script to do the following (a rough
sketch of this driver loop is shown below):
1. Run hamake.
2. Run a program to check an exit condition.
3. If the condition is met, EXIT; if not, GOTO 1.
The hamake dataflow was using generations to produce a new iteration of data.
This way the new iteration did not get processed in the same session, but
on the next run of HAMAKE.
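A minimal Python sketch of such a driver loop, purely for illustration; the
exact hamake invocation and the name of the condition-check script are
placeholders, not real hamake options:

    import subprocess

    # Rough sketch of the outer driver loop described above. The hamake
    # arguments and "check_exit_condition.sh" are placeholders.
    while True:
        # 1. Run hamake over the current generation of data.
        subprocess.run(["hamake"], check=True)

        # 2. Run a program that checks the exit condition
        #    (exit status 0 meaning "done" is an assumption of this sketch).
        done = subprocess.run(["./check_exit_condition.sh"]).returncode == 0

        # 3. If the condition is met, EXIT; otherwise GOTO step 1.
        if done:
            break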
I know it is a bit kludgy, but I can make an argument for it. I am deliberately
not putting the logic that decides whether the condition is satisfied inside
the dataflow. There are many different cases for this, and arguably it might be
tricky to make hamake handle all of them. For example, there might be a
time-based exit condition, or an exit when a certain threshold value is
reached, or an exit on an external event, etc.
We need to find a good way to generalize it without breaking the dataflow
principles of the hamake design. Perhaps instead of an imperative GOTO (as
originally suggested in this ticket) we can allow the user to plug in an
external script that decides whether a dependency is satisfied, extending the
current timestamp-based logic.
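As a sketch only (this interface does not exist in hamake today), such a
plug-in could be an external script that receives the dependency's target path
and reports via its exit code whether the dependency is already satisfied:

    #!/usr/bin/env python
    # Hypothetical plug-in predicate: exit 0 if the dependency is satisfied,
    # non-zero if the target still needs to be (re)built. The calling
    # convention (target path in argv[1]) and the local-filesystem check are
    # assumptions made purely for illustration.
    import os
    import sys
    import time

    target = sys.argv[1]

    # Example policy: treat the dependency as satisfied if the target exists
    # and was produced within the last hour (a time-based exit condition).
    if os.path.exists(target) and time.time() - os.path.getmtime(target) < 3600:
        sys.exit(0)   # satisfied; the task would be skipped
    sys.exit(1)       # not satisfied; the task would run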
I will be glad to hear arguments and suggestions. Thanks!
Original comment by kroko...@gmail.com
on 3 Aug 2011 at 7:43
It would be helpful for me if it were somehow taken care of within hamake, as
otherwise I'd be left in the position of having an "outer" hamakefile that
executes a script that executes hamake in a loop with a different hamakefile.
How about, instead of requiring hamake to support arbitrary conditions, using
the presence or updated timestamp of a specific output file after execution as
the stop condition? That way, the real condition logic would be left up to the
executed script.
The existing "generation" concept could be extended slightly to allow multiple
generations to be cycled through for such a rule during a single run of hamake,
and other rules could simply depend on the stop condition output in order to
ensure that they are run only after the cycling is complete.
-peter
Original comment by petenewc...@gmail.com
on 3 Aug 2011 at 1:35
Peter, I think your proposal could be a reasonable compromise between
supporting iterative calculation and keeping the HAMAKE dataflow approach. How
about the following:
1. We add an option to the HAMAKE command line specifying the number of
iterations. It could be either a number, or infinity, or perhaps a time limit.
The exact syntax is to be defined.
2. Hamake will work as it does now, but after processing is finished, it will
restart, processing new generations of data (if any). As soon as there is an
iteration which did not run any tasks (all dependencies are already satisfied),
HAMAKE will exit.
Now, to implement an exit condition, all you need to do is add a task which
conditionally generates a file and use this file as a dependency in your
processing tree. This way, if the exit condition is reached, the file is not
updated, the dependencies remain satisfied, and hamake will exit.
This is a rough idea; we will need to think through the exact details.
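For illustration, a sketch of what the exit-condition task in point 2 might
run; the file names and the threshold are made up for the example:

    #!/usr/bin/env python
    # Sketch of the exit-condition task: regenerate a marker file only while
    # further iterations are needed. If the marker is NOT updated, downstream
    # dependencies stay satisfied and hamake stops iterating.
    # "metrics.txt", "continue.marker" and the 0.01 threshold are placeholders.
    with open("metrics.txt") as f:
        error = float(f.read().strip())

    if error > 0.01:
        # Not converged yet: rewrite the marker so the next iteration triggers.
        with open("continue.marker", "w") as f:
            f.write(str(error))
    # Otherwise leave the marker untouched; dependencies remain satisfied.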
Original comment by kroko...@gmail.com
on 3 Aug 2011 at 6:24
That generally makes sense to me, though I would like to be able to have
multiple such "loops" running in parallel within a single hamake process, which
brings two things to mind:
1a. It seems like the iteration/repetition limit should be specified per-rule
instead of for the entire hamake process.
2a. This approach would seem to potentially impede parallel iterative
processes. E.g., if you had a 3-iteration process where each iteration takes 60
minutes to run and a 10-iteration process where each iteration takes 20 minutes
to run, then assuming adequate cluster capacity the complete job would take 320
minutes (3 * 60 + 7 * 20) instead of the 200 minutes (the longer of 3 * 60 and
10 * 20) it would take if the first 3 iterations of the 10-iteration process
weren't forced to synchronize with those of the 3-iteration process.
-peter
Original comment by petenewc...@gmail.com
on 3 Aug 2011 at 7:18
Here is another use case:
A simple fraud detection system uses a classifier. The classifier
decides whether a group of inputs is a fraudulent event. But there is
a problem with all classifiers: they are not perfect. Some percentage
of these classifications will be wrong. Evaluating a classifier gives
a "confusion matrix", a 2x2 matrix with rows true&false and columns
positive&negative and a count in each box.
Each false positive means that the case is then passed on to a person
who examines it. This costs a certain amount of money. Each false
negative means that a fraudulent situation is not noticed. This costs
a lot more money.
So, we want to find the set of "control knobs" for the fraud detection
algorithm where false positive cost * number of false positives ==
false negative cost * number of false negatives. To do this, we have
to run the classifier with a lot of different combinations, and
collect & compare the confusion matrices.
This workflow is "for Knob1 in 0 to 100; for Knob2 in 0 to 100 -> run
the same 10-step classifier workflow with (knob1, knob2) as
parameters", then "calculate optimal confusion matrix".
Outside of this you might do hill-climbing, which calculates ranges of
knob values to pursue until a confusion matrix satisfies the above
equality within some error. This is very definitely an imperative
workflow.
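A rough sketch of that sweep, with the classifier workflow and the cost values
stood in by placeholders; none of these names come from hamake itself:

    # Sketch of the knob sweep described above. run_classifier_workflow stands
    # in for the 10-step classifier workflow; the cost constants are
    # illustrative only.
    FP_COST = 10.0     # cost of a person reviewing a false positive
    FN_COST = 500.0    # cost of an unnoticed fraudulent event

    def run_classifier_workflow(knob1, knob2):
        """Placeholder for the 10-step classifier workflow.
        Returns a confusion matrix as (tp, fp, tn, fn) counts."""
        # Dummy counts so the sketch runs; the real version would launch the
        # map/reduce jobs with (knob1, knob2) as parameters.
        return 90, knob1, 900, knob2

    # Sweep the knobs and look for the setting where the two misclassification
    # costs balance: FP_COST * false positives == FN_COST * false negatives.
    imbalance = {}
    for knob1 in range(0, 101):
        for knob2 in range(0, 101):
            tp, fp, tn, fn = run_classifier_workflow(knob1, knob2)
            imbalance[(knob1, knob2)] = abs(FP_COST * fp - FN_COST * fn)

    best = min(imbalance, key=imbalance.get)
    print("knob setting with best cost balance:", best)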
Original comment by lance.no...@gmail.com
on 7 Aug 2011 at 12:02
Original issue reported on code.google.com by
lance.no...@gmail.com
on 31 Jul 2011 at 11:27