I like this idea a lot. I'm currently doing this sort of thing inside a shell
script, but the problem is that if some transient error occurs, the whole thing
starts over.
Original comment by petenewc...@gmail.com
on 3 Aug 2011 at 1:27
If your dataflow is properly designed it won't need to start everything over,
only resume incrementally from the part that failed to calculate.
To make an analogy with the Unix 'make' utility: if there is a syntax error in
one source file, it does not require you to recompile the whole project, only
the parts that depend on it.
Original comment by kroko...@gmail.com
on 3 Aug 2011 at 2:04
Reading the original post more carefully, for this kind of calculation (fixed
point, or trying to reach some error threshold) something like this might be
useful. I am a bit cautious about introducing imperative semantics into
HAMAKE, as it would require a major review of the processing model.
Original comment by kroko...@gmail.com
on 3 Aug 2011 at 2:07
Exactly, that's what this feature would allow me to do. Currently I can't
describe the complete dataflow to hamake because the number of steps in it is
dynamic.
Original comment by petenewc...@gmail.com
on 3 Aug 2011 at 2:08
A more concrete statement of this use case: a repetitive computation starts
with a set of data. This goes through several map/reduce stages. The last job
scans the resulting dataset and makes a judgement: "I'm not done". At this
point, somehow, the first job starts with the output of the next-to-last
computation. It does not restart with the original data.
Original comment by lance.no...@gmail.com
on 3 Aug 2011 at 3:37
I was doing something similar via generations. In fact, this was the reason the
feature was introduced. I've used a shell script to do the following (a rough
sketch of this driver loop is shown below):
1. Run hamake.
2. Run a program to check an exit condition.
3. If the condition is met, EXIT; if not, GOTO 1.
The hamake dataflow was using generations to produce a new iteration of data.
This way the new iteration did not get processed in the same session, but
on the next run of HAMAKE.
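A minimal Python sketch of such a driver loop, purely for illustration; the
exact hamake invocation and the name of the condition-check script are
placeholders, not real hamake options:

    import subprocess

    # Rough sketch of the outer driver loop described above. The hamake
    # arguments and "check_exit_condition.sh" are placeholders.
    while True:
        # 1. Run hamake over the current generation of data.
        subprocess.run(["hamake"], check=True)

        # 2. Run a program that checks the exit condition
        #    (exit status 0 meaning "done" is an assumption of this sketch).
        done = subprocess.run(["./check_exit_condition.sh"]).returncode == 0

        # 3. If the condition is met, EXIT; otherwise GOTO step 1.
        if done:
            break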
I know it is a bit kludgy, but I can make an argument for it. I am deliberately
not putting the logic that decides whether the condition is satisfied inside
the dataflow. There are many different cases for this, and arguably it might be
tricky to make hamake handle all of them. For example, there might be a
time-based exit condition, or an exit when a certain threshold value is
reached, or an exit on an external event, etc.
We need to find a good way to generalize it without breaking the dataflow
principles of the hamake design. Perhaps instead of an imperative GOTO (as
originally suggested in this ticket) we can allow the user to plug in an
external script that decides whether a dependency is satisfied, extending the
current timestamp-based logic.
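As a sketch only (this interface does not exist in hamake today), such a
plug-in could be an external script that receives the dependency's target path
and reports via its exit code whether the dependency is already satisfied:

    #!/usr/bin/env python
    # Hypothetical plug-in predicate: exit 0 if the dependency is satisfied,
    # non-zero if the target still needs to be (re)built. The calling
    # convention (target path in argv[1]) and the local-filesystem check are
    # assumptions made purely for illustration.
    import os
    import sys
    import time

    target = sys.argv[1]

    # Example policy: treat the dependency as satisfied if the target exists
    # and was produced within the last hour (a time-based exit condition).
    if os.path.exists(target) and time.time() - os.path.getmtime(target) < 3600:
        sys.exit(0)   # satisfied; the task would be skipped
    sys.exit(1)       # not satisfied; the task would run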
I will be glad to hear arguments and suggestions. Thanks!
Original comment by kroko...@gmail.com
on 3 Aug 2011 at 7:43
It would be helpful for me if it were somehow taken care of within hamake, as
otherwise I'd be left in the position of having an "outer" hamakefile that
executes a script that executes hamake in a loop with a different hamakefile.
How about, instead of requiring hamake to support arbitrary conditions, using
the presence or updated timestamp of a specific output file after execution as
the stop condition? That way, the real condition logic would be left up to the
executed script.
The existing "generation" concept could be extended slightly to allow multiple
generations to be cycled through for such a rule during a single run of hamake,
and other rules could simply depend on the stop condition output in order to
ensure that they are run only after the cycling is complete.
-peter
Original comment by petenewc...@gmail.com
on 3 Aug 2011 at 1:35
Peter, I think your proposal could be a reasonable compromise between
supporting iterative calculation and keeping the HAMAKE dataflow approach. How
about the following:
1. We add an option to the HAMAKE command line specifying the number of
iterations. It could be either a number, or infinity, or perhaps a time limit.
The exact syntax is to be defined.
2. Hamake will work as it does now, but after processing is finished, it will
restart, processing new generations of data (if any). As soon as there is an
iteration which did not run any tasks (all dependencies are already satisfied),
HAMAKE will exit.
Now, to implement an exit condition, all you need to do is add a task which
conditionally generates a file and use this file as a dependency in your
processing tree. This way, if the exit condition is reached, the file is not
updated, the dependencies remain satisfied, and hamake will exit.
This is a rough idea; we will need to think through the exact details.
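For illustration, a sketch of what the exit-condition task in point 2 might
run; the file names and the threshold are made up for the example:

    #!/usr/bin/env python
    # Sketch of the exit-condition task: regenerate a marker file only while
    # further iterations are needed. If the marker is NOT updated, downstream
    # dependencies stay satisfied and hamake stops iterating.
    # "metrics.txt", "continue.marker" and the 0.01 threshold are placeholders.
    with open("metrics.txt") as f:
        error = float(f.read().strip())

    if error > 0.01:
        # Not converged yet: rewrite the marker so the next iteration triggers.
        with open("continue.marker", "w") as f:
            f.write(str(error))
    # Otherwise leave the marker untouched; dependencies remain satisfied.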
Original comment by kroko...@gmail.com
on 3 Aug 2011 at 6:24
That generally makes sense to me, though I would like to be able to have
multiple such "loops" running in parallel within a single hamake process, which
brings two things to mind:
1a. It seems like the iteration/repetition limit should be specified per-rule
instead of for the entire hamake process.
2a. This approach would seem to potentially impede parallel iterative
processes. E.g., if you had a 3-iteration process where each iteration takes 60
minutes to run and a 10-iteration process where each iteration takes 20 minutes
to run, then assuming adequate cluster capacity the complete job would take 320
minutes (3 * 60 + 7 * 20) instead of the 200 minutes (the longer of 3 * 60 and
10 * 20) it would take if the first 3 iterations of the 10-iteration process
weren't forced to synchronize with those of the 3-iteration process.
-peter
Original comment by petenewc...@gmail.com
on 3 Aug 2011 at 7:18
Here is another use case:
A simple fraud detection system uses a classifier. The classifier
decides whether a group of inputs is a fraudulent event. But there is
a problem with all classifiers: they are not perfect. Some percentage
of these classifications will be wrong. Evaluating a classifier gives
a "confusion matrix", a 2x2 matrix with rows true&false and columns
positive&negative and a count in each box.
Each false positive means that the case is then passed on to a person
who examines it. This costs a certain amount of money. Each false
negative means that a fraudulent situation is not noticed. This costs
a lot more money.
So, we want to find the set of "control knobs" for the fraud detection
algorithm where false positive cost * number of false positives ==
false negative cost * number of false negatives. To do this, we have
to run the classifier with a lot of different combinations, and
collect & compare the confusion matrices.
This workflow is "for Knob1 in 0 to 100; for Knob2 in 0 to 100 -> run
the same 10-step classifier workflow with (knob1, knob2) as
parameters", then "calculate optimal confusion matrix".
Outside of this you might do hill-climbing, which calculates ranges of
knob values to pursue until a confusion matrix satisfies the above
equality within some error. This is very definitely an imperative
workflow.
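A rough sketch of that sweep, with the classifier workflow and the cost values
stood in by placeholders; none of these names come from hamake itself:

    # Sketch of the knob sweep described above. run_classifier_workflow stands
    # in for the 10-step classifier workflow; the cost constants are
    # illustrative only.
    FP_COST = 10.0     # cost of a person reviewing a false positive
    FN_COST = 500.0    # cost of an unnoticed fraudulent event

    def run_classifier_workflow(knob1, knob2):
        """Placeholder for the 10-step classifier workflow.
        Returns a confusion matrix as (tp, fp, tn, fn) counts."""
        # Dummy counts so the sketch runs; the real version would launch the
        # map/reduce jobs with (knob1, knob2) as parameters.
        return 90, knob1, 900, knob2

    # Sweep the knobs and look for the setting where the two misclassification
    # costs balance: FP_COST * false positives == FN_COST * false negatives.
    imbalance = {}
    for knob1 in range(0, 101):
        for knob2 in range(0, 101):
            tp, fp, tn, fn = run_classifier_workflow(knob1, knob2)
            imbalance[(knob1, knob2)] = abs(FP_COST * fp - FN_COST * fn)

    best = min(imbalance, key=imbalance.get)
    print("knob setting with best cost balance:", best)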
Original comment by lance.no...@gmail.com
on 7 Aug 2011 at 12:02
Original issue reported on code.google.com by
lance.no...@gmail.com
on 31 Jul 2011 at 11:27