izgzhen opened this issue 5 years ago
@Calvin-L @izgzhen I have several questions:

1. What do you think the `stats` structure would look like? Can it be a class with fields for all the data it stores? Can it just be a dictionary created by the `improve_implementation` method?
2. Should the `stats` structure handle threading itself, or should `improve_implementation` handle threading? More specifically, since we are updating it in the `ImproveQueryJob` for-loop, are they sharing the same `stats` structure? Or are we somehow making one for each and then aggregating them into one in `improve_implementation`? Or maybe there is a better way than either of those.
3. How would we go about getting the difference in time between each `improve` loop iteration (the "forever" while-loop) if we want to preserve its responsibility and not have an out-parameter? Not messing with `improve` itself also creates some difficulties in implementing the improvement number limit.

Ultimately the design is up to you, but:
> What do you think the `stats` structure would look like?
It doesn't really matter, as long as it is documented. I would probably do something lightweight, like "a dictionary whose keys are method names and whose values are lists of the timestamps of when improvements were made".
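For illustration, here is a minimal sketch of that shape; the method names and timestamps are invented:

```python
import time

# Hypothetical `stats` structure: method name -> list of timestamps
# (seconds since the epoch) at which an improvement to that method was found.
stats = {
    "get_active_users": [1551200000.1, 1551200012.7],
    "count_orders": [1551200003.4],
}

# Recording a new improvement for some method:
stats.setdefault("get_active_users", []).append(time.time())
```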
> since we are updating it in the `ImproveQueryJob` for-loop, are they sharing the same `stats` structure?
That would be an example of shared mutable state, so I'm opposed to it. It would be much better for each job to collect its own local statistics and to aggregate them at the end. If you use the dictionary I proposed, each job would store a list and the aggregation would just arrange them into a dictionary by job name.
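A sketch of that aggregation step, assuming each job exposes a (hypothetical) `name` and a locally collected `improvement_times` list:

```python
def aggregate_stats(jobs):
    """Combine per-job timestamp lists into one dictionary keyed by job name.

    `job.name` and `job.improvement_times` are hypothetical attribute names;
    the real ImproveQueryJob fields may differ.
    """
    return {job.name: list(job.improvement_times) for job in jobs}
```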
> How would we go about getting the difference in time between each `improve` loop iteration
This can be done by the loop in `ImproveQueryJob`.
How comfortable are you with generators (sometimes called "coroutines") in Python?
```python
def generate():
    i = 0
    while i < 100:
        print("yielding...")
        yield i      # control returns to the caller here
        i += 1

for value in generate():
    print("got {}".format(value))
```
In the example, the generator `generate()` and the loop interleave: instead of printing 100 "yielding..." lines followed by 100 "got X" lines, the program alternates between "yielding..." and "got X".
The `improve` loop is a generator, so the time difference between yields can easily be calculated by the caller. Instead of printing "got X", compute the current time and subtract the previous time from it.
You may want to alter `improve` so that it yields once on every loop iteration, even if an improvement was not found. (In the other thread I wrote up a more detailed description of this idea.)
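For concreteness, a minimal sketch of that caller-side timing, using the toy `generate()` above as a stand-in for `improve`:

```python
import time

previous = time.time()
for value in generate():
    now = time.time()
    elapsed = now - previous   # time spent inside this iteration of the generator
    print("iteration produced {} after {:.3f}s".format(value, elapsed))
    previous = now
```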
> not messing with `improve` itself also creates some difficulties in implementing improvement number limit
Actually, it doesn't! The reason you were messing with `improve` was to make it check the improvement number and stop when it reached a threshold. However, `improve` already comes with a super-general stopping mechanism: the `stop_callback` parameter. All you should need to do is modify the stop callback passed by the `ImproveQueryJob`.
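As a sketch of that idea, assuming `stop_callback` is a zero-argument callable that returns True when improvement should stop (check the real signature in the Cozy source); the `count_improvements` parameter here is hypothetical:

```python
def with_improvement_limit(stop_callback, count_improvements, limit):
    """Wrap an existing stop callback so it also stops after `limit` improvements.

    `stop_callback` is assumed to be a zero-argument callable returning True
    to stop; `count_improvements` is a hypothetical zero-argument function
    reporting how many improvements this job has recorded so far.
    """
    def wrapped():
        return stop_callback() or count_improvements() >= limit
    return wrapped
```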
If you intend the limit to be global rather than per-job, then instead you should modify the top-level loop that collects results to call `stop_jobs(...)` and return when a threshold is reached.
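A rough sketch of that control flow; the `results`, `jobs`, and `stop_jobs` parameters are placeholders for the real top-level loop's state, not Cozy's actual API:

```python
def collect_with_global_limit(results, jobs, limit, stop_jobs):
    """Drain `results` (an iterable of improvement events) and stop all jobs
    once `limit` improvements have been seen across every job.

    All four parameters are hypothetical stand-ins; only the control flow
    (count globally, then stop everything and return) is the point.
    """
    total = 0
    for result in results:
        total += 1
        if total >= limit:
            stop_jobs(jobs)
            return
```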
Discussion started in https://github.com/CozySynthesizer/cozy/pull/93.