IntelPython / bearysta

Pandas-based statistics aggregation tool
Apache License 2.0

Rethink aggregator recipes and the aggregator in general #14

Closed bibikar closed 4 weeks ago

bibikar commented 4 years ago

Here's some deeper discussion on Python API for the aggregator.

Relates to #94 and #104. Also relates to #90, as releasing only the Python API might make it easier for us to support.

Current aggregator recipes are not very flexible. They basically force a certain workflow, which requires many layers of indirection and boilerplate in configs. This generally makes me think too much about how to fix particular configs, e.g. by adding another layer of indirection to pivot rows onto columns or something like that (for examples, see PRs #96, #98, #99). While it's not too hard to do that for now, it's already very messy: note how many separate configs we have for random forests, logistic regression, and SVM!

A few problems with the current aggregator config structure include:

Anton suggested that we could:

For example, we could have these Python configs look like

from pyrforator import aggregate as agg
import pandas as pd

def recipe(path='data/sklearn*.csv', **options):  # other options elided in this sketch
  # read data...
  # preprocess can use the same format of regex -> (repl|drop|None),
  # or it could just be a function which returns the filtered line, or None to drop it!
  df = agg.read_csv(path, preprocess={'@ Package': 'drop'}, **options)  # plus pandas options

  # compare to native C
  df['Ratio'] = agg.ratio_of(df, columns=['Prefix'], values=['Time'], against=('Native-C',))

  return df

agg.run(recipe)

and then they get access to both our functions and pandas functions, and anything else they need! We just have to package automated_benchmarks into a conda package.

I also want to make the Python API so simple to use that it basically supersedes the YAML configs. A user should really just be able to run conda install pyrforator and then write their config.
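To make the sketch above concrete, here is one way the hypothetical `agg.ratio_of` could be implemented in pandas. The function name and signature come from the sketch, not from released code, and the direction of the ratio (reference time over measured time, i.e. speedup) is my assumption:

```python
import pandas as pd

def ratio_of(df, columns, values, against):
    """Hypothetical sketch: for each row, divide its `values` by those of the
    matching reference row, where the reference is the row whose `columns`
    equal `against` and whose remaining key columns match."""
    # Rows belonging to the reference combination, e.g. Prefix == 'Native-C'
    mask = (df[list(columns)] == pd.Series(dict(zip(columns, against)))).all(axis=1)
    # Key columns: everything that isn't a compared column or a value column
    keys = [c for c in df.columns if c not in list(columns) + list(values)]
    ref = df.loc[mask, keys + list(values)].rename(
        columns={v: v + '_ref' for v in values})
    merged = df.merge(ref, on=keys, how='left')
    # Reference value divided by this row's value, i.e. speedup over reference
    return merged[values[0] + '_ref'] / merged[values[0]]

# Usage, mirroring the recipe sketch above:
df = pd.DataFrame({'Prefix': ['Native-C', 'IDP'],
                   'Function': ['dot', 'dot'],
                   'Time': [1.0, 2.0]})
df['Ratio'] = ratio_of(df, columns=['Prefix'], values=['Time'],
                       against=('Native-C',)).to_numpy()
```

A plain merge like this keeps the step a pure DataFrame-in, Series-out function, which fits the "users get pandas plus our helpers" idea.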

bibikar commented 4 years ago

copied from an old PR...

A bunch of this stuff is still TODO; I'm just putting the plans here. Once we merge this PR, I think it would be fine to publish the aggregator. The aggregator is currently in a really messy state and there's no way we can publish it as-is.

This PR is a complete rewrite of the aggregator as an actually generic tool to manipulate tables of numbers. Instead of having one big monolithic function which does things in a fixed pipeline, we allow the user to specify the pipeline of execution. The user also gets a temporary namespace to put dataframes in.

We retain the same functionality we had before, but configs will all need to be rewritten. We also unify both the input and output sections into the pipeline concept as separate pipeline steps, and do away with the global axis/series/variants definition entirely for flexibility.

We also get rid of the meaningless Benchmark class, which was created originally to deal with very benchmark-specific things. The structure of that class was a big mess, and while we can still use OOP here, I'm leaning towards making the model as simple as possible.

The entire system is no longer completely config-dependent; we could pass in a deserialized configuration as well, for example. Nesting pipelines is also planned (but I'm still thinking about exactly how to implement it).

The aggregator now operates in two big steps:

  1. Read configs and construct pipelines. We transform configurations into Pipeline objects which contain functions bound similarly to those created with functools.partial. We also perform some sanity-checking on configurations here so things don't fail after reading all the dataframes.
  2. Execute pipelines. We now perform the computation on the actual data. Because input sections are just pipeline steps, reading the data actually only happens now, and the entire pipeline is executed. Output sections are also the same way.
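The two-phase model above could be sketched roughly as follows. `Pipeline` and the `steps` registry are illustrative names, not the actual bearysta implementation, and I assume here that each config section holds exactly one step (ignoring extras like `src`/`dest`):

```python
import functools
import pandas as pd

class Pipeline:
    """Phase 1: bind each config section to a step function (like
    functools.partial), sanity-checking step names before any data is read.
    Phase 2: run() executes the bound steps over a dataframe."""

    def __init__(self, config, steps):
        self.bound = []
        for section in config:
            (name, params), = section.items()  # each section is {step: params}
            if name not in steps:
                raise ValueError(f'unknown pipeline step: {name}')
            self.bound.append(functools.partial(steps[name], **(params or {})))

    def run(self, df=None):
        for step in self.bound:
            df = step(df)  # each step takes a dataframe and returns one
        return df

# A toy step registry; real steps would be the annotated functions
# described below.
steps = {
    'rename': lambda df, **mapping: df.rename(columns=mapping),
    'filter_out': lambda df, **cols:
        df[~(df[list(cols)] == pd.Series(cols)).any(axis=1)],
}

pipe = Pipeline([{'rename': {'Size': 'Foo'}},
                 {'filter_out': {'Implementation': 'linpack'}}], steps)
out = pipe.run(pd.DataFrame({'Size': [1, 2],
                             'Implementation': ['mkl', 'linpack']}))
```

Because binding happens before execution, a typo in a step name fails fast in phase 1 instead of after all the CSVs have been read.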

Valid pipeline steps are simply annotated functions. Python's inspect module is used to determine (using the annotations) where dataframes should be passed to the functions, and where parameters from the config should be bound. That means minimal effort for writing new valid pipeline steps, and much easier maintenance of the actual implementations which use pandas (the pipeline only passes dataframes around and doesn't otherwise care about them). The main caveat is that config keys should now generally use underscores to separate words rather than dashes, since Python parameter names can't contain dashes.
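The annotation-driven dispatch could look something like this. `call_step` and the `filter_in` signature are illustrative guesses, not the actual internals:

```python
import inspect
import pandas as pd

def call_step(func, df, params):
    """Bind `df` to every parameter annotated pd.DataFrame and fill the
    rest from the config dict, so step authors just write plain functions."""
    kwargs = {}
    for name, p in inspect.signature(func).parameters.items():
        if p.annotation is pd.DataFrame:
            kwargs[name] = df
        elif name in params:
            kwargs[name] = params[name]
        elif p.default is inspect.Parameter.empty:
            raise TypeError(f'config is missing required key: {name}')
    return func(**kwargs)

# A pipeline step is then just an annotated function.  Note the underscore
# in keep_values: config keys map to Python parameter names, so underscores
# are preferred over dashes.
def filter_in(df: pd.DataFrame, column: str, keep_values: list) -> pd.DataFrame:
    return df[df[column].isin(keep_values)]

df = pd.DataFrame({'Foo': [50_000_000, 7], 'Time': [1.0, 2.0]})
out = call_step(filter_in, df, {'column': 'Foo', 'keep_values': [50_000_000]})
```

The step implementation never sees the pipeline machinery at all, which is what keeps new steps cheap to write.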

TODO that really should be done before merging, so we don't lose functionality or have a broken project

optional TODO that could be in other PRs

Example

Currently this is a toy example; I haven't written much of the functionality for this rewrite yet, but once that's done, the existing configurations can serve as examples. I'll try to keep this updated as the PR evolves.

I'm also still not entirely sure how the separate buffers should be handled: should they just be space for one reference, or stacks of multiple DataFrame references? Currently, the DataFrame-manipulating functions are completely buffer-unaware, which is probably good. Do we even want these buffer spaces? They're definitely convenient, e.g. for pulling in reference data from some other table, or for pulling in separate tables and merging them in this config. We need to figure out how to handle this properly.

# the entire file is just one big top-level pipeline executed from top to bottom
- input:
    file: '*.csv'
    format: csv # we could possibly infer this as well in the future
    filter: {} # the same filtering syntax that we had before
# since we're particularly whimsical today, let's rename Size column to some meaningless name
- rename:
    Size: Foo
# we want only reasonable problem sizes
- filter_in: 
    Foo: 50000000
# we want only data we care about
- filter_out:
    Implementation: linpack
# we want to compare against MKL
- set_column:
    Speedup over MKL:
      ratio_of:
        values: Time
        columns: [Prefix, Implementation] # I write the list in this format because it's easier to read here
        reference: [Native-C, MKL]
# create a set of pivot tables. this is a crucial step, otherwise we'd just output one big messy table
- pivot_table:
    values: Speedup over MKL # the column we created in set_column above
    columns: [Prefix, Implementation]
    index: [Function, Accuracy]
    aggfunc: mean
    variants: [Arch] # this might cause us to create multiple pivot tables
  dest: pivot # save this to a different space than our normal data
# send this stuff to the specified output format
- output: {} # empty dict here for no options... could just leave it empty for null as well
  src: pivot # use the "pivot" buffer as the input for this pipeline step
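The `variants:` key in the pivot_table step above might expand into multiple pivot tables roughly like this. This is a sketch of the semantics as I read them, not the actual bearysta code, and `pivot_tables` is a hypothetical name:

```python
import pandas as pd

def pivot_tables(df, values, columns, index, aggfunc='mean', variants=None):
    """Yield (variant, table) pairs: one pivot table per unique combination
    of the `variants` columns, or a single table if variants is empty."""
    groups = df.groupby(variants) if variants else [(None, df)]
    for variant, sub in groups:
        yield variant, sub.pivot_table(values=values, columns=columns,
                                       index=index, aggfunc=aggfunc)

# E.g. with variants=['Arch'], one table per architecture:
df = pd.DataFrame({'Arch': ['x86', 'x86', 'arm'],
                   'Function': ['dot', 'dot', 'dot'],
                   'Prefix': ['A', 'B', 'A'],
                   'Time': [2.0, 4.0, 3.0]})
tables = [t for _, t in pivot_tables(df, values='Time', columns=['Prefix'],
                                     index=['Function'], variants=['Arch'])]
```

Splitting on variants before pivoting keeps each output table small and readable instead of producing one big messy table, which is the point of this step.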