*Closed: bibikar closed this 4 weeks ago.*
copied from an old PR...
A bunch of this stuff is TODO; I'm just putting the plans here for now. Once we merge this PR, I think it would be fine to publish the aggregator. It's currently in a really messy state and there's no way we can publish it as-is.
This PR is a complete rewrite of the aggregator as an actually generic tool for manipulating tables of numbers. Instead of one big monolithic function that does everything in a fixed pipeline, we let the user specify the pipeline of execution. The user also gets a temporary namespace to put dataframes in. We retain the same functionality we had before, but configs will all need to be rewritten. We also unify the input and output sections into the pipeline concept as separate pipeline steps, and do away with the global axis/series/variants definition entirely for flexibility. Finally, we get rid of the meaningless `Benchmark` class, which was originally created to deal with very benchmark-specific things. The structure of that class was a big mess, and while we could still use OOP here, I'm leaning towards making the model as simple as possible.
The system is also no longer completely config-dependent: we could pass in a deserialized configuration as well, for example. Nesting pipelines is planned too (but I'm still thinking about exactly how to implement it).
The aggregator now operates in two big steps: configurations are first compiled into `Pipeline` objects which contain functions bound similarly to those created with `functools.partial`, and then the pipelines are executed. We also perform some sanity-checking on configurations at compile time so things don't fail after reading all the dataframes.

Valid pipeline steps are simply annotated functions. Python's `inspect` module is used to determine (using the annotations) where dataframes should be passed to the functions, and where parameters from the config should be bound. That means minimal effort for writing new valid pipeline steps, and much easier maintenance of the actual implementations which use pandas (the pipeline only passes dataframes around and doesn't really care about them beyond that). The main caveat is that config keys should now use underscores to separate words rather than dashes, since it's hard to write Python functions with parameter names containing dashes.
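The annotation-driven binding described above can be sketched roughly as follows. This is just a minimal illustration, not the PR's actual API: I'm assuming a `compile_step` helper and a `rename` step here, and the real implementation surely differs.

```python
import functools
import inspect

import pandas as pd

# Hypothetical pipeline step: parameters annotated with pd.DataFrame receive
# data at execution time; everything else gets bound from the config dict.
def rename(df: pd.DataFrame, **mapping) -> pd.DataFrame:
    """Rename columns according to the bound mapping."""
    return df.rename(columns=mapping)

def compile_step(func, config):
    """Bind config values to non-DataFrame parameters, like functools.partial."""
    sig = inspect.signature(func)
    df_params = {name for name, p in sig.parameters.items()
                 if p.annotation is pd.DataFrame}
    accepts_kwargs = any(p.kind is inspect.Parameter.VAR_KEYWORD
                         for p in sig.parameters.values())
    # Sanity-check the config now, so we don't fail after reading dataframes.
    for key in config:
        if key in df_params:
            raise ValueError(f"{key!r} is a dataframe slot, not a config option")
        if key not in sig.parameters and not accepts_kwargs:
            raise ValueError(f"step {func.__name__!r} has no parameter {key!r}")
    return functools.partial(func, **config)

step = compile_step(rename, {"Size": "Foo"})
df = pd.DataFrame({"Size": [1, 2]})
print(list(step(df).columns))  # -> ['Foo']
```

The point is that step authors only write plain annotated functions; the compiler figures out the wiring from signatures alone.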
Remaining TODOs:

- Split `Pipeline` into `BoundPipeline`. We don't actually want to always bind pipelines to variables - if they're defined inside another pipeline, there should be no implicit source/dest of `default`, either.
- We currently use `globals()`, and that's messy.
- Use `uncertain_panda` instead of vanilla pandas (#24)

Currently this is a toy example; I haven't written much of the functionality for this rewrite yet, but once that's done, the existing configurations can serve as examples. I'll try to keep this updated as the PR evolves. For example, I'm still not entirely sure how the separate buffers should be handled: should they just be space for one reference, or stacks of multiple DataFrame references? Currently, the DataFrame-manipulating functions are completely buffer-unaware, which is probably good. Do we even want these buffer spaces? They're definitely convenient, e.g. for pulling in reference data from some other table, or for pulling in separate tables and merging them within one config. I need to figure out how to handle this properly.
```yaml
# the entire file is just one big top-level pipeline executed from top to bottom
- input:
    file: '*.csv'
    format: csv  # we could possibly infer this as well in the future
    filter: {}  # the same filtering syntax that we had before
# since we're particularly whimsical today, let's rename the Size column to some meaningless name
- rename:
    Size: Foo
# we want only reasonable problem sizes
- filter_in:
    Foo: 50000000
# we want only data we care about
- filter_out:
    Implementation: linpack
# we want to compare against MKL
- set_column:
    Speedup over MKL:
      ratio_of:
        values: Time
        columns: [Prefix, Implementation]  # I write the list in this format because it's easier to read here
        reference: [Native-C, MKL]
# create a set of pivot tables. this is a crucial step, otherwise we'd just output one big messy table
- pivot_table:
    values: Ratio
    columns: [Prefix, Implementation]
    index: [Function, Accuracy]
    aggfunc: mean
    variants: [Arch]  # this might cause us to create multiple pivot tables
  dest: pivot  # save this to a different space than our normal data
# send this stuff to the specified output format
- output: {}  # empty dict here for no options... could just leave it empty for null as well
  src: pivot  # use the "pivot" buffer as the input for this pipeline step
```
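The `src`/`dest` buffer routing in the last two steps could work something like the sketch below. This is only a toy illustration of the idea under discussion, not the aggregator's real runner: the runner owns a dict of named DataFrame buffers, while the step functions themselves stay completely buffer-unaware.

```python
import pandas as pd

# Hypothetical step function: knows nothing about buffers, just dataframes.
def filter_in(df, **criteria):
    """Keep only rows matching every column == value criterion."""
    for column, value in criteria.items():
        df = df[df[column] == value]
    return df

def run_pipeline(steps, buffers):
    """Route dataframes between named buffers; 'default' is just another name."""
    for func, options, src, dest in steps:
        buffers[dest] = func(buffers[src], **options)
    return buffers

buffers = {"default": pd.DataFrame({"Foo": [1, 50000000], "Time": [2.0, 3.0]})}
steps = [
    # (step function, bound options, source buffer, destination buffer)
    (filter_in, {"Foo": 50000000}, "default", "default"),
    (filter_in, {"Time": 3.0}, "default", "pivot"),
]
buffers = run_pipeline(steps, buffers)
print(len(buffers["pivot"]))  # -> 1
```

In this model a buffer holds exactly one DataFrame reference; the open question above is whether stacks of references would be more useful.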
Here's some deeper discussion on the Python API for the aggregator.
Relates to #94 and #104. Also relates to #90, as releasing only the Python API might make it easier for us to support.
Current aggregator recipes are not super flexible. They basically force a certain workflow which requires many layers of indirection and boilerplate in configs. This generally makes me think too much about how to fix particular configs, e.g. by adding another layer of indirection to pivot rows onto columns or something like that (for examples, see PRs #96, #98, #99). While it's not too hard to do that for now, it's already very messy - note how many separate configs we have for random forests, logistic regression, and SVM!
A few problems with the current aggregator config structure include
Anton suggested that we could
For example, these Python configs could get access to both our functions and pandas functions, and anything else they need! We just have to package `automated_benchmarks` into a conda package.
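A purely illustrative sketch of what such a Python config might look like: `automated_benchmarks` is the package mentioned above, but the names imported from it (`read_csv_glob`, `write_output`) are made up here, and the pandas calls mirror the YAML example from the PR.

```python
# Hypothetical Python config mixing our helpers with plain pandas.
import pandas as pd
from automated_benchmarks import read_csv_glob, write_output  # hypothetical names

df = read_csv_glob("*.csv")                      # our input helper
df = df.rename(columns={"Size": "Foo"})          # plain pandas
df = df[df["Foo"] == 50000000]                   # plain pandas filtering
table = df.pivot_table(values="Ratio",
                       columns=["Prefix", "Implementation"],
                       index=["Function", "Accuracy"],
                       aggfunc="mean")
write_output(table)                              # our output helper
```

The appeal is that anything pandas can do is available directly, with no extra config indirection in between.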
I also want to make the Python API so simple to use that it basically supersedes the YAML configs. A user should really just be able to run `conda install pyrforator` and then write their config.