[ ] RL: switch from supervising on correct programs (which biases the model toward the training samples, i.e. proportional to their sampling ratio, which scales poorly) to RL approximations based on behavioral correspondence, which does not in itself yield a gradient (see the surrogate-loss sketch after these items).
[ ] use type-checks as feedback for partial programs
[ ] penalize by loss
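a rough sketch of what such a surrogate objective could look like, assuming a REINFORCE-style estimator (all names here are hypothetical, not existing code):
-- REINFORCE-style surrogate: the behavioral reward carries no gradient itself,
-- so scale the log-probability of the sampled program by it (minus a baseline) and negate
surrogateLoss :: Float  -- ^ behavioral reward, e.g. fraction of matching outputs
              -> Float  -- ^ baseline (e.g. running mean reward) to reduce variance
              -> Float  -- ^ summed log-probability of the sampled hole fills
              -> Float
surrogateLoss reward baseline logProb = negate ((reward - baseline) * logProb)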
[ ] consider non-binary loss functions:
-- assuming `import qualified Data.HashMap.Strict as HashMap` for the map operations below
let outputs_match :: Float = fromIntegral (length (filter id output_matches)) / fromIntegral (length output_matches)
-- for each param type instance combo, check if the program compiles (and get outputs...)
let type_compiles :: HashMap [Tp] Bool = not . null <$> prediction_type_ios
let num_in_type_instances :: Int = HashMap.size type_compiles
let num_in_type_instances_compile :: Int = HashMap.size $ HashMap.filter id type_compiles
let num_errors :: Int = num_in_type_instances - num_in_type_instances_compile
let ratio_compiles :: Float = fromIntegral num_in_type_instances_compile / fromIntegral num_in_type_instances
let ratio_errors :: Float = 1 - ratio_compiles
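one possible way to blend these into a single non-binary signal (illustrative only, not a settled choice):
-- illustrative only: reward compile rate and output matches jointly, so partially
-- correct programs get a smaller penalty than outright failures
let soft_loss :: Float = 1 - ratio_compiles * outputs_match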
[ ] compiler built-in hole option filter, either by interpreter, compiler API or custom type logic - out of scope -- while this would seem ideal and more performant than our current approach, for our current experiment we may settle for just interpreter-compiling every candidate program.
[ ] try out tree-based encoding methods (similar to R3NN) instead of just serializing expressions to string
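a minimal sketch of the tree-structured alternative: fold child embeddings bottom-up instead of serializing (the ExprTree shape and the combine function are hypothetical stand-ins, not the real AST):
-- hypothetical simplified expression shape
data ExprTree = Leaf String | Node String [ExprTree]

-- bottom-up fold: embed leaves, then combine child embeddings at each node,
-- roughly the recursive-pass idea behind R3NN
encodeTree :: (String -> emb) -> (String -> [emb] -> emb) -> ExprTree -> emb
encodeTree embedLeaf combine = go
  where
    go (Leaf s)      = embedLeaf s
    go (Node s kids) = combine s (map go kids)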
[ ] regularize weights: in PyTorch this is done by Adam's weight_decay param; in HaskTorch I don't see that, though the source mentions decay w.r.t. the moment params -- still need to figure out what that means and which param to set. The Adam paper mentions decaying the moment matrices based on their respective beta params; it also mentions weight decay, but at first glance I'm not sure these are the same thing. A Slack thread says it's unimplemented and suggests mimicking grad = grad.add(p, alpha=group['weight_decay']).
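for reference, that PyTorch line just adds weight_decay * param to the gradient before the Adam step; a plain-Haskell sketch of the same idea (tensors flattened to lists purely for illustration):
-- mimic grad = grad.add(p, alpha=weight_decay): fold L2-style decay into the gradient
addWeightDecay :: Float -> [Float] -> [Float] -> [Float]
addWeightDecay weightDecay params grads = zipWith (\p g -> g + weightDecay * p) params grads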
[ ] hole sampling: deterministically pick the first hole vs. uniform sampling; consider biasing toward deeper fills
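a sketch of the depth-biased variant, weighting each hole by its depth (the list of depths would come from a hypothetical helper that reports each hole's depth):
import System.Random (randomRIO)

-- pick a hole index with probability proportional to (1 + depth), biasing toward deeper holes
sampleHoleIndex :: [Int] -> IO Int
sampleHoleIndex depths = do
  let weights = map (fromIntegral . (+ 1)) depths :: [Double]
  r <- randomRIO (0, sum weights)
  let go acc ((i, w) : rest) | acc + w >= r || null rest = i
                             | otherwise                 = go (acc + w) rest
      go _ [] = 0  -- only reached for an empty hole list
  return $ go 0 (zip [0 ..] weights)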
[ ] scour the literature for ideas, e.g. deduction
propagation of input-output examples to the smallest tree containing all holes, so as to map the encodings to the most precise space. this should mean inversely applying any pruned super-functions to the output examples. for lossless functions like map this should yield propagated output examples; for lossy functions such as filter it would instead yield type constraints (we know the elements that satisfied the condition, but around and between those there could be any number of elements that did not). note I dunno to what extent such constraints would still be faster than fully calculating the input-output for the various synthesized programs...
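a toy illustration of that asymmetry: inverting map over the output examples is exact (given an inverse for f), whereas inverting filter only yields a constraint (types and names simplified):
-- inverting `map f`: each output element maps back to exactly one input element
invertMap :: (b -> a) -> [b] -> [a]
invertMap fInv = map fInv

-- inverting `filter p`: the surviving elements are known in order, but arbitrarily many
-- rejected elements may sit around and between them, so only a constraint remains
data FilterConstraint a = ContainsInOrder [a]  -- interleaved elements must additionally fail p

invertFilter :: [a] -> FilterConstraint a
invertFilter = ContainsInOrder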
[ ] perform hyperparam search by Bayesian optimization
[ ] consider which hyper-params have been / should be shared across networks
[ ] play with hyper-params, check if these need splitting
[ ] test grammar of unrolled operators vs original grammar (expr -> (expr expr) / vars)
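for reference, the original grammar written as a simplified, hypothetical data type; the unrolled variant would instead give each operator its own fixed-arity production:
-- expr -> (expr expr) / vars, i.e. binary application or a variable / operator leaf
data Expr = App Expr Expr | Var String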
[ ] for supervising with types, optionally replace / supplement the instantiated i/o types with the abstract one based on the task fn type -- if this turns out useful, figure out whether we can deterministically reconstruct it from the instantiated signatures somehow.
[ ] figure out how to propagate type constraints to the holes
[ ] add these as features either before or after the passes
conceptual:
[ ] for synthesized programs, we apply the same maximum number of holes as was used to generate this dataset. this gives our synthesizer enough power to construct the desired programs, while disallowing programs more complex than the maximum generated complexity. this choice is arbitrary; I have yet to think of a nicer solution.
[ ] figure out how to more elegantly limit complexity of synthesized programs (i.e. revise synthMaxHoles, which currently can maybe be deemed unfair for accuracy sampling)
Generation:
In order to have a tractable number of programs, we limited the maximum number of instructions for programs to be 13. Length 13 programs are important for this specific DSL because all larger programs can be written as compositions of sub-programs of length at most 13. The semantics of length 13 programs therefore constitute the “atoms” of this particular DSL.
for synthesis no such limitations are mentioned.
[ ] to give R3NN a fixed number of samples for its LSTMs, I'm sampling the actual features to make up for potentially multiple type instances yielding a variable number of i/o samples. I opted for sampling with replacement, which both handles sample sizes exceeding the number of items more naturally and seems to match the spirit of mini-batching by providing more stochastic gradients. it would be nice to experimentally compare both options tho.
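the with-replacement sampling step, sketched with a fixed target count:
import Control.Monad (replicateM)
import System.Random (randomRIO)

-- draw exactly n samples with replacement, so the result size is fixed even when
-- the pool is smaller (or larger) than n; assumes a non-empty pool
sampleWithReplacement :: Int -> [a] -> IO [a]
sampleWithReplacement n xs = do
  idxs <- replicateM n $ randomRIO (0, length xs - 1)
  return $ map (xs !!) idxs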