This PR adds the Shadereval tasks. Shadereval looks at "creative coding" by comparing models on shader code (GLSL with Shadertoy.com syntax). There are (currently) two tasks:
shadereval-1: ReturnCompletion, as seen in the demo space. Uses a subset of the Shadertoys-fine dataset to predict the return statement of functions. The metric used is `exact_match`. The original implementation is hosted as an `EvaluationSuite` inside the demo space.
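The `exact_match` scoring for task 1 can be sketched in a few lines; this is a minimal stand-in for illustration, not the hosted implementation (the function name and the verbatim-comparison assumption are mine):

```python
def exact_match(predictions, references):
    """Fraction of predictions identical to their reference return statement."""
    assert len(predictions) == len(references)
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(predictions)

# Hypothetical example: one verbatim hit out of two predictions.
preds = ["return vec3(1.0);", "return color * 0.5;"]
refs  = ["return vec3(1.0);", "return color;"]
print(exact_match(preds, refs))  # 0.5
```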
shadereval-2: FunctionCompletion, as seen in the demo space. Uses a processed version of the Shadertoys dataset, taking docstrings immediately before a function or at the top of its body as the prompt. Evaluated with a custom metric (hosted in the demo space) that first compares the text (minus some whitespace), and then renders frames of the prediction and the reference. If the code doesn't render an image it is counted as `code_error`; if the images match it's counted as `image_match`. The remaining cases are `variations`, but these are not directly calculated by the metric.
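The decision flow of the task-2 metric could look roughly like the sketch below. The `render` callable, the `normalize` helper, and the `text_match` label for the early whitespace-insensitive comparison are all hypothetical; only `code_error`, `image_match`, and `variation` come from the metric described above:

```python
import re

def normalize(code: str) -> str:
    # Collapse whitespace runs so formatting differences are ignored.
    return re.sub(r"\s+", " ", code).strip()

def classify(prediction: str, reference: str, render) -> str:
    """Assign a rough outcome label to one generation.

    `render` is an assumed callable: it returns a rendered frame
    (e.g. raw pixel bytes) or None if the shader fails to render.
    """
    if normalize(prediction) == normalize(reference):
        return "text_match"  # assumed label for the text-comparison shortcut
    pred_frame = render(prediction)
    if pred_frame is None:
        return "code_error"
    if pred_frame == render(reference):
        return "image_match"
    return "variation"
```

For a quick check, any stub renderer works; e.g. `classify("broken", "float x;", lambda s: None)` yields `"code_error"`.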
The paper for the first task and dataset has been written but not yet published; the second task is still in development. The goal is to finish the implementation by the end of this year. Most remaining changes will be made to the dataset or metric, which are not part of this PR. The code to run generations included here should be essentially complete.
Remaining tasks:
- [x] add documentation (including results)
- [ ] clean up various comments, especially #TODO:...
revival of #97