bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Adding additional optional args for decoding flags and AutoModel kwargs to support models like ReplitLM #115

Open madhavatreplit opened 1 year ago

madhavatreplit commented 1 year ago

Why

We require the ability to configure the tokenizer.decode call, as well as the model kwargs passed to AutoModelForCausalLM.from_pretrained, to support models like ReplitLM.
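For context, a minimal sketch of the decode behaviour in question (the checkpoint name and sample string are illustrative, not taken from this PR):

```python
from transformers import AutoTokenizer

# Illustrative: the ReplitLM tokenizer encodes whitespace as meaningful tokens,
# so the default decode-time clean-up can alter spacing in generated code.
tokenizer = AutoTokenizer.from_pretrained(
    "replit/replit-code-v1-3b", trust_remote_code=True
)

ids = tokenizer("def add(a, b):\n    return a + b")["input_ids"]

# Default behaviour: clean-up may move or collapse spaces around punctuation.
print(tokenizer.decode(ids, clean_up_tokenization_spaces=True))

# Disabling clean-up keeps the decoded text faithful to the generated tokens.
print(tokenizer.decode(ids, clean_up_tokenization_spaces=False))
```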

What changed

We add two input arguments with safe default behaviour to the main.py script:

  1. clean_up_tokenization_spaces : bool

    • this boolean flag is passed to tokenizer.decode to control whether tokenization spaces are cleaned up. The clean-up affects spacing, and therefore syntax, in generated code with certain tokenizers such as the ReplitLM tokenizer.
    • defaults to True; passing the flag stores False
  2. automodel_kwargs : a "stringified" JSON, parsed with json.loads

    • a JSON string specifying which default config values should be overridden in this harness to reproduce results.
    • the parsed key-value pairs are passed into AutoModelForCausalLM.from_pretrained as kwargs, where they update the default init config; see the transformers documentation for why and how this works. A wiring sketch follows this list.
    • defaults to the empty stringified JSON: "{}".
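A minimal sketch of how the two arguments might be wired into main.py, assuming the names above (the checkpoint and override key are illustrative placeholders, not from this PR's diff):

```python
import argparse
import json

from transformers import AutoModelForCausalLM, AutoTokenizer

parser = argparse.ArgumentParser()
# Defaults to True; passing --clean_up_tokenization_spaces stores False.
parser.add_argument(
    "--clean_up_tokenization_spaces",
    action="store_false",
    help="Disable clean-up of tokenization spaces in tokenizer.decode",
)
# argparse applies `type` to string defaults, so "{}" parses to an empty dict.
parser.add_argument(
    "--automodel_kwargs",
    type=json.loads,
    default="{}",
    help="Stringified JSON of kwargs forwarded to AutoModelForCausalLM.from_pretrained",
)
args = parser.parse_args()

# The parsed dict overrides default config values at model init
# (from_pretrained forwards unrecognized kwargs to the model config).
model = AutoModelForCausalLM.from_pretrained(
    "replit/replit-code-v1-3b",  # illustrative checkpoint
    trust_remote_code=True,
    **args.automodel_kwargs,
)
tokenizer = AutoTokenizer.from_pretrained(
    "replit/replit-code-v1-3b", trust_remote_code=True
)

# At post-processing time, the flag is threaded through to decode:
# text = tokenizer.decode(
#     token_ids, clean_up_tokenization_spaces=args.clean_up_tokenization_spaces
# )
```

Illustrative invocation: `python main.py --clean_up_tokenization_spaces --automodel_kwargs '{"init_device": "cuda"}'` (the override key is a placeholder).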

Rollout

[x] This is fully backward and forward compatible