Why

We require the ability to configure the tokenizer.decode call, as well as model args in AutoModelForCausalLM.from_pretrained, to support models like ReplitLM.

What changed
We add two input arguments with safe default behaviour to the main.py script:
clean_up_tokenization_spaces : bool
- this boolean flag is passed to tokenizer.decode to prevent tokenization spaces from being cleaned up. This flag affects spacing, and therefore syntax, in generated code with certain tokenizers such as the ReplitLM tokenizer.
- defaults to True; passing the flag stores False
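A "defaults to True, stores False" flag maps naturally onto an argparse store_false action. A minimal sketch of how this could be wired up (the flag name matches the PR; the parser wiring itself is an assumption about main.py):

```python
import argparse

parser = argparse.ArgumentParser()

# store_false: the value defaults to True, and passing the flag stores False,
# which keeps the existing decode behaviour unless the caller opts out.
parser.add_argument(
    "--clean_up_tokenization_spaces",
    action="store_false",
    help="Disable cleanup of tokenization spaces in tokenizer.decode",
)

# No flag passed: default True (backward-compatible behaviour).
default_args = parser.parse_args([])
print(default_args.clean_up_tokenization_spaces)  # → True

# Flag passed: stores False, preserving spacing for tokenizers like ReplitLM's.
args = parser.parse_args(["--clean_up_tokenization_spaces"])
print(args.clean_up_tokenization_spaces)  # → False
```

The parsed value would then be forwarded directly, e.g. tokenizer.decode(ids, clean_up_tokenization_spaces=args.clean_up_tokenization_spaces).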
automodel_kwargs : parsed with json.loads, i.e. a "stringified" JSON
- a "stringified" JSON that sets which default config values should be overridden in this harness to reproduce results.
- defaults to "{}", so no config values are overridden unless the argument is passed
- updates the default init config key-values by being passed into AutoModelForCausalLM.from_pretrained as kwargs. See the transformers documentation for why and how this works.
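A sketch of the parse-and-unpack path, assuming the default is the string "{}" (the specific config keys shown are illustrative, not prescribed by this PR):

```python
import json

# The argument arrives as a string; the default "{}" parses to an empty dict,
# so unpacking it into from_pretrained overrides nothing.
default_kwargs = json.loads("{}")

# A caller reproducing ReplitLM-style results might pass something like:
automodel_kwargs = json.loads('{"attn_impl": "triton", "init_device": "cuda:0"}')

# Hypothetical call site: from_pretrained forwards unrecognized kwargs to the
# model config, so each key-value here overrides the config's default.
# model = AutoModelForCausalLM.from_pretrained(
#     checkpoint,
#     trust_remote_code=True,
#     **automodel_kwargs,
# )
print(default_kwargs)     # → {}
print(automodel_kwargs)   # → {'attn_impl': 'triton', 'init_device': 'cuda:0'}
```

Because json.loads("{}") yields an empty dict, **automodel_kwargs expands to no arguments at all in the default case, which is what makes the change backward compatible.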
Why
We require the ability to configure the
tokenizer.decode
call, as well as model args in theAutoModelForCausalLM.from_pretrained
to support models like ReplitLM.What changed
We add two input arguments with safe default behaviour to the
main.py
script:clean_up_tokenization_spaces
:bool
tokenizer.decode
to prevent tokenization spaces from being cleaned up. This flag affects spacing and therefore syntax in generated code with certain tokenizers such as the ReplitLM tokenizer.True
, storesFalse
automodel_kwargs
:json.loads
, aka. a "stringified" JSONAutoModelForCausalLM.from_pretrained
askwargs
. See the logic of why and how this works here in thetransformers
documentation."{}"
Rollout
[x] This is fully backward and forward compatible