delphi-suite / delphi

small language models training made easy
Apache License 2.0

tokenize dataset script with custom huggingface namespaces #93

Closed: joshuawe closed this 6 months ago

joshuawe commented 6 months ago

Currently we can only upload to the delphi-suite namespace on HF, but ideally we would like to upload to any namespace specified by the user. Only a small change is required, on line 75:

https://github.com/delphi-suite/delphi/blob/5b7ec89061c6d111c665678a97d895efe5414a53/scripts/tokenize_dataset.py#L74-L78

joshuawe commented 6 months ago

The same goes for loading the dataset and for using the pretrained tokenizer.
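One possible shape for the change, as a minimal sketch rather than the actual patch: the flag names (`--hf-namespace` and friends), `build_repo_id`, and `resolve_repo_ids` are hypothetical and not taken from `tokenize_dataset.py`. The idea is to make the namespace a CLI argument that defaults to delphi-suite and to build every HF repo id (input dataset, tokenizer, output dataset) through one helper.

```python
import argparse

# Assumption: keep the current behaviour as the default namespace.
DEFAULT_NAMESPACE = "delphi-suite"


def build_repo_id(name: str, namespace: str = DEFAULT_NAMESPACE) -> str:
    """Return 'namespace/name'. A name that already contains a
    namespace (i.e. has a '/') is returned unchanged."""
    if "/" in name:
        return name
    return f"{namespace}/{name}"


def parse_args(argv=None) -> argparse.Namespace:
    # Hypothetical flags; the real script's argument names may differ.
    parser = argparse.ArgumentParser(description="Tokenize a dataset and upload it to HF")
    parser.add_argument("--input-dataset-name", required=True)
    parser.add_argument("--output-dataset-name", required=True)
    parser.add_argument("--tokenizer-name", required=True)
    parser.add_argument("--hf-namespace", default=DEFAULT_NAMESPACE)
    return parser.parse_args(argv)


def resolve_repo_ids(args: argparse.Namespace) -> dict:
    """Apply the same namespace rule to all three HF resources."""
    return {
        "input_dataset": build_repo_id(args.input_dataset_name, args.hf_namespace),
        "tokenizer": build_repo_id(args.tokenizer_name, args.hf_namespace),
        "output_dataset": build_repo_id(args.output_dataset_name, args.hf_namespace),
    }
```

The resolved ids could then be passed to `datasets.load_dataset`, `transformers.AutoTokenizer.from_pretrained`, and `Dataset.push_to_hub` respectively. Letting a fully qualified name such as `someuser/some-tokenizer` bypass the namespace flag is a design choice in this sketch, so that resources from different namespaces can be mixed in one run.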