axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

pretrain doesn't work on json\jsonl #1895

Open SicariusSicariiStuff opened 2 months ago

SicariusSicariiStuff commented 2 months ago

Expected Behavior

Loading a local json/jsonl file should work the same as loading the dataset from HF.

Current behaviour

Axolotl asks for a custom .py script instead.

Steps to reproduce

Load a local json file:

pretraining_dataset: /home/sicarius/somefile.jsonl
    type: pretrain

Config yaml

pretraining_dataset: /home/sicarius/somefile.jsonl
    type: pretrain

Possible solution

Treat a local file the same way as a dataset loaded from the HF hub.
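For illustration only, here is a rough sketch of what streaming a local JSONL file as pretraining text could look like, using just the Python standard library. This is not Axolotl's loader. The function name `iter_jsonl_text` is hypothetical, and the `text` column name is an assumption about how the file is laid out (one JSON object per line with the sample under a `text` key).

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl_text(path: str, text_column: str = "text") -> Iterator[str]:
    """Yield the text field of every non-empty line in a JSONL file.

    Hypothetical helper: `text_column` is an assumed field name,
    not something confirmed by the Axolotl codebase.
    """
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)  # raises ValueError on malformed JSON
            if text_column not in record:
                raise KeyError(f"line {line_no}: missing {text_column!r} field")
            yield record[text_column]
```

Running something like `next(iter_jsonl_text("/home/sicarius/somefile.jsonl"))` before training is a quick way to confirm the file itself parses. For the real thing, the Hugging Face `datasets` library can already read local files through its `json` builder, e.g. `load_dataset("json", data_files=...)`, so the hub and local cases could plausibly share one code path.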

Python Version

3.10

axolotl branch-commit

latest release

NanoCode012 commented 1 week ago

Hey, sorry it's been a while. We are currently discussing internally how to better support this, and pre-training/SFT in general. We plan to extend support to local and cloud storage (S3, etc.).