DreamGenX closed this 2 weeks ago
Hi @syuoni, would you please review this PR?
Thanks @DreamGenX. I've started integrating this PR into our internal repo, and it will probably be available in the next GitHub weekly update.
Thank you @syuoni -- I just noticed it should probably also be added for the nemo codepath.
Hi @DreamGenX, this PR uses dataset_name_or_dir as a single data file path, and supports json/jsonl only.
A more general way is to prepare a HF dataset repo (according to this instruction), and use the local path to this repo for calibration. This allows more complex data structures (e.g., multiple files) and formats (e.g., csv). Does this make sense to you?
Thank you @syuoni -- I just noticed it should probably also be added for the nemo codepath.
Yes, I will cover the nemo path. Thanks for the reminder.
Hi @DreamGenX, we will go with this HF-repo-style support for customized calibration datasets.
If you have a data file path/to/data.json that you want to pass to dataset_name_or_dir, you can just place it at path/to/repo/train.json, then pass path/to/repo to dataset_name_or_dir.
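For anyone who wants to script that step, here is a minimal sketch using only the Python standard library. The helper name make_calib_repo is hypothetical and not part of this PR. It also assumes that a .jsonl input should keep its extension (train.jsonl), which the thread does not confirm.

```python
import shutil
from pathlib import Path


def make_calib_repo(data_file: str, repo_dir: str) -> Path:
    """Copy a json/jsonl data file into repo_dir as train.<ext>.

    Hypothetical helper: it only reproduces the directory layout described
    in this thread (path/to/repo/train.json). Keeping the .jsonl extension
    for jsonl inputs is an assumption.
    """
    src = Path(data_file)
    suffix = src.suffix.lower()
    if suffix not in {".json", ".jsonl"}:
        raise ValueError(f"expected a .json or .jsonl file, got {src.name!r}")
    repo = Path(repo_dir)
    # Create the repo directory (and any missing parents) if needed.
    repo.mkdir(parents=True, exist_ok=True)
    # The file name "train" marks the data as the train split.
    shutil.copyfile(src, repo / f"train{suffix}")
    return repo
```

The returned directory (path/to/repo in the example above) is what you would then pass to dataset_name_or_dir.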
With this change, the user can supply a custom json or jsonl with a text column for calibration.
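As a rough illustration of the expected file shape, the sketch below reads such a file and checks that every record has a string text field. The function name load_calib_texts is hypothetical and is not the code in this PR. It assumes a .json file holds a top-level array of objects and a .jsonl file holds one object per line.

```python
import json
from pathlib import Path
from typing import List


def load_calib_texts(path: str, column: str = "text") -> List[str]:
    """Return the strings stored under `column` in a json/jsonl file.

    Assumptions (not confirmed by the PR): .json is a top-level array of
    objects, .jsonl is one JSON object per non-empty line.
    """
    p = Path(path)
    raw = p.read_text(encoding="utf-8")
    if p.suffix.lower() == ".jsonl":
        # Skip blank lines so a trailing newline does not break parsing.
        records = [json.loads(line) for line in raw.splitlines() if line.strip()]
    else:
        records = json.loads(raw)
    texts = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict) or not isinstance(rec.get(column), str):
            raise ValueError(f"record {i} has no string {column!r} field")
        texts.append(rec[column])
    return texts
```

Running a check like this before calibration catches a missing or misnamed text column early, rather than partway through quantization.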