Support custom calibration datasets

NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

https://nvidia.github.io/TensorRT-LLM

Apache License 2.0


Support custom calibration datasets #1762

Status: Closed (DreamGenX closed this 2 weeks ago)

DreamGenX commented 2 weeks ago

With this change, the user can supply a custom JSON or JSONL file with a text column for calibration.
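For readers who want to try this, here is a minimal sketch of producing such a file, assuming the column is literally named "text" and that each JSONL line holds one sample. The helper names `write_calibration_jsonl` and `read_text_column` are hypothetical and not part of TensorRT-LLM.

```python
import json
from pathlib import Path


def write_calibration_jsonl(texts, path):
    """Write one JSON object per line, each with a single "text" field.

    Assumption: the calibration loader reads a column named "text".
    """
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for text in texts:
            # json.dumps escapes embedded newlines, so each sample stays on one line.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
    return path


def read_text_column(path, column="text"):
    """Read the samples back, skipping blank lines, to sanity-check the file."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line)[column] for line in f if line.strip()]
```

Reading the file back with `read_text_column` before starting calibration is a cheap way to catch a misnamed column or a malformed line.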

nv-guomingz commented 2 weeks ago

Hi @syuoni, would you please review this PR?

syuoni commented 2 weeks ago

Thanks @DreamGenX. I've started integrating this PR into our internal repo, and it will probably be available in the next GitHub weekly update.

DreamGenX commented 2 weeks ago

Thank you @syuoni -- I just noticed it should probably also be added for the nemo codepath.

syuoni commented 2 weeks ago

Hi @DreamGenX, this PR uses dataset_name_or_dir as a single data file path, and supports json/jsonl only.

A more general way is to prepare an HF dataset repo (according to this instruction), and use the local path to this repo for calibration. This allows more complex data structures (e.g., multiple files) and formats (e.g., csv). Does this make sense to you?

syuoni commented 2 weeks ago

> Thank you @syuoni -- I just noticed it should probably also be added for the nemo codepath.

Yes, I will cover the nemo path. Thanks for the reminder.

syuoni commented 2 weeks ago

> Hi @DreamGenX, this PR uses dataset_name_or_dir as a single data file path, and supports json/jsonl only.
>
> A more general way is to prepare an HF dataset repo (according to this instruction), and use the local path to this repo for calibration. This allows more complex data structures (e.g., multiple files) and formats (e.g., csv). Does this make sense to you?

Hi @DreamGenX, we will go with this HF-repo-style support for custom calibration datasets.

If you have a data file path/to/data.json that you want to pass to dataset_name_or_dir, you can:

1. move it to a separate folder and rename it, for example to path/to/repo/train.json, then
2. pass path/to/repo to dataset_name_or_dir.
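The two steps above could be scripted roughly as follows. This is only a sketch: the helper name `stage_as_dataset_dir` is hypothetical, it copies the file rather than moving it so the original stays in place, and it keeps the source file's extension (so a .jsonl input becomes train.jsonl), which is an assumption beyond the train.json example given here.

```python
import shutil
from pathlib import Path


def stage_as_dataset_dir(data_file, repo_dir):
    """Copy a single data file into its own folder, named train.<ext>.

    Returns the folder path, which is what would be passed as
    dataset_name_or_dir. Hypothetical helper, not part of TensorRT-LLM.
    """
    src = Path(data_file)
    repo = Path(repo_dir)
    repo.mkdir(parents=True, exist_ok=True)
    # Assumption: the file name "train" marks the split; the suffix is preserved.
    dst = repo / f"train{src.suffix}"
    shutil.copy2(src, dst)
    return repo
```

For example, `stage_as_dataset_dir("path/to/data.json", "path/to/repo")` would leave path/to/repo/train.json in place, and path/to/repo is then the value to supply for dataset_name_or_dir.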