NVIDIA / NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs
Apache License 2.0

Added a translation pipeline for ctranslate2 inference #245

Open uahmed93 opened 2 months ago

uahmed93 commented 2 months ago

Description

This PR adds a translation pipeline that runs inference with ctranslate2 models. It will only work once CrossFit support for ctranslate2 models has been added (PR).

Usage

python3 NeMo-Curator/examples/ct2_trasnlation_example.py --input-data-dir <inp-dir> --output-data-dir <out-dir> --ct2-model-path <ct2-model-dir> --files-per-partition 1 --input-text-field indic_proc_text --tgt-lang mar_Deva
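
For reviewers who want to see how the flags in the command above fit together, here is a minimal sketch of an argument parser that accepts them. It is not the code in this PR: only the flag names come from the command, while the types, defaults, help strings, and the `build_parser` helper are assumptions, and the directory paths in the example are hypothetical placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the usage command.

    Flag names mirror the command; types, defaults, and help text are
    assumptions made for illustration and may differ from the real script.
    """
    parser = argparse.ArgumentParser(
        description="Illustrative CLI for a ctranslate2 translation example."
    )
    parser.add_argument("--input-data-dir", required=True,
                        help="Directory containing the input files.")
    parser.add_argument("--output-data-dir", required=True,
                        help="Directory where translated files are written.")
    parser.add_argument("--ct2-model-path", required=True,
                        help="Directory of the converted ctranslate2 model.")
    parser.add_argument("--files-per-partition", type=int, default=1,
                        help="Number of input files grouped into one partition.")
    parser.add_argument("--input-text-field", default="text",
                        help="Name of the field holding the text to translate.")
    parser.add_argument("--tgt-lang", required=True,
                        help="Target language code, e.g. mar_Deva.")
    return parser


# Hypothetical paths stand in for <inp-dir>, <out-dir>, and <ct2-model-dir>.
example_argv = [
    "--input-data-dir", "data/in",
    "--output-data-dir", "data/out",
    "--ct2-model-path", "models/ct2",
    "--files-per-partition", "1",
    "--input-text-field", "indic_proc_text",
    "--tgt-lang", "mar_Deva",
]
args = build_parser().parse_args(example_argv)
print(args.input_text_field, args.tgt_lang, args.files_per_partition)
```

argparse converts the dashes in each flag name to underscores, so `--tgt-lang` is read as `args.tgt_lang` and `--files-per-partition` as `args.files_per_partition`.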

Checklist