NVIDIA / NeMo-Aligner

Scalable toolkit for efficient model alignment
Apache License 2.0
625 stars 78 forks

fix: correct batch tokenization when sequence exceeds encoder length #352

Closed gwarmstrong closed 1 month ago

gwarmstrong commented 1 month ago

What does this PR do?

Fixes a bug in tokenize_batch that occurs when a sequence tokenizes to more tokens than the specified maximum sequence length.
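
For readers unfamiliar with the failure mode, here is a minimal sketch of batch tokenization that stays well-formed when an input exceeds the length limit. It is not the NeMo-Aligner implementation: the function name tokenize_batch_sketch, its signature, the pad_id default, and the toy tokenizer are all assumptions made for illustration. The idea shown is to truncate each sequence to max_len before padding, and to record which inputs were truncated so the caller can react.

```python
from typing import Callable, List, Tuple


def tokenize_batch_sketch(
    sentences: List[str],
    tokenize: Callable[[str], List[int]],
    max_len: int,
    pad_id: int = 0,
) -> Tuple[List[List[int]], List[int], List[bool]]:
    """Hypothetical helper: tokenize, truncate to max_len, then right-pad."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")

    token_ids = [tokenize(sentence) for sentence in sentences]

    # Flag over-long inputs before truncating so the information is not lost.
    exceeded = [len(ids) > max_len for ids in token_ids]

    # Truncate first; lengths are measured on the truncated sequences.
    truncated = [ids[:max_len] for ids in token_ids]
    lengths = [len(ids) for ids in truncated]

    # Right-pad every sequence to the longest truncated length in the batch.
    width = max(lengths, default=0)
    padded = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    return padded, lengths, exceeded


def toy_tokenize(text: str) -> List[int]:
    # Stand-in tokenizer (assumption): one id per whitespace-separated word.
    return [len(word) for word in text.split()]


batch, lengths, exceeded = tokenize_batch_sketch(
    ["a bb ccc dddd", "ee f"], toy_tokenize, max_len=3
)
```

In this sketch the first input has four tokens and is cut to three, while the second is padded up to the same width, so every row in the batch has the same length and no reported length exceeds max_len.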

terrykong commented 1 month ago

Needs #354

edit: needs https://github.com/NVIDIA/NeMo-Aligner/pull/355