NVIDIA / NeMo-Aligner

Scalable toolkit for efficient model alignment

Apache License 2.0

625 stars 78 forks

fix: correct batch tokenization when sequence exceeds encoder length #352

Closed gwarmstrong closed 1 month ago

gwarmstrong commented 1 month ago

What does this PR do?

Fixes a bug in tokenize_batch that occurs when a sequence contains more tokens than the specified max sequence length.
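
The diff itself is not shown here, so the following is only a minimal sketch of the failure mode described, not the actual tokenize_batch from NeMo-Aligner. The names tokenize_batch_sketch, toy_encode, max_len, and pad_id are hypothetical and chosen for illustration. The assumption is that the problem arises when reported lengths or padding are computed from the untruncated token list, so the sketch truncates first and derives everything else from the truncated result.

```python
from typing import Callable, List, Tuple


def tokenize_batch_sketch(
    sentences: List[str],
    encode: Callable[[str], List[int]],
    max_len: int,
    pad_id: int = 0,
) -> Tuple[List[List[int]], List[int]]:
    """Tokenize, truncate to max_len, then right-pad to the longest kept sequence.

    Hypothetical helper for illustration only; not the NeMo-Aligner API.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    # Truncate first so the recorded lengths can never exceed max_len.
    token_ids = [encode(s)[:max_len] for s in sentences]
    lengths = [len(ids) for ids in token_ids]
    # Pad to the longest truncated sequence, which is at most max_len.
    width = max(lengths, default=0)
    padded = [ids + [pad_id] * (width - len(ids)) for ids in token_ids]
    return padded, lengths


def toy_encode(text: str) -> List[int]:
    # Stand-in tokenizer (an assumption): one id per whitespace-separated word.
    return [len(word) for word in text.split()]


# The first sentence has five tokens, which exceeds max_len=3.
batch, lengths = tokenize_batch_sketch(
    ["a bb ccc dddd eeeee", "hi"], toy_encode, max_len=3
)
print(batch)    # [[1, 2, 3], [2, 0, 0]]
print(lengths)  # [3, 1]
```

Deriving lengths and padding width from the truncated lists keeps every row the same width and every reported length within max_len, which is the property a fix of this kind would need to guarantee.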

Changelog

Please update the CHANGELOG.md under next version with high level changes in this PR.

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

[ ] Make sure you read and followed Contributor guidelines
[ ] Did you write any new necessary tests?
[ ] Did you add or update any necessary documentation? Make sure to also update the NeMo Framework User Guide which contains the tutorials

Checklist when contributing a new algorithm

[ ] Does the trainer resume and restore all model states?
[ ] Does the trainer support all parallelism techniques (PP, TP, DP)?
[ ] Does the trainer support max_steps=-1 and validation?
[ ] Does the trainer only call APIs defined in alignable_interface.py?
[ ] Does the trainer have proper logging?

Additional Information

Related to # (issue)

terrykong commented 1 month ago

~~Needs #354~~

edit: needs https://github.com/NVIDIA/NeMo-Aligner/pull/355