Closed justin-barton closed 6 months ago
Fixes a bug in the tokenizer batch encoder where combining add_special_tokens=True with return_tensors=True truncates every sequence by two residues when max_sequence_length is not specified.
As described in https://github.com/OpenBioML/protein-lm-scaling/issues/45
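The failure mode can be illustrated with a self-contained sketch (the function name, special-token strings, and signature below are illustrative stand-ins, not the project's actual API): when the length cap is derived from the raw sequences, the CLS/EOS tokens added afterwards push the last two residues past the cap.

```python
# Minimal sketch of the bug and its fix, assuming a simplified batch encoder.
# batch_encode, CLS, and EOS are hypothetical names for illustration only.

CLS, EOS = "<cls>", "<eos>"

def batch_encode(sequences, add_special_tokens=True, max_sequence_length=None):
    tokenized = [list(seq) for seq in sequences]
    if max_sequence_length is None:
        # Bug: the cap was computed from the raw sequences, so the two
        # special tokens appended below pushed the final residues (or the
        # EOS token itself) past the cap when sequences were padded/stacked.
        max_sequence_length = max(len(t) for t in tokenized)
        if add_special_tokens:
            # Fix: leave room for the CLS and EOS tokens in the cap.
            max_sequence_length += 2
    if add_special_tokens:
        tokenized = [[CLS] + t + [EOS] for t in tokenized]
    return [t[:max_sequence_length] for t in tokenized]

batch = batch_encode(["MKTAYIA", "MKV"])
assert batch[0][-1] == EOS                    # EOS survives the cap
assert "".join(batch[0][1:-1]) == "MKTAYIA"   # no residues are dropped
```

Without the `+= 2` adjustment, the longest sequence would come back as `[CLS] + residues[:-1]`, losing its last residue and the EOS token, which matches the two-token truncation reported in the issue.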
Ok, good catch! Can you add a test for this? The tokenizer tests live in tests/test_tokenizer.py