OpenBioML / protein-lm-scaling


Fixing sequence clipping bug in tokenizer #46

Closed · justin-barton closed this 6 months ago

justin-barton commented 9 months ago

This fixes a bug in the tokenizer's batch encoder where combining add_special_tokens=True with return_tensors=True truncates sequences by two residues when max_sequence_length is not specified.

As described in https://github.com/OpenBioML/protein-lm-scaling/issues/45
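For context, a minimal reproduction sketch of the reported behaviour. The tokenizer class name, import path, and batch_encode method name are assumptions inferred from this PR's description ("tokenizer batch encoder"); only the argument names add_special_tokens, return_tensors, and max_sequence_length come from the report itself.

```python
# Hypothetical reproduction sketch; the class name, import path, and
# batch_encode method name are assumptions based on this PR's description.
from protein_lm.tokenizer import Tokenizer  # assumed import

tokenizer = Tokenizer()  # assumed constructor
seq = "MKTAYIAKQR"

# Reported bug: add_special_tokens=True + return_tensors=True with no
# max_sequence_length drops the last two residues of each sequence.
encoded = tokenizer.batch_encode(
    [seq],
    add_special_tokens=True,
    return_tensors=True,
)[0]

# Expected: the full sequence plus the two special tokens; before the fix,
# the encoded output was reportedly two positions shorter than that.
print(len(seq), len(encoded))
```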

jamaliki commented 9 months ago

Ok, good catch! Can you add a test for this? We have the tokenizer tests in tests/test_tokenizer.py.
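For reference, a rough sketch of what such a test in tests/test_tokenizer.py might look like. The Tokenizer class, its import path, and the batch_encode signature are assumptions, not confirmed APIs of this repository; the assertion simply encodes the expected no-clipping behaviour described in the PR.

```python
# Hypothetical test sketch for tests/test_tokenizer.py; class name, import
# path, and method signature are assumptions based on this PR's description.
from protein_lm.tokenizer import Tokenizer  # assumed import


def test_batch_encode_no_clipping_with_special_tokens_and_tensors():
    tokenizer = Tokenizer()  # assumed constructor
    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"

    # The bug: add_special_tokens=True + return_tensors=True with no
    # max_sequence_length truncated the sequence by two residues.
    encoded = tokenizer.batch_encode(
        [seq],
        add_special_tokens=True,
        return_tensors=True,
    )[0]

    # Expect the full sequence plus the two special tokens (e.g. BOS/EOS),
    # i.e. no residues clipped from the end.
    assert len(encoded) == len(seq) + 2
```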