facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

README Example fails with IndexError #135

Closed kiramt closed 2 years ago

kiramt commented 2 years ago

Bug description

When running the example in the README, the code fails with an IndexError.

Reproduction steps

I'm running the code in a Docker container based on the pytorch/pytorch:1.9.0-cuda11.1-cudnn8-devel image from Docker Hub, with pip and Python 3.7.10 installed. Run the container and install fair-esm using pip:

kira@49cc5ed3af0e:/code/esm$ pip install fair-esm
Defaulting to user installation because normal site-packages is not writeable
Collecting fair-esm
  Downloading fair_esm-0.4.0-py3-none-any.whl (37 kB)
Installing collected packages: fair-esm
Successfully installed fair-esm-0.4.0

Run python and follow the code steps in the README:

kira@49cc5ed3af0e:/code/esm$ python3
Python 3.7.10 (default, Feb 26 2021, 18:47:35)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import esm
>>>
>>> # Load ESM-1b model
>>> model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt" to /home/kira/.cache/torch/hub/checkpoints/esm1b_t33_650M_UR50S.pt
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/regression/esm1b_t33_650M_UR50S-contact-regression.pt" to /home/kira/.cache/torch/hub/checkpoints/esm1b_t33_650M_UR50S-contact-regression.pt
>>> batch_converter = alphabet.get_batch_converter()
>>>
>>> # Prepare data (first 2 sequences from ESMStructuralSplitDataset superfamily / 4)
>>> data = [
...     ("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"),
...     ("protein2", "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
...     ("protein2 with mask","KALTARQQEVFDLIRD<mask>ISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
...     ("protein3",  "K A <mask> I S Q"),
... ]
>>> batch_labels, batch_strs, batch_tokens = batch_converter(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/code/esm/esm/data.py", line 285, in __call__
    tokens[i, len(seq_str) + int(self.alphabet.prepend_bos)] = self.alphabet.eos_idx
IndexError: index 77 is out of bounds for dimension 1 with size 73

Expected behavior

Expected the code to run without error.
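Concretely, once the converter indexes by token count rather than string length, the call should succeed and return a padded token tensor. A rough sketch of the expected result (the shape is an assumption based on the four sequences above, whose longest entry encodes to 71 tokens, plus BOS and EOS for ESM-1b):

>>> batch_labels, batch_strs, batch_tokens = batch_converter(data)
>>> batch_tokens.shape  # 4 sequences, padded to 71 tokens + BOS + EOS
torch.Size([4, 73])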

Logs

The command line output is the traceback shown above.



liujas000 commented 2 years ago

Hi @kiramt, thank you for the detailed repro instructions! The issue is that tokens[i, len(seq_str) + int(self.alphabet.prepend_bos)] = self.alphabet.eos_idx should instead be tokens[i, len(seq_encoded) + int(self.alphabet.prepend_bos)] = self.alphabet.eos_idx. A fix and an appropriate unit test will be out soon.
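For anyone hitting this before the fix lands, the mismatch is easy to see by comparing the raw string length with the encoded token count. A minimal sketch, reusing the alphabet loaded in the README example and assuming Alphabet.encode is available (it is what the batch converter uses internally to build seq_encoded):

>>> seq_str = "KALTARQQEVFDLIRD<mask>ISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"
>>> len(seq_str)        # counts every character of "<mask>"
76
>>> seq_encoded = alphabet.encode(seq_str)
>>> len(seq_encoded)    # "<mask>" is a single token
71

The buggy line then writes the EOS token at index 76 + 1 (for the prepended BOS) = 77, while the token tensor was allocated with width 71 + 2 = 73, which matches the IndexError in the traceback above.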

liujas000 commented 2 years ago

Fixed here