facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins

ESMFold for multimer fails when using HuggingFace installation #656

Open eliottpark opened 5 months ago

eliottpark commented 5 months ago

When attempting to use HuggingFace's ESMFold implementation for multimers with the ':' separator between chain sequences (as suggested in the README), I get a ValueError when passing the sequence to the tokenizer. I am fine with using the artificial glycine linker suggested by HuggingFace's tutorial, but I would like clarification on whether the ':' separator approach described in this repo's README is valid for the HuggingFace port. Thanks!
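For reference, the ':'-separated input I mean is the native fair-esm interface shown in this repo's README; below is a minimal sketch of that route (placeholder sequences for illustration, GPU assumed), since my question is whether the same syntax is meant to carry over to the HuggingFace port:

import torch
import esm

model = esm.pretrained.esmfold_v1()  # Native fair-esm ESMFold (downloads weights on first use)
model = model.eval().cuda()

chain1_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # Placeholder chains -- substitute real ones
chain2_seq = "MADQLTEEQIAEFKEAFSLFDKDGDGTITTKE"
multimer = chain1_seq + ":" + chain2_seq  # Chains separated by ':' as described in the README

with torch.no_grad():
    pdb_str = model.infer_pdb(multimer)  # Returns the predicted structure as a PDB string

with open("complex.pdb", "w") as f:
    f.write(pdb_str)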

Reproduction steps

Install the HuggingFace transformers package via conda (version 4.24.0), then run:

from transformers import EsmTokenizer, AutoTokenizer, EsmForProteinFolding

tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1") # Download tokenizer
# OR
tokenizer = EsmTokenizer.from_pretrained("facebook/esmfold_v1")  # Download alternative tokenizer

model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1")  # Download model

seq = chain1_seq + ":" + chain2_seq  # Concatenate two plain amino-acid strings with the suggested ':' delimiter

inputs = tokenizer([seq], return_tensors="pt", add_special_tokens=False)  # Tokenize seq -- raises the error below

Expected behavior

I expect the tokenizer to handle the ':' delimiter without raising an error. I tried both tokenizers (AutoTokenizer and EsmTokenizer) and both raised the same error. As suggested by the ValueError message, I also tried turning on padding and truncation, but the output remained the same.

Logs

File paths in the output are truncated for privacy reasons.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File ~/.../site-packages/transformers/tokenization_utils_base.py:715, in BatchEncoding.convert_to_tensors(self, tensor_type, prepend_batch_axis)
    714 if not is_tensor(value):
--> 715     tensor = as_tensor(value)
    717     # Removing this for now in favor of controlling the shape with `prepend_batch_axis`
    718     # # at-least2d
    719     # if tensor.ndim > 2:
    720     #     tensor = tensor.squeeze(0)
    721     # elif tensor.ndim < 2:
    722     #     tensor = tensor[None, :]

RuntimeError: Could not infer dtype of NoneType

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
.../src/embed.ipynb Cell 139 line 1
----> 1 inputs = tokenizer([seq], return_tensors="pt", add_special_tokens=False)

File .../site-packages/transformers/tokenization_utils_base.py:2488, in PreTrainedTokenizerBase.__call__(self, text, text_pair, text_target, text_pair_target, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2486     if not self._in_target_context_manager:
   2487         self._switch_to_input_mode()
-> 2488     encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
   2489 if text_target is not None:
...
    735             " expected)."
    736         )
    738 return self

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

Additional context

Python version = 3.8, transformers version = 4.24.0

tonyreina commented 3 months ago

https://github.com/tonyreina/antibody-affinity/blob/main/esmfold_multimer.ipynb

The HuggingFace model doesn't use the ':' separator. I've included a link to my notebook above, which shows how to do multimer predictions. The trick is to insert a glycine ('G') linker between all chains, so a single linker-joined sequence is passed to the model. The linker residues are then masked out when producing the PDB file. I suspect the ':' separator does the same thing under the hood.
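A minimal sketch of that glycine-linker route with the HuggingFace classes from the report above. The 25-residue poly-G linker, the 512 position-id offset, and the atom37_atom_exists masking follow the HuggingFace protein-folding tutorial; the placeholder sequences and the exact output-field handling here are assumptions, not a verified recipe:

import torch
from transformers import AutoTokenizer, EsmForProteinFolding

tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1")
model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1")
model.eval()

chain1_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # Placeholder chains -- substitute real ones
chain2_seq = "MADQLTEEQIAEFKEAFSLFDKDGDGTITTKE"

linker = "G" * 25                       # Poly-glycine linker between the chains
seq = chain1_seq + linker + chain2_seq  # Single linker-joined sequence

inputs = tokenizer([seq], return_tensors="pt", add_special_tokens=False)

# Offset the position ids after the first chain so the model treats the two
# chains as if they were far apart in sequence space rather than contiguous.
position_ids = torch.arange(len(seq), dtype=torch.long)
position_ids[len(chain1_seq) + len(linker):] += 512
inputs["position_ids"] = position_ids.unsqueeze(0)

with torch.no_grad():
    outputs = model(**inputs)

# Mask out the linker residues so they are dropped when producing the PDB
# (assumes the output exposes atom37_atom_exists with shape [batch, length, 37],
# as in the original ESMFold output dict).
linker_mask = torch.tensor(
    [1] * len(chain1_seq) + [0] * len(linker) + [1] * len(chain2_seq)
)[None, :, None]
outputs["atom37_atom_exists"] = outputs["atom37_atom_exists"] * linker_mask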