eliottpark opened 5 months ago
https://github.com/tonyreina/antibody-affinity/blob/main/esmfold_multimer.ipynb
The HuggingFace model doesn't use the ":". I've included a link to my notebook that shows how to do multimer predictions. The hack is to insert a glycine (G) linker sequence between all chains, so that a single sequence joined by Gs is passed to the model. The linker residues are then masked out when producing the PDB file. I suspect the ":" does the same thing under the hood.
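The linker trick described above can be sketched in pure Python. This is a minimal illustration, not code from the linked notebook: `link_chains` and the linker length are hypothetical names/values, and the boolean mask stands in for whatever mechanism the notebook uses to hide linker residues when writing the PDB.

```python
def link_chains(chains, linker_len=25):
    """Join chain sequences with a poly-glycine linker.

    Returns the fused sequence plus a per-residue mask:
    True = real residue, False = linker residue to drop
    when writing the PDB file. linker_len=25 is an assumed
    default, not a value from the notebook.
    """
    linker = "G" * linker_len
    fused = linker.join(chains)
    mask = []
    for i, chain in enumerate(chains):
        if i > 0:
            mask.extend([False] * linker_len)  # hide linker positions
        mask.extend([True] * len(chain))       # keep real residues
    return fused, mask

# Example with a short linker for readability:
seq, mask = link_chains(["MKTAYIAK", "QRQISFVK"], linker_len=3)
# seq == "MKTAYIAKGGGQRQISFVK"; mask is False only at the 3 linker positions
```

The single fused sequence is what gets tokenized and folded; the mask is applied afterwards so the artificial linker never appears in the output structure.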
When attempting to use HuggingFace's ESMFold implementation for multimers with the ':' separator between chain sequences (as suggested by the README), I get a ValueError when submitting the input to the tokenizer. I am okay with using the artificial glycine linker suggested by HuggingFace's tutorial, but I would like clarification on whether the ':' separator approach suggested by this repo's README is valid. Thanks!
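Until the ':' question is clarified, a preprocessing step can bridge the two conventions. This is a sketch under the assumption that the linker hack is acceptable: `preprocess_multimer` is a hypothetical helper that converts a ':'-separated multimer string into the single glycine-linked sequence the HuggingFace tokenizer accepts.

```python
def preprocess_multimer(raw_seq, linker_len=25):
    """Replace ':' chain separators (which the HF ESM tokenizer
    rejects) with a poly-glycine linker before tokenization.
    linker_len=25 is an assumed default, not an official value.
    """
    chains = raw_seq.split(":")
    return ("G" * linker_len).join(chains)

# Example with a short linker for readability:
fused = preprocess_multimer("MKTAYIAK:QRQISFVK", linker_len=3)
# fused == "MKTAYIAKGGGQRQISFVK"
```

A single-chain input without ':' passes through unchanged, so the helper can be applied unconditionally before calling the tokenizer.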
Reproduction steps
Install the huggingface transformers module via conda (version 4.24.0)
Expected behavior
I expect the tokenizer to handle the ':' delimiter without raising an error. I tried both tokenizers (AutoTokenizer and EsmTokenizer) and both yielded the same error. As suggested by the ValueError, I tried turning on padding and truncation, but the output remained the same.

Logs
Filepaths in output are truncated for privacy reasons.
Additional context
python version = 3.8
transformers version = 4.24.0