facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

OSError: [Errno 36] File name too long #148

Closed agarrubio closed 2 years ago

agarrubio commented 2 years ago

Bug description: extract.py creates strange and very long filenames when run with an MSA model.

Reproduction steps

python extract.py esm_msa1b_t12_100M_UR50S examples/P62593.fasta examples/P62593_reprs/ \
    --repr_layers 12 --include mean --nogpu

Expected behavior: Filenames short enough to be valid on a Linux filesystem.

Logs

(ml) foca% python extract.py esm_msa1b_t12_100M_UR50S examples/P62593.fasta examples/P62593_reprs/ \
    --repr_layers 12 --include mean --nogpu
Read examples/P62593.fasta with 5397 sequences
Processing 1 of 386 batches (1 sequences)
Traceback (most recent call last):
  File "extract.py", line 137, in <module>
    main(args)
  File "extract.py", line 128, in main
    torch.save(
  File "/home/alejandro/mambaforge/envs/ml/lib/python3.8/site-packages/torch/serialization.py", line 369, in save
    with _open_file_like(f, 'wb') as opened_file:
  File "/home/alejandro/mambaforge/envs/ml/lib/python3.8/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/alejandro/mambaforge/envs/ml/lib/python3.8/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
OSError: [Errno 36] File name too long: "examples/P62593_reprs/['0|beta-lactamase_P20P|1.581033423', '1|beta-lactamase_D207D|1.42563125', '2|beta-lactamase_A215A|1.422813331', '3|beta-lactamase_C75C|1.4155315119999998', '4|beta-lactamase_N134N|1.39696596', '5|beta-lactamase_L137L|1.355533136', '6|beta-lactamase_L28L|1.3516090040000002', '7|beta-lactamase_L199L|1.3516090040000002', '8|beta-lactamase_F149F|1.32191175', '9|beta-lactamase_A200A|1.295473865', '10|beta-lactamase_E210E|1.29406548', '11|beta-lactamase_H24H|1.282201552', '12|beta-lactamase_L19L|1.280029666', '13|beta-lactamase_A183A|1.279505214'].pt"

Additional context

OS: Linux Mint 20.1
pwd: /home/alejandro/Downloads/esm
torch: pytorch 1.7.0 py3.8_cuda10.1.243_cudnn7.6.3_0 pytorch

tomsercu commented 2 years ago

Thanks for flagging. This should be handled with a proper error message. The problem here is that this FASTA file does not contain an MSA, so it is not meant as input to the MSA Transformer. Things go awry in the MSABatchConverter: as the traceback shows, the output filename ends up being built from a whole list of labels rather than a single label. We should just raise an error, since the whole extract.py script is written for efficient batched computation of single-sequence language models.
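
For illustration, a guard along these lines could fail early with a readable message. This is only a rough sketch, not the actual fix in the repo: the function names `check_single_sequence_model` and `output_filename`, the "msa"-substring heuristic on the model name, and the 255-byte limit (typical for Linux filesystems such as ext4) are all assumptions made for the example.

```python
# Hypothetical helpers, not part of esm or extract.py.

# Assumed per-component filename limit; 255 bytes is typical on Linux (e.g. ext4).
NAME_MAX = 255


def check_single_sequence_model(model_name: str) -> None:
    """Raise early if an MSA model is passed to a single-sequence pipeline.

    Assumption: MSA model names contain "msa", as in esm_msa1b_t12_100M_UR50S.
    """
    if "msa" in model_name.lower():
        raise ValueError(
            f"{model_name} is an MSA model; this script only supports "
            "single-sequence language models."
        )


def output_filename(label) -> str:
    """Build the '<label>.pt' filename, rejecting labels that cannot be used."""
    # A list of labels here is the symptom seen in the traceback above.
    if not isinstance(label, str):
        raise TypeError(
            f"expected one string label, got {type(label).__name__}"
        )
    name = f"{label}.pt"
    if len(name.encode("utf-8")) > NAME_MAX:
        raise ValueError(f"filename is longer than {NAME_MAX} bytes")
    return name
```

Calling `check_single_sequence_model(...)` right after argument parsing would stop the run before any batches are processed, and `output_filename` would catch the list-valued label even if the first check were bypassed.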