facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins

Unable to load many sequences #95

Closed phiweger closed 3 years ago

phiweger commented 3 years ago

Bug description
I am trying to embed 10k+ protein sequences (about 250 aa residues each, all stored in a single fasta file, foo). This errors out. Running the same command on head -n1000 foo > bar works as expected. Is there a limit here?

Reproduction steps
Try to run the model on 10k sequences.

Expected behavior
Embed them.

Logs

python esm/extract.py esm1b_t33_650M_UR50S foo my_reprs/ --repr_layers 33 --include mean
Traceback (most recent call last):
  File "esm/extract.py", line 144, in <module>
    main(args)
  File "esm/extract.py", line 71, in main
    dataset = FastaBatchedDataset.from_file(args.fasta_file)
  File ".../tmp/fair-esm/esm/esm/data.py", line 52, in from_file
    assert len(set(sequence_labels)) == len(sequence_labels)
AssertionError
tomsercu commented 3 years ago

It fails on the assert that sequence labels (the headers in the fasta file) have no duplicates. The reasoning is that since the label is used to name the output file for the embedding, you don't want to accidentally overwrite one sequence's embedding with another sequence that has the same name. Will update the assert for clarity.
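For reference, a quick way to see whether a fasta file would trip this assert is to count the headers yourself. The sketch below is not part of esm; duplicate_headers and foo.fasta are placeholder names, and it assumes the whole header line (minus the leading >) is treated as the label:

```python
from collections import Counter

def duplicate_headers(fasta_path):
    """Return the header labels that occur more than once in a fasta file."""
    counts = Counter()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # Assumption: the full header line (without ">") acts as the label.
                counts[line[1:].strip()] += 1
    return [label for label, n in counts.items() if n > 1]

dupes = duplicate_headers("foo.fasta")
print(f"{len(dupes)} duplicated header(s):", dupes[:5])
```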

colligant commented 3 years ago

What about saving multiple embeddings into the same .pt file? I'm willing to submit a pull request.
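As a rough illustration of that idea (this is not what extract.py currently does), one could collect every sequence's representation into a dict keyed by its fasta label and save a single file with torch.save. All names, shapes, and file names below are placeholders:

```python
import torch

embed_dim = 1280  # embedding size of esm1b_t33_650M_UR50S

# Stand-in tensors in place of real mean representations from the model.
all_embeddings = {
    "seq_1": torch.randn(embed_dim),
    "seq_2": torch.randn(embed_dim),
}

# One .pt file holding every embedding, keyed by label.
torch.save(all_embeddings, "combined.pt")

# Reading it back: a single load gives access to every embedding by label.
loaded = torch.load("combined.pt")
print(loaded["seq_1"].shape)  # torch.Size([1280])
```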

tomsercu commented 3 years ago

Hmm, interesting. What concretely are you thinking of doing in terms of design, and why? Can it easily be done by relabeling your fasta headers? It seems like either having empty labels OR unique labels in the fasta is a good idea. Having different sequences with the same label just seems like an intrinsic footgun to me.
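On the relabeling route: a small preprocessing pass is usually enough to satisfy the assert. The helper below is a hypothetical sketch, not part of esm; make_headers_unique and the file names are placeholders:

```python
def make_headers_unique(in_path, out_path):
    """Rewrite a fasta file so every header is unique by appending a running index."""
    seen = {}
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                label = line[1:].strip()
                n = seen.get(label, 0)
                seen[label] = n + 1
                # Simple scheme: first occurrence keeps its name, later ones get "_<i>".
                line = f">{label}_{n}\n" if n else f">{label}\n"
            dst.write(line)

make_headers_unique("foo.fasta", "foo_unique.fasta")
```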

tomsercu commented 3 years ago

@colligant given your thumbs up, I'm closing this for now. But if you want to, feel free to reopen and give me a short proposal for what you had in mind!