Closed phiweger closed 3 years ago
It fails on the assert that sequence labels (the headers in the fasta file) have no duplicates. The reasoning is that since the label is used to name the output file for the embedding, you don't want to accidentally overwrite one embedding with another because two different sequences share a name. Will update the assert for clarity.
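For anyone hitting the same assert, a quick pre-check of the fasta file can show which labels collide. This is only a minimal sketch, not the check the tool itself runs: the helper name find_duplicate_labels is hypothetical, and it assumes the label is the whole header line after ">".

```python
from collections import Counter


def find_duplicate_labels(fasta_lines):
    """Return the labels (header text after '>') that occur more than once.

    Hypothetical helper; assumes the full header line is the label.
    """
    labels = [line[1:].strip() for line in fasta_lines if line.startswith(">")]
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n > 1)


# Small in-memory example; for a real file, pass an open file handle instead.
example = [">seq1", "MKV", ">seq2", "MLA", ">seq1", "MKT"]
duplicates = find_duplicate_labels(example)
print(duplicates)
```

An empty result means every header is unique under that assumption.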
What about saving multiple embeddings into the same .pt file? I'm willing to submit a pull request.
Hmm, interesting. What concretely are you thinking of doing in terms of design, and why? Can it easily be done by relabeling your fasta headers? Having either empty labels OR unique labels in the fasta seems like a good idea. Having different sequences with the same label just seems like an intrinsic footgun to me.
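To illustrate the relabeling route, here is a rough sketch that makes repeated headers unique by appending a numeric suffix. The function name make_labels_unique and the "_1", "_2" suffix scheme are assumptions for illustration, not something the tool provides.

```python
def make_labels_unique(fasta_lines):
    """Rewrite fasta headers so that no label appears twice.

    Hypothetical helper: the first occurrence of a label is kept as-is,
    later occurrences get "_1", "_2", ... appended until the result is unused.
    Sequence lines are passed through unchanged.
    """
    used = set()
    out = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            base = line[1:].strip()
            label, n = base, 0
            # Keep bumping the suffix until the label is not taken yet.
            while label in used:
                n += 1
                label = f"{base}_{n}"
            used.add(label)
            out.append(">" + label)
        else:
            out.append(line)
    return out


records = [">a", "MKV", ">a", "MLA", ">a_1", "MKT"]
relabeled = make_labels_unique(records)
print("\n".join(relabeled))
```

The suffix loop checks against every label already emitted, so a header that happens to look like a generated one (">a_1" above) is also renamed rather than silently colliding.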
@colligant given your thumbs up, I'm closing this for now. But if you want to, feel free to reopen and give me a short proposal for what you had in mind!
Bug description I am trying to embed 10k+ protein sequences (about 250 amino acid residues each, all stored in a single fasta file, foo). This errors out. Running the same command on bar, created with head -n1000 foo > bar, works as expected. Is there a limit here?
Reproduction steps Try to run the model on 10k sequences.
Expected behavior Embed them.
Logs