OATML-Markslab / ProteinNPT

Official code repository for the paper "ProteinNPT: Improving Protein Property Prediction and Design with Non-Parametric Transformers"
MIT License

Error loading main sequence embeddings #12

Closed · dc2211 closed this issue 1 month ago

dc2211 commented 4 months ago

Hi everyone,

I'm trying to train a new model with my own data. Steps 1 (computing embeddings) and 2 (zero-shot predictions) run perfectly, but step 3 complains about missing embeddings:

`Error loading main sequence embeddings: At least one embedding was missing`

Any suggestions? Many thanks!

pascalnotin commented 4 months ago

Hi @dc2211 - could you please confirm that an .h5 file for your new assay was properly created and saved on disk at the following location: $DATA_PATH/data/embeddings/MSA_Transformer, where $DATA_PATH is the location where you downloaded and unzipped the ProteinNPT_data as per step 1 of the setup process?
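
For example, a quick sanity check along these lines would confirm the file exists and is readable (a minimal sketch; the assay file name is a placeholder, and the internal layout depends on the ProteinNPT version):

```python
import os
import h5py

# Placeholder paths for illustration -- substitute your own $DATA_PATH and assay name
data_path = os.environ["DATA_PATH"]
embeddings_file = os.path.join(data_path, "data", "embeddings", "MSA_Transformer", "my_assay.h5")

assert os.path.exists(embeddings_file), f"Missing embeddings file: {embeddings_file}"
with h5py.File(embeddings_file, "r") as f:
    print("Top-level keys:", list(f.keys())[:10])
```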

dc2211 commented 4 months ago

Hi @pascalnotin - yes, the embeddings (.h5 file) are there. Here is the output from step 3:

```
####################################################################################################
Step3: Training the ProteinNPT model on the protdms assay
####################################################################################################
Location used for target fitness if /home/ProteinNPT
Embeddings folder: /home/ProteinNPT/ProteinNPT_data/ESM/MSA_Transformer/esm_msa1b_t12_100M_UR50S.pt
We want to predict 1 target(s): fitness
Training model for assay: protdms, where the test_fold index is: 0
Effective batch size is 425
Model name: ProteinNPT_protdms_fitness_fold_random_5_embed_MSA_Transformer_head_CNN_aug_none_froz_True_drop_0.0_val_False_base_pipeline_fold-0
Sequence embeddings: /home/ProteinNPT/ProteinNPT_data/data/embeddings/MSA_Transformer/protdms.h5
Target processing train set: {'fitness': {'mean': 0.5739534883720929, 'std': 0.7322810476315698, 'P95': 2.07}}
Number of sequences in MSA (before preprocessing): 98
Calculating proportion of gaps
Proportion of sequences dropped due to fraction of gaps: 0.0%
Proportion of non-focus columns removed: 0.0%
Number of sequences after preprocessing: 95
One-hot encoding sequences
Data Shape = (95, 305, 20)
Loading sequence weights from disk
Neff = 82.33333333333333
Number of sequences:  95
Neff: 82.33333333333333
Name of focus_seq: >5980390
Check sum weights MSA: 81.83333333333333
tmp: Starting training
Model device: cuda:0
  0%|                                          | 0/10000 [00:00<?, ?it/s]
Error loading main sequence embeddings: At least one embedding was missing
```
pascalnotin commented 4 months ago

Hi @dc2211 -- is your assay a substitution assay (i.e., all mutated sequences have the same length) or does it also contain indels? Are there any gaps ('-') in the assay sequences? Could you print a few lines from it?
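
For example, something along these lines would confirm it (a minimal sketch, assuming a standard DMS CSV with a mutated_sequence column as in ProteinGym):

```python
import pandas as pd

# Placeholder file name for illustration -- point this at your assay CSV
dms = pd.read_csv("my_assay.csv")

lengths = dms["mutated_sequence"].str.len()
print("Unique sequence lengths:", sorted(lengths.unique()))  # a single value for a pure substitution assay
print("Sequences with gap characters:", dms["mutated_sequence"].str.contains("-", regex=False).sum())  # should be 0
```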

dc2211 commented 4 months ago

Hi @pascalnotin,

All sequences are the same length. Actually, this error started appearing after I rebased on the changes from #11.

I tried with an old local install of the repo that I have, and it ran just fine with the same data. Is there anything in particular I can print here to help you track down the source of the error? Also, a previous training run that used to work now fails. Thanks!

dc2211 commented 4 months ago

Also, no gaps and no indels, just substitutions.

pascalnotin commented 4 months ago

The rebase did not change anything in the code that would trigger an issue there. Have you tried deleting the current version of the embedding file (protdms.h5) and recreating it from scratch (or replacing it with the version that worked without issue in your experiment with the older local copy of the repo)?

dc2211 commented 4 months ago

I tried both, and the problem still persists, with `Error loading main sequence embeddings: At least one embedding was missing`. I made sure I was using the correct sequence, with the right length, linked to the correct MSA and DMS data. Many thanks.

dc2211 commented 4 months ago

I will close this issue because I ended up combining scripts from both local installations, and the pipeline now works end to end (that is, computing embeddings, computing zero-shot predictions, training, and evaluating). Thanks for all the help.

pascalnotin commented 4 months ago

@dc2211 - great to hear the issue was resolved at your end! Could you please share a bit more about the setting that worked for you? Running the pipeline works well on our end on ProteinGym assays, but we would like to make sure the codebase extends to as many settings as possible.

pascalnotin commented 3 months ago

Hi @dc2211 -- I just pushed an updated version which may resolve the issue you were encountering earlier (assuming I guessed correctly what the issue was). Please update the package to v1.5 first. Let me know if that works for you!

dc2211 commented 3 months ago

Hi @pascalnotin -- thanks for the updated version. It seems to work fine for training, but I'm still confused about what is needed to make new predictions with the eval.sh process, and about the error I'm getting. I tried the following:

Both datasets correspond to similar proteins, trained using different MSAs. Also, I understand there is a check on the lengths of indices_retrieved_embedding and batch["mutant_mutated_seq_pairs"] in data_processing.py, but I do not understand why this changes when I cross datasets X and Y.

Should it be possible to mix X and Y, if both sets have the same sequence length? Please correct me if I'm wrong. I hope this makes sense! Thanks!

P.S.: I also noticed a problem with the zero-shot predictions: if I do a first run without them, I get an edited DMS file with the new mutant column and the dummy mutant_x entries. If I then start again, this time requesting zero-shot predictions, and the DMS file already has the mutant column, I get the error `ValueError: invalid literal for int() with base 10: 'utant'`. Maybe if the column were overwritten when already present and zero-shot predictions are requested, this error could be avoided.
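
Something along these lines could reset the file before rerunning (a rough sketch; the DMS file name is a placeholder, and the mutant_* column pattern is an assumption based on what I see in my edited file):

```python
import pandas as pd

# Placeholder DMS file name for illustration
dms = pd.read_csv("my_assay.csv")

# Drop the 'mutant' column and any dummy 'mutant_*' columns added by the first run,
# so the zero-shot run can recreate them from scratch
cols_to_drop = [c for c in dms.columns if c == "mutant" or c.startswith("mutant_")]
dms = dms.drop(columns=cols_to_drop)
dms.to_csv("my_assay.csv", index=False)
```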

pascalnotin commented 3 months ago

Hi @dc2211 -- you only need to train your model once (e.g., on dataset X), but if you want to predict things on a new dataset (e.g., dataset Y) then you need to embed the sequences for datasets X and Y in the same file. This is because the training sequences (in X) are used at inference to predict sequences in Y (via attention across rows). Since you use two different MSAs for datasets X and Y to get embeddings, this is a bit of an edge case that is not currently supported. The way to resolve it is to concatenate the embeddings for both datasets in a single file and then use the eval script. Hope that helps!
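
For example, if each file stores one entry per sequence key, a rough merge sketch could look like this (file names are placeholders, and the one-entry-per-key layout is an assumption to check against your actual files):

```python
import h5py

# Placeholder file names for illustration
files_to_merge = ["embeddings_X.h5", "embeddings_Y.h5"]

with h5py.File("embeddings_X_and_Y.h5", "w") as out:
    for path in files_to_merge:
        with h5py.File(path, "r") as src:
            for key in src.keys():
                if key in out:
                    raise ValueError(f"Duplicate key across files: {key}")
                src.copy(key, out)  # copies the dataset (or group) wholesale
```

If the files instead hold a single stacked array per dataset, you would concatenate along the first axis instead.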

pascalnotin commented 1 month ago

Hi @dc2211 -- closing this issue as I believe the above addressed the concerns. Feel free to re-open as needed. Kind regards, Pascal