facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

index out of bounds during zero-shot with msa1b #649

Open Maxwell-downtown opened 5 months ago

Maxwell-downtown commented 5 months ago

When running zero-shot variant prediction with msa1b using the code provided in examples/variant-prediction, I came across the following error:

    File "predict.py", line 180, in <lambda>
      lambda row: label_row(
    File "predict.py", line 114, in label_row
      score = token_probs[0, 1 + idx, mt_encoded] - token_probs[0, 1 + idx, wt_encoded]
    IndexError: index 216 is out of bounds for dimension 1 with size 216

The command I used is as follows:

    python predict.py \
        --model-location esm_msa1b_t12_100M_UR50S \
        --sequence MSIQHFRVALIPFFAAFCLPVFAHPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRVDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW \
        --dms-input ./data/BLAT_ECOLX_Ranganathan2015.csv \
        --mutation-col mutant \
        --dms-output ./data/BLAT_ECOLX_Ranganathan2015_labeled.csv \
        --offset-idx 1 \
        --scoring-strategy masked-marginals \
        --msa-path ./data/MSA/trial_BLAT.a2m

I use the entire BLAT_ECOLX sequence (286 aa) as the input sequence, and all entries in my .a2m file have the same length. I also set --offset-idx to 1, but it doesn't seem to help. I printed the dimensions of batch_tokens and token_probs in predict.py and found that the size I take to be the protein sequence length is 216, whereas it should be 286 in this case. I also tested other proteins of different lengths, but the dimensions never match. Am I misunderstanding the dimensions of token_probs?

Besides, running the demonstration code under examples/variant-prediction with the data provided in that directory results in the following error:

    RuntimeError: Received unaligned sequences for input to MSA, all sequence lengths must be equal.
Command:

    python predict.py \
        --model-location esm_msa1b_t12_100M_UR50S \
        --sequence HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRVDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW \
        --dms-input ./data/BLAT_ECOLX_Ranganathan2015.csv \
        --mutation-col mutant \
        --dms-output ./data/BLAT_ECOLX_Ranganathan2015_labeled.csv \
        --offset-idx 24 \
        --scoring-strategy masked-marginals \
        --msa-path ./data/BLAT_ECOLX_1_b0.5.a3m
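To narrow down both errors, a small standalone check along these lines might help. It is only a sketch, not code from predict.py, and it rests on one assumption that I have not confirmed here: that the MSA reader drops lowercase letters, `.` and `*` (the usual a2m/a3m convention for insertion columns) before tokenizing. If that holds, a row's effective width can be smaller than its raw residue count, which would be one possible explanation for seeing 216 instead of 286. The helper names (`strip_insertions`, `msa_widths`, `mutation_in_bounds`) are hypothetical and exist only for this illustration.

```python
import string

# Assumption: lowercase letters, '.' and '*' are insertion/padding characters
# (a2m/a3m convention) and are removed before the MSA is tokenized.
_DELETE = str.maketrans("", "", string.ascii_lowercase + ".*")


def strip_insertions(aligned):
    """Return an aligned row with insertion characters removed."""
    return aligned.translate(_DELETE)


def parse_fasta_like(text):
    """Parse FASTA-style text into a list of (name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def msa_widths(text):
    """Set of row widths after stripping insertions; one element means aligned."""
    return {len(strip_insertions(seq)) for _, seq in parse_fasta_like(text)}


def mutation_in_bounds(mutant, offset_idx, n_positions):
    """Check that a mutant such as 'H24A' indexes inside the token dimension.

    Mirrors the expression in the traceback: the lookup uses 1 + idx,
    where idx = position - offset_idx, against a dimension of size n_positions.
    """
    idx = int(mutant[1:-1]) - offset_idx
    return 0 <= 1 + idx < n_positions


if __name__ == "__main__":
    # Toy alignment: raw lengths differ (5 vs 4), stripped widths agree.
    toy = ">query\nMKvLA\n>hit\nM-LA\n"
    print(msa_widths(toy))
    print(mutation_in_bounds("A216C", offset_idx=1, n_positions=216))
```

Running `msa_widths` on the contents of the .a2m or .a3m file shows whether the rows are really the same width after stripping. Comparing that width with the length of `--sequence` shows whether the mutation positions in the CSV can still be indexed.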