facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

Command for the MSA Transformer in the Variant Prediction example results in a runtime error #458

Open aybarsnazlica opened 1 year ago

aybarsnazlica commented 1 year ago

Bug description
Running the example command for the MSA Transformer in the Variant Prediction example in https://github.com/facebookresearch/esm/tree/main/examples/variant-prediction results in a runtime error: "Received unaligned sequences for input to MSA, all sequence lengths must be equal".

Reproduction steps
python predict.py \
    --model-location esm_msa1b_t12_100M_UR50S \
    --sequence HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRVDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW \
    --dms-input ./data/BLAT_ECOLX_Ranganathan2015.csv \
    --mutation-col mutant \
    --dms-output ./data/BLAT_ECOLX_Ranganathan2015_labeled.csv \
    --offset-idx 24 \
    --scoring-strategy masked-marginals \
    --msa-path ./data/BLAT_ECOLX_1_b0.5.a3m

Logs
Traceback (most recent call last):
  File "predict.py", line 241, in <module>
    main(args)
  File "predict.py", line 167, in main
    batch_labels, batch_strs, batch_tokens = batch_converter(data)
  File "/home/ubuntu/miniconda3/envs/msa_trans/lib/python3.7/site-packages/esm/data.py", line 328, in __call__
    "Received unaligned sequences for input to MSA, all sequence "
RuntimeError: Received unaligned sequences for input to MSA, all sequence lengths must be equal.

nikitos9000 commented 1 year ago

@rmrao Roshan, could you take a look at this please?

Jacoberts commented 1 year ago

I think the failure is happening now due to this code change. variant-prediction now removes lowercase letters from the input MSA, which is consistent with contact_prediction.ipynb. At least one online description of the A3M format says that lowercase letters correspond to insertions, which I believe means they should be removed.

But the lowercase letters in the demo a3m files seem to be used differently. The error shows that removing lowercase letters from BLAT_ECOLX_1_b0.5.a3m yields sequences of different lengths. I wonder which tool was used to create BLAT_ECOLX_1_b0.5.a3m? Should all the MSA reading code be standardized around that tool's definition of an a3m file?
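To check whether a given a3m file is affected, something like the following might help. It is only a minimal sketch, not the code from predict.py or esm/data.py: the helper names (read_a3m, remove_insertions, length_report) and the toy alignment are made up for illustration, and it assumes the convention that lowercase letters, '.' and '*' are dropped before the MSA is batched. If the report contains more than one length, the batch converter would be expected to raise the "unaligned sequences" error.

```python
import string
from collections import Counter

# Assumption: insertions are lowercase letters; '.' and '*' are dropped as well.
_DELETE = str.maketrans("", "", string.ascii_lowercase + ".*")


def read_a3m(text):
    """Parse FASTA-style A3M text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def remove_insertions(sequence):
    """Strip lowercase letters, '.' and '*' from one aligned sequence."""
    return sequence.translate(_DELETE)


def length_report(records):
    """Count sequence lengths after stripping; a single key means aligned."""
    return Counter(len(remove_insertions(seq)) for _, seq in records)


# Toy alignment (hypothetical data, not taken from BLAT_ECOLX_1_b0.5.a3m).
example = """>query
MKTAY
>hit1
MK-aAY
>hit2
mk-AY
"""

report = length_report(read_a3m(example))
print(report)
if len(report) > 1:
    print("Sequences are not aligned after removing lowercase letters.")
```

For the real file, the text could be loaded with open("./data/BLAT_ECOLX_1_b0.5.a3m").read() and passed to read_a3m in place of the toy string. In the toy data, hit2 uses lowercase letters in match columns, so stripping them shortens it relative to the query, which is the kind of mismatch described above.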