Open aybarsnazlica opened 1 year ago
@rmrao Roshan, could you take a look at this please?
I think the failure is caused by this code change: variant-prediction now removes lowercase letters from the input MSA, which is consistent with contact_prediction.ipynb. At least one website's description of the A3M format says that lowercase letters correspond to insertions, which I believe means they should be removed.
But the lowercase letters in the demo a3m files seem to be different. This error shows that removing lowercase letters from BLAT_ECOLX_1_b0.5.a3m yields sequences with different lengths. I wonder which tool was used to create BLAT_ECOLX_1_b0.5.a3m? Should all the MSA reading code be standardized around that tool's definition of an a3m file?
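To make the question above concrete, here is a minimal sketch of how one could check whether an a3m file stays aligned once lowercase letters are removed. It assumes the convention described in the comment (lowercase letters and "." mark insertions, "-" marks deletions and is kept). The helper names parse_a3m, strip_insertions and aligned_length_counts are hypothetical and are not part of the esm package; the tiny inline alignment is made-up illustration data, not taken from BLAT_ECOLX_1_b0.5.a3m.

```python
import string
from collections import Counter

# Assumed A3M convention: lowercase letters and '.' are insertion columns
# and get deleted; '-' (a deletion relative to the query) is kept.
_DELETE_INSERTIONS = str.maketrans("", "", string.ascii_lowercase + ".")


def strip_insertions(seq):
    """Return seq with lowercase letters and '.' removed."""
    return seq.translate(_DELETE_INSERTIONS)


def parse_a3m(text):
    """Parse FASTA-like a3m text into a list of (label, sequence) pairs."""
    records = []
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                records.append((label, "".join(chunks)))
            label, chunks = line[1:], []
        elif label is not None:
            # Sequence lines may be wrapped, so collect them per record.
            chunks.append(line)
    if label is not None:
        records.append((label, "".join(chunks)))
    return records


def aligned_length_counts(records):
    """Count how many sequences have each length after stripping insertions."""
    return Counter(len(strip_insertions(seq)) for _, seq in records)


# Made-up example: hit2 carries a lowercase insertion ("fg").
example = """>query
ACDE
>hit1
AC-E
>hit2
ACfgDE
"""
records = parse_a3m(example)
counts = aligned_length_counts(records)
print(counts)

# If lowercase letters mean something else in a given file (for example,
# lowercase residues in the query itself), the lengths no longer agree.
mismatch = aligned_length_counts([("query", "ACde"), ("hit", "ACDE")])
print(mismatch)
```

Running aligned_length_counts on the records of the demo file would show whether more than one length appears after stripping, which is the condition the batch converter reports as unaligned sequences.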
Bug description
Running the example command for the MSA Transformer in the Variant Prediction example in https://github.com/facebookresearch/esm/tree/main/examples/variant-prediction results in the runtime error "Received unaligned sequences for input to MSA, all sequence lengths must be equal".
Reproduction steps
python predict.py \
    --model-location esm_msa1b_t12_100M_UR50S \
    --sequence HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRVDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW \
    --dms-input ./data/BLAT_ECOLX_Ranganathan2015.csv \
    --mutation-col mutant \
    --dms-output ./data/BLAT_ECOLX_Ranganathan2015_labeled.csv \
    --offset-idx 24 \
    --scoring-strategy masked-marginals \
    --msa-path ./data/BLAT_ECOLX_1_b0.5.a3m
Logs
Traceback (most recent call last):
  File "predict.py", line 241, in <module>
    main(args)
  File "predict.py", line 167, in main
    batch_labels, batch_strs, batch_tokens = batch_converter(data)
  File "/home/ubuntu/miniconda3/envs/msa_trans/lib/python3.7/site-packages/esm/data.py", line 328, in __call__
    "Received unaligned sequences for input to MSA, all sequence "
RuntimeError: Received unaligned sequences for input to MSA, all sequence lengths must be equal.