facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

the 41 deep mutational scanning datasets #153

Closed lzhangUT closed 2 years ago

lzhangUT commented 2 years ago

Hi @joshim5, thanks for your work. I was reading your paper 'Language models enable zero-shot prediction of the effects of mutations on protein function' to understand the model, and I would also like to apply the datasets to my modified model. I searched the GitHub repository but couldn't find where the 41 deep mutational scanning datasets (as in Figure 3) are located. Can you point me to them, or to any links that will give me access? Thanks a lot.


joshim5 commented 2 years ago

Thanks for your interest! They come from the following paper: https://www.nature.com/articles/s41592-018-0138-4

Check out the supplementary material in that paper to get the DMS datasets.
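
Once downloaded, a minimal sketch of loading one of the supplementary tables with pandas (the file name and sheet index here are placeholders; check the actual supplement for the real layout):

```python
import pandas as pd

# Hypothetical file name -- substitute the actual supplementary
# spreadsheet from Riesselman et al. (2018).
dms = pd.read_excel("supplementary_dms_data.xlsx", sheet_name=0)

# DMS tables typically pair a mutant identifier (e.g. "A24G") with a
# measured effect score; inspect the columns before modelling.
print(dms.columns.tolist())
print(dms.head())
```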

Hope that helps!

lzhangUT commented 2 years ago

Thank you very much!

lzhangUT commented 2 years ago

@joshim5 Hi, I also have another question. I am trying to generate MSAs following your steps. You mentioned that "we generate new MSAs using the EVMutation methodology". In EVMutation, I believe they use jackhmmer to build the MSA, but in your examples here you used HHblits, so I am not sure which one to follow. I am asking because when I use jackhmmer, I get a Stockholm file in which the aligned target sequence contains a lot of gaps. So:

1) How did you deal with the gaps when you were using jackhmmer for the MSA?
2) Do you have code to postprocess the Stockholm file into the MSA format needed for the MSA Transformer?
3) Or did you use HHblits for all of them? I tried your examples, and the MSAs they generate do not have gaps.

Thanks!

joshim5 commented 2 years ago

We used jackhmmer for the MSA experiments. Which HHblits code are you referring to?

We preserve the gaps in the MSA. All MSA-based methods we tested in the paper (including MSA Transformer) can handle the MSA gaps correctly.
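
For concreteness, a minimal sketch of feeding a gapped MSA through the public esm API (the sequences below are toy placeholders, not data from the paper):

```python
import torch
import esm

# Load the pretrained MSA Transformer and its alphabet; the alphabet
# includes '-' as a regular token, so gaps are tokenized as-is.
model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

# A toy MSA: every row must have the same length; '-' marks gaps.
msa = [
    ("query",    "MKTVRQ-ERLK"),
    ("homolog1", "MKTVKQ-DRLK"),
    ("homolog2", "MRTV--AERLK"),
]

# The MSA batch converter takes a list of MSAs.
labels, strs, tokens = batch_converter([msa])

with torch.no_grad():
    out = model(tokens, repr_layers=[12])

# Shape: (1, num_seqs, seq_len + 1, embed_dim); +1 for the prepended BOS token.
print(out["representations"][12].shape)
```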

lzhangUT commented 2 years ago

Thank you very much for your quick response!

I am referring to the examples in your GitHub repository: https://github.com/facebookresearch/esm/tree/main/examples

[screenshot of the examples directory]

It seems like you used HHblits to generate the MSAs for these examples.

joshim5 commented 2 years ago

Ah, these provide an example of the MSAs used for pre-training of the MSA Transformer.

For the ESM-1v work, we used JackHMMer to stay consistent with EVMutation and DeepSequence.
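
As a rough sketch of that step (placeholder paths and parameters, not the exact pipeline settings):

```python
import subprocess

# Placeholder paths -- substitute your own query FASTA and a local
# copy of the UniRef90 FASTA database.
query = "query.fasta"
database = "uniref90.fasta"
out_sto = "alignment.sto"

# Standard jackhmmer (HMMER) flags:
#   -N  maximum number of search iterations
#   -A  write the multiple alignment of all hits, in Stockholm format
subprocess.run(
    ["jackhmmer", "-N", "5", "--cpu", "8", "-A", out_sto, query, database],
    check=True,
)
```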

lzhangUT commented 2 years ago

Hi @joshim5, I really appreciate your response! That clarifies a lot. I have a Stockholm output file from jackhmmer for my target sequence (named HpBKT) searched against UniRef90, and I just want to make sure of a few things:

1) You were saying the methods used in the paper will handle the gaps correctly by themselves. For example, my Stockholm file looks like this:

[screenshot of the Stockholm file]

When it handles the gaps in the target sequence (HpBKT), does 'handle' mean removing the gaps in the target sequence and the corresponding columns (same-position gaps) in the aligned sequences? Or does it mean that all the gaps, including the ones in the target sequence, will simply be tokenized correctly?

2) In my file, each alignment is split into 4 interleaved blocks (the target sequence is 320 aa; with gaps it becomes 700+ characters, split into blocks of 200, 200, 200, and 100+ characters). When the Stockholm file is read, will these blocks be concatenated automatically, i.e. handled correctly by the model? I tested it: after removing the gaps and pasting the 4 blocks together, the result IS the raw 320-aa target sequence.

[screenshots of the four interleaved alignment blocks]

3) Or, even with jackhmmer, were you outputting a different file format rather than Stockholm? I really appreciate your help! After these, I can move on with the modelling.
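
For reference, standard alignment parsers join interleaved Stockholm blocks automatically, so each record comes back as one full-length gapped sequence. A minimal sketch with Biopython, using the placeholder file name alignment.sto:

```python
from Bio import AlignIO

# Biopython's Stockholm parser concatenates the interleaved blocks,
# so each record holds the complete gapped sequence.
alignment = AlignIO.read("alignment.sto", "stockholm")

for record in alignment:
    print(record.id, len(record.seq))

# Sanity check: stripping gap characters from the query row should
# recover the raw 320-aa target sequence.
query = str(alignment[0].seq).replace("-", "").replace(".", "")
print(len(query))
```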

rmrao commented 2 years ago

Hi @lzhangUT! Sorry for the delay in responding, Josh is actually no longer with the team here. I have the exact code used to generate our MSAs for the ESM-1v paper here, along with a dockerized version of DeepSequence: https://github.com/rmrao/DeepSequence/blob/master/align.py.

If you'd like to use DeepSequence yourself, you can directly use this code. DeepSequence expects a .a3m file as input, not Stockholm, so you would have to convert the format. Gaps in the reference sequence are simply removed.
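
For illustration, a rough sketch of that conversion (not the linked script itself): read the Stockholm alignment, drop every column where the reference (first) sequence has a gap, and write the rest out FASTA-style. Note that full A3M would keep insertions as lowercase letters; this simplified version discards them.

```python
from Bio import AlignIO

# Read the jackhmmer Stockholm alignment (placeholder file name).
alignment = AlignIO.read("alignment.sto", "stockholm")

# Assume the reference/query is the first record; keep only the
# columns where it has a real residue.
ref = str(alignment[0].seq)
keep = [i for i, c in enumerate(ref) if c not in "-."]

with open("alignment.a3m", "w") as handle:
    for record in alignment:
        seq = str(record.seq)
        trimmed = "".join(seq[i] for i in keep)
        handle.write(f">{record.id}\n{trimmed}\n")
```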

lzhangUT commented 2 years ago

@rmrao Thank you very much for the response. I really appreciate it.