Zuricho / ParaFold_dev

ParaFold - under development

create fake pdb70 msa file #1

Open ken83715 opened 1 year ago

ken83715 commented 1 year ago

Hi, first, thanks for your work on create_fakemsa and create_manual_template. I made some modifications to get them working in multimer mode. However, I found that AlphaFold keeps querying the pdb70 database anyway: even if I create pdb_hits.sto myself, it still runs the query and overwrites that file. Is there any way to stop AlphaFold from querying the pdb70 database?

Thank you!

Zuricho commented 1 year ago

Actually, I cannot answer your question, because I did not skip the template selection step this way. My approach is to use a template with an identical sequence as a "man-made" template, which is not a very reasonable way to select a template. I agree that creating a fake pdb70 MSA file is a better solution, but I have not implemented it so far. Maybe I will look into this in the near future, and we can discuss it (I'm also curious how you did that 🤣). Thanks a lot for your interest.

ken83715 commented 1 year ago

Thanks for replying. I'm currently using the original AlphaFold repo with some modifications.

My goal is also to use a specific PDB file as the template, so I have to keep AlphaFold from selecting templates from the database (which takes several hours). With --use_precomputed_msas set to true, create_fakemsa generates files in output_folder/msas, so that AlphaFold uses these files instead of spending a very long time searching for templates. Then create_manual_template transforms the PDB file into features.pkl. (Am I understanding it right?)
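To make the precomputed-MSA idea concrete, here is a minimal sketch that writes alignment files containing only the query sequence. It is not the code of create_fakemsa; the function name `write_single_sequence_msas` and the `query` label are made up for illustration, and whether a given AlphaFold version accepts such minimal files is an assumption that would need to be checked. Template hit files (pdb_hits.*) are deliberately left out.

```python
from pathlib import Path


def write_single_sequence_msas(msa_dir, sequence, query_name="query"):
    """Write placeholder alignments that contain only the query sequence.

    Hypothetical helper: the file names come from this thread, the
    contents are an assumption about what a minimal MSA could look like.
    """
    msa_dir = Path(msa_dir)
    msa_dir.mkdir(parents=True, exist_ok=True)

    # A3M is FASTA-like: one header line followed by the sequence.
    a3m_text = f">{query_name}\n{sequence}\n"
    # Stockholm: a header line, "name sequence" rows, and a "//" terminator.
    sto_text = f"# STOCKHOLM 1.0\n\n{query_name} {sequence}\n//\n"

    written = []
    for file_name in ("bfd_uniref_hits.a3m", "mgnify_hits.sto", "uniref90_hits.sto"):
        text = a3m_text if file_name.endswith(".a3m") else sto_text
        path = msa_dir / file_name
        path.write_text(text)
        written.append(path)
    return written
```

Calling `write_single_sequence_msas("output_folder/msas", "MKV")` would create the three files under output_folder/msas.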

Multimer mode generates two folders, A and B, in the msas folder (for two sequences, for example), each containing the files generated by create_fakemsa. A chain_id_map.json file is also needed in the msas folder.
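The layout described above could be set up roughly like this. The helper name `make_multimer_msa_layout` is hypothetical, and the JSON schema (chain ID mapped to a description and a sequence) is an assumption about what AlphaFold expects, so compare it against a chain_id_map.json produced by a real multimer run before relying on it.

```python
import json
from pathlib import Path


def make_multimer_msa_layout(msa_root, chains):
    """Create one subfolder per chain plus chain_id_map.json.

    `chains` maps a chain ID such as "A" to a (description, sequence) tuple.
    """
    msa_root = Path(msa_root)
    msa_root.mkdir(parents=True, exist_ok=True)

    chain_id_map = {}
    for chain_id, (description, sequence) in chains.items():
        # Per-chain folder that will hold that chain's alignment files.
        (msa_root / chain_id).mkdir(exist_ok=True)
        # Assumed schema: description and sequence for each chain ID.
        chain_id_map[chain_id] = {"description": description, "sequence": sequence}

    map_path = msa_root / "chain_id_map.json"
    map_path.write_text(json.dumps(chain_id_map, indent=4, sort_keys=True))
    return map_path
```

The per-chain alignment files would then go into msas/A and msas/B.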

I'm not sure whether monomer mode and multimer mode search the same set of genetic databases or whether there are some differences. The multimer search results include bfd_uniref_hits.a3m, mgnify_hits.sto, pdb_hits.sto, uniprot_hits.sto, and uniref90_hits.sto, while the monomer search results include bfd_uniref_hits.a3m, mgnify_hits.sto, pdb_hits.hhr, and uniref90_hits.sto.
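A quick way to see which of these files are present in a given msas folder is a small check like the one below. The file lists are copied from the search results named above; the function name `missing_msa_files` is made up, and the lists may differ between AlphaFold versions and database presets.

```python
from pathlib import Path

# File names as observed in this thread; they may vary with version and preset.
MONOMER_FILES = (
    "bfd_uniref_hits.a3m",
    "mgnify_hits.sto",
    "pdb_hits.hhr",
    "uniref90_hits.sto",
)
MULTIMER_FILES = (
    "bfd_uniref_hits.a3m",
    "mgnify_hits.sto",
    "pdb_hits.sto",
    "uniprot_hits.sto",
    "uniref90_hits.sto",
)


def missing_msa_files(msa_dir, mode="monomer"):
    """Return the expected file names that are absent from msa_dir."""
    expected = MULTIMER_FILES if mode == "multimer" else MONOMER_FILES
    msa_dir = Path(msa_dir)
    return [name for name in expected if not (msa_dir / name).is_file()]
```

For multimer mode the check would be run once per chain folder, for example `missing_msa_files("output_folder/msas/A", mode="multimer")`.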

I struggled for several days until I found your solution, and I couldn't have made it work without these two Python files. Thanks again, and if you know another way to stop AlphaFold from searching for templates and just predict directly, I would be very thankful.

Zuricho commented 1 year ago

Actually, we are working toward the same objective. I previously worked on an adapted version called ParaFold, which splits the CPU and GPU parts. In ParaFold, I use the feature.pkl file to link the CPU part (actually the MSA part) and the GPU part (actually the AlphaFold model). So the AlphaFold input depends solely on the feature.pkl file.

My approach is to edit feature.pkl, or create a fake feature.pkl, to hack the AlphaFold inputs, but I have only tried this in monomer mode (multimer models might be more complex, but similar).
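Before editing the file, it helps to look at what is inside. The sketch below assumes the feature file is a pickled dictionary whose values are mostly arrays; the function name `summarize_features` is made up for illustration. Note that pickle.load should only be used on files you trust.

```python
import pickle


def summarize_features(pkl_path):
    """Map each feature name to its array shape, or to its type name."""
    with open(pkl_path, "rb") as handle:
        # Assumption: the file holds a dict of feature name -> value.
        features = pickle.load(handle)

    summary = {}
    for key, value in features.items():
        shape = getattr(value, "shape", None)
        # Array-like values report a shape; anything else reports its type.
        summary[key] = tuple(shape) if shape is not None else type(value).__name__
    return summary
```

Comparing the summary of a normal run with that of an edited file is a simple way to check that the keys and shapes still match.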

In this repo, I added 3 different ways to play with feature.pkl:

  1. Create an empty feature.pkl: no MSA, no template, nothing except the input sequence information is in the feature.pkl file. You can find it here: https://github.com/Zuricho/ParaFold_dev/blob/main/parafold/create_empty_feature.py.
  2. Read a manual template: you can find my code in the function named make_manual_template_features in https://github.com/Zuricho/ParaFold_dev/blob/main/parafold/create_manual_template.py. Basically, I read the coordinates from a .pdb file and set the coordinates to correspond to the input .pdb file.
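As an illustration of the "read coordinates from a .pdb file" step, here is a minimal sketch that pulls the C-alpha coordinates of one chain out of PDB text using the standard fixed-column ATOM record layout. It is not the code of make_manual_template_features; the function name `read_ca_coordinates` is hypothetical, and mapping these coordinates into template features is not shown.

```python
def read_ca_coordinates(pdb_text, chain_id="A"):
    """Return (residue name, residue number, (x, y, z)) for each CA atom."""
    coords = []
    for line in pdb_text.splitlines():
        # Only ATOM records long enough to hold the coordinate columns.
        if not line.startswith("ATOM") or len(line) < 54:
            continue
        atom_name = line[12:16].strip()   # columns 13-16
        if atom_name != "CA" or line[21] != chain_id:  # chain ID is column 22
            continue
        res_name = line[17:20].strip()    # columns 18-20
        res_seq = int(line[22:26])        # columns 23-26
        x = float(line[30:38])            # columns 31-38
        y = float(line[38:46])            # columns 39-46
        z = float(line[46:54])            # columns 47-54
        coords.append((res_name, res_seq, (x, y, z)))
    return coords
```

Typical usage would be `read_ca_coordinates(open("template.pdb").read())`, where template.pdb is a placeholder file name.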

I know what you are doing: making manual pdb_hits files to modify the alignment between the template PDB and the input sequence. I also tried this before but found it more complicated than I thought (maybe because I did not fully understand the alignment process 🤣). Maybe I can look into this sooner or later.