How can I create the input features for my own dataset by running the .py file? Is it possible to simplify the input to only the sequence of the protein?

FreshAirTonight / af2complex

Predicting direct protein-protein interactions with AlphaFold deep learning neural network models.

How can I create the input features for my own dataset by running the .py file? Is it possible to simplify the input to only the sequence of the protein? #9

Open Zkkkkkui opened 1 year ago

FreshAirTonight commented 1 year ago

Yes. The input feature generation is similar to AF2. A sample shell script is here for individual sequences.

Zkkkkkui commented 1 year ago

What about this .py file: run_af2c_fea.py, which is also said to be used to get features?

FreshAirTonight commented 1 year ago

The shell script calls the Python script you referred to, which does the actual work.

Zkkkkkui commented 1 year ago

Could you please give an example script on Colab that just generates the features of a protein sequence from UniProt?

Zkkkkkui commented 1 year ago

Hi! I still have a problem getting the features for my dataset. I am not able to use AF2Complex locally, and I am not sure how to run these .sh scripts. I tried taking a features.pkl file from the AlphaFold output after running a protein prediction, but when I used it in this Colab to predict a complex, it always failed like this: KeyError: 'msa'

CalledProcessError Traceback (most recent call last) in
     39
     40 # with io.capture_output() as captured:
---> 41 get_ipython().run_line_magic('shell', 'python -u ../run_af2c_mod.py {pred_params}')
     42 print(f'DONE! (predictions available on {FLAGS.output_dir}' )

Could you explain and help me with that? Thank you!

FreshAirTonight commented 1 year ago

The Colab notebook we provided only takes features.pkl files of individual monomers. If you use another AlphaFold notebook to generate the input features, make sure that you use the monomer pipeline, not the multimer pipeline. Then tar these pickle files into a single tarball.
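A quick way to inspect a features file before uploading it is sketched below. This is only a minimal check, assuming the file is a pickled dictionary (optionally gzipped); the helper names `load_features` and `missing_keys` are hypothetical, and `'msa'` is used as the required key only because it is the one named in the KeyError above. Only unpickle files you generated yourself or otherwise trust.

```python
import gzip
import pickle
from pathlib import Path


def load_features(path):
    """Load a features pickle, handling a trailing .gz suffix transparently."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as handle:
        return pickle.load(handle)


def missing_keys(features, required=("msa",)):
    """Return the required keys that are absent from the feature dictionary."""
    return [key for key in required if key not in features]


# Example usage (the path is illustrative):
# features = load_features("hgc/HgcA/features.pkl.gz")
# print(sorted(features))        # list the available feature names
# print(missing_keys(features))  # an empty list means 'msa' is present
```

If `missing_keys` reports `'msa'`, the file is unlikely to work as input here, and regenerating it with a monomer pipeline is the first thing to try.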

In this example, you have a heterodimer HgcAB composed of two monomers, HgcA and HgcB. Organize the feature input as follows:

hgc.tar
├── hgc
│   ├── HgcA
│   │   └── features.pkl.gz
│   └── HgcB
│       └── features.pkl.gz

Then tar this folder into a single tarball and upload it to our notebook. Note that our code can take gzipped pickle files directly. It is up to you whether or not to gzip the pickle files before you make the tarball.
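The packing step can be sketched in Python with the standard `tarfile` module. This is a minimal illustration, assuming the folder layout shown above (one subfolder per monomer, each holding `features.pkl` or `features.pkl.gz`); the function name `make_feature_tarball` is hypothetical and not part of AF2Complex.

```python
import tarfile
from pathlib import Path


def make_feature_tarball(root_dir, tar_path):
    """Pack root_dir (e.g. hgc/) into tar_path with root_dir as the top-level folder.

    Returns the archive paths of the feature files that were found.
    """
    root = Path(root_dir).resolve()
    # Expect <root>/<monomer>/features.pkl or features.pkl.gz
    feature_files = sorted(root.glob("*/features.pkl*"))
    if not feature_files:
        raise FileNotFoundError(f"no <monomer>/features.pkl[.gz] found under {root}")
    # Plain (uncompressed) tar; directories are added recursively
    with tarfile.open(tar_path, "w") as tar:
        tar.add(root, arcname=root.name)
    return [f"{root.name}/{f.relative_to(root).as_posix()}" for f in feature_files]


# Example usage (paths are illustrative):
# make_feature_tarball("hgc", "hgc.tar")
```

Listing the result with `tarfile.open("hgc.tar").getnames()` is an easy way to confirm that each monomer folder sits directly under the top-level folder.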

After you upload the attached tarball, you may run a test to predict a heterodimer using the target syntax: HgcA/HgcB 433 HgcAB

Zkkkkkui commented 1 year ago

Thank you for the instructions! I successfully generated the features with another AlphaFold notebook (it used 0 sequence templates, and I am not sure whether that matters compared to your examples) and ran predictions on some protein complexes. However, the false negative rate seems very high (proteins that are supposed to interact were not predicted to interact). Is there any way to improve that?

FreshAirTonight commented 1 year ago

Many things to try, such as:

Different DL models, if you haven't tried them all, including the monomer DL models
More recycles, between 8 and 20
Structural templates, if possible

fereidoon27 commented 2 months ago

@FreshAirTonight In AlphaFold, MSAs are built using jackhmmer and HHblits. To avoid the extensive data downloads and CPU processing, precomputed MSAs and the features.pkl file can be used instead.

Due to my limited resources, I'm focusing on the GPU-based second step and considering tools like ColabFold to create the features.pkl files.

What files and steps are needed to create the features.pkl file? Is there a tool available on Google Colab or Kaggle for this?

FreshAirTonight commented 2 months ago

@fereidoon27 An example feature generation script, run_fea_gen.sh, can be found under the example folder. If you have limited resources, consider using the uniprot mode, in which MSA construction uses only the UniProt library (create dummy files for the other sequence libraries to get around file checking). With this option, you can generate features for hundreds of protein sequences of moderate length on a decent workstation (e.g., with a 4 TB NVMe drive and 24 cores).

You may use precomputed MSAs as well. Under the intended output folder of a protein, create a subfolder named msas and place the MSA files in it. The only MSA required in uniprot mode is uniprot_hits.sto. Add pdb_hits.sto for templates if needed.
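Staging the precomputed MSAs as described could look like the sketch below. It is only an illustration: the function name `stage_precomputed_msas` and its parameters are hypothetical, and it assumes you already have Stockholm-format alignments on disk that only need to be copied under the expected file names.

```python
import shutil
from pathlib import Path


def stage_precomputed_msas(protein_out_dir, uniprot_sto, pdb_sto=None):
    """Copy precomputed MSAs into <protein_out_dir>/msas under the expected names."""
    msa_dir = Path(protein_out_dir) / "msas"
    msa_dir.mkdir(parents=True, exist_ok=True)
    # Required in uniprot mode
    shutil.copyfile(uniprot_sto, msa_dir / "uniprot_hits.sto")
    # Optional, only needed when templates are wanted
    if pdb_sto is not None:
        shutil.copyfile(pdb_sto, msa_dir / "pdb_hits.sto")
    return msa_dir


# Example usage (paths are illustrative):
# stage_precomputed_msas("af_fea/HgcA", "precomputed/HgcA_uniprot.sto")
```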