flyark / AFM-LIS

Local Interaction Score (LIS) Calculation from AlphaFold-Multimer (Enhanced Protein-Protein Interaction Discovery via AlphaFold-Multimer)
https://www.flyrnai.org/tools/fly_predictome
MIT License

Managing Colab-generated PDB output files #6

Closed mavericb closed 2 months ago

mavericb commented 3 months ago

Issue Description

We are having significant trouble managing the files generated by ColabFold. The current naming convention does not seem to be compatible with the AFM-LIS code.

Current File Naming Convention

Here are examples of the current file names generated by ColabFold:

cite.bibtex
config.json
log.txt
run_1_0__T_0.075__seed_111__num_res_44__num_ligand_res_0__use_ligand_context_True__ligand_cutoff_distance_8.0__batch_size_1__number_of_batches_5__model_path_._model_params_ligandmpnn_v_32_010_25.pt.a3m
run_1_0__T_0.075__seed_111__num_res_44__num_ligand_res_0__use_ligand_context_True__ligand_cutoff_distance_8.0__batch_size_1__number_of_batches_5__model_path_._model_params_ligandmpnn_v_32_010_25.pt_coverage.png
run_1_0__T_0.075__seed_111__num_res_44__num_ligand_res_0__use_ligand_context_True__ligand_cutoff_distance_8.0__batch_size_1__number_of_batches_5__model_path_._model_params_ligandmpnn_v_32_010_25.pt_scores_alphafold2_multimer_v3_model_1_seed_000.json
run_1_0__T_0.075__seed_111__num_res_44__num_ligand_res_0__use_ligand_context_True__ligand_cutoff_distance_8.0__batch_size_1__number_of_batches_5__model_path_._model_params_ligandmpnn_v_32_010_25.pt_scores_alphafold2_multimer_v3_model_2_seed_000.json
run_1_0__T_0.075__seed_111__num_res_44__num_ligand_res_0__use_ligand_context_True__ligand_cutoff_distance_8.0__batch_size_1__number_of_batches_5__model_path_._model_params_ligandmpnn_v_32_010_25.pt_scores_alphafold2_multimer_v3_model_3_seed_000.json
run_1_0__T_0.075__seed_111__num_res_44__num_ligand_res_0__use_ligand_context_True__ligand_cutoff_distance_8.0__batch_size_1__number_of_batches_5__model_path_._model_params_ligandmpnn_v_32_010_25.pt_scores_alphafold2_multimer_v3_model_4_seed_000.json
run_1_0__T_0.075__seed_111__num_res_44__num_ligand_res_0__use_ligand_context_True__ligand_cutoff_distance_8.0__batch_size_1__number_of_batches_5__model_path_._model_params_ligandmpnn_v_32_010_25.pt_scores_alphafold2_multimer_v3_model_5_seed_000.json
run_1_0__T_0.075__seed_111__num_res_44__num_ligand_res_0__use_ligand_context_True__ligand_cutoff_distance_8.0__batch_size_1__number_of_batches_5__model_path_._model_params_ligandmpnn_v_32_010_25.pt_unrelaxed_alphafold2_multimer_v3_model_1_seed_000.pdb
run_1_0__T_0.075__seed_111__num_res_44__num_ligand_res_0__use_ligand_context_True__ligand_cutoff_distance_8.0__batch_size_1__number_of_batches_5__model_path_._model_params_ligandmpnn_v_32_010_25.pt_unrelaxed_alphafold2_multimer_v3_model_2_seed_000.pdb
run_1_0__T_0.075__seed_111__num_res_44__num_ligand_res_0__use_ligand_context_True__ligand_cutoff_distance_8.0__batch_size_1__number_of_batches_5__model_path_._model_params_ligandmpnn_v_32_010_25.pt_unrelaxed_alphafold2_multimer_v3_model_3_seed_000.pdb
run_1_0__T_0.075__seed_111__num_res_44__num_ligand_res_0__use_ligand_context_True__ligand_cutoff_distance_8.0__batch_size_1__number_of_batches_5__model_path_._model_params_ligandmpnn_v_32_010_25.pt_unrelaxed_alphafold2_multimer_v3_model_4_seed_000.pdb
run_1_0__T_0.075__seed_111__num_res_44__num_ligand_res_0__use_ligand_context_True__ligand_cutoff_distance_8.0__batch_size_1__number_of_batches_5__model_path_._model_params_ligandmpnn_v_32_010_25.pt_unrelaxed_alphafold2_multimer_v3_model_5_seed_000.pdb

AFM-LIS code (excerpt from https://github.com/flyark/AFM-LIS/blob/main/alphafold_interaction_scores_github_20240421.ipynb):

```python
def calculate_pae(pdb_file_path: str, print_results: bool = True, pae_cutoff: float = 12.0, name_separator: str = "___"):
    parser = PDB.PDBParser()
    file_name = pdb_file_path.split("/")[-1]
    data_folder = pdb_file_path.split("/")[-2]

    if 'rank' not in file_name:
        if print_results:
            print(f"Skipping {file_name} as it does not contain 'rank' in the file name.")
        return None

    # Splitting the file name first with '_unrelaxed'
    parts = file_name.split('_unrelaxed')
    if len(parts) < 2:
        if print_results:
            print(f"Warning: File {file_name} does not follow expected '_unrelaxed' naming convention. Skipping this file.")
        return None

    protein_2_temp = parts[1]  # Defining protein_2_temp here to use later for rank extraction

    # Using name_separator to separate protein_1 and protein_2 from the first part
    if parts[0].count(name_separator) == 1:
        protein_1, protein_2 = parts[0].split(name_separator)
        pae_file_name = data_folder + '+' + protein_1 + name_separator + protein_2 + '_pae.png'
    elif parts[0].count(name_separator) > 1:
        protein_1 = parts[0]
        protein_2 = parts[0]
        pae_file_name = data_folder + '+' + protein_1 + '_pae.png'
    else:
        if print_results:
            print(f"Warning: Unexpected file naming convention for {file_name}. Skipping this file.")
        return None

    # Extract rank information from protein_2_temp
    if "_unrelaxed_rank_00" in file_name:
        rank_temp = file_name.split("_unrelaxed_rank_00")[1]
        rank = rank_temp.split("_alphafold2")[0]
    else:
        rank = "Not Available"  # or any default value you prefer
```

mavericb commented 3 months ago

Parameters available in the current file name:

| Parameter | Value | Explanation |
| --- | --- | --- |
| run | 1_0 | Run or iteration number |
| T | 0.075 | Temperature, likely used in a simulation or optimization process |
| seed | 111 | Seed for random number generation, ensures reproducibility |
| num_res | 44 | Number of residues in the protein or complex |
| num_ligand_res | 0 | Number of ligand residues (in this case, none) |
| use_ligand_context | True | Indicates whether ligand context was considered in the analysis |
| ligand_cutoff_distance | 8.0 | Cutoff distance (in Angstroms) for ligand interactions |
| batch_size | 1 | Size of the batch used in processing |
| number_of_batches | 5 | Total number of batches processed |
| model_path | . | Model path (in this case, the current directory) |
| model_params | ligandmpnn_v_32_010_25.pt | Name of the model parameters file |
| unrelaxed | - | Indicates the structure has not undergone relaxation |
| alphafold2_multimer_v3 | - | Version of AlphaFold used (multimer v3) |
| model | 1 | Specific model number used |
| seed | 000 | Another seed, likely used in a different phase of the process |

mavericb commented 3 months ago

Not only is the separator ___ never present, but "rank" is never present either, so the code always skips the file because of this check:

```python
if 'rank' not in file_name:
    if print_results:
        print(f"Skipping {file_name} as it does not contain 'rank' in the file name.")
    return None
```
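To see which check rejects a given file name, the conditions in the quoted `calculate_pae` excerpt can be replayed on the name alone. The following is a minimal sketch, not part of AFM-LIS: the helper `check_afm_lis_name` and its return format are hypothetical, and it only mirrors the checks visible in the excerpt above.

```python
def check_afm_lis_name(file_name: str, name_separator: str = "___"):
    """Replay the file-name checks from the quoted calculate_pae excerpt.

    Returns (ok, detail). When ok is False, detail is the reason the file
    would be skipped; when ok is True, detail is the extracted rank string.
    """
    if "rank" not in file_name:
        return False, "no 'rank' in file name"
    parts = file_name.split("_unrelaxed")
    if len(parts) < 2:
        return False, "no '_unrelaxed' in file name"
    if parts[0].count(name_separator) < 1:
        return False, f"no '{name_separator}' separator before '_unrelaxed'"
    if "_unrelaxed_rank_00" in file_name:
        rank = file_name.split("_unrelaxed_rank_00")[1].split("_alphafold2")[0]
    else:
        rank = "Not Available"
    return True, rank


# A name without 'rank' fails the first check.
print(check_afm_lis_name("job_unrelaxed_alphafold2_multimer_v3_model_1_seed_000.pdb"))
# A name with separator, '_unrelaxed' and a rank passes and yields rank "1".
print(check_afm_lis_name(
    "protein_1___protein_2_unrelaxed_rank_001_alphafold2_multimer_v3_model_3_seed_000.pdb"
))
```

Running this over a folder listing before starting the LIS calculation shows which names would be silently skipped and why.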

flyark commented 3 months ago

I am sorry for the issue related to the file naming convention.

I am not familiar with LigandMPNN-derived file names, but as a workaround you can temporarily rename the files so that they are compatible with the LIS calculation. Once the calculation is done, you can rename them back to their original names.

The LIS calculation can be run at the folder level regardless of the progress of the ColabFold prediction. While AlphaFold-Multimer is still running, the folder can contain a mix of ranked files (finished) and temporary files (not finished). The current code is designed to process only finished predictions, which have "rank" in the file name.

Here is a temporary approach you can use:

  1. Create a CSV or TSV file: This file should contain the original names of your JSON and PDB files. Add a column for the new, calculation-compatible names (with "rank").

  2. Write a bash script that renames the files based on the CSV or TSV file (you can use ChatGPT to generate a custom bash script):

    • Rename the files to follow the naming convention required by the LIS calculation (e.g., current name -> target___candidate{n}_rank_001, 002, 003, 004, 005 for both JSON and PDB files).
    • Perform the LIS calculation using these renamed files.

  3. Revert the file names:

    • After the LIS calculation, use the CSV/TSV file to rename the files back to their original names.
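The rename-and-revert steps above can be sketched as follows. The suggestion in the thread is a bash script; this is a minimal Python equivalent instead, and the function name `rename_from_table` and the CSV column names `original` and `renamed` are assumptions made for illustration.

```python
import csv
from pathlib import Path


def rename_from_table(folder, table_path, revert=False):
    """Rename files in `folder` using a CSV with 'original' and 'renamed' columns.

    With revert=True the mapping is applied in reverse, which restores the
    original names. Returns the list of (old_name, new_name) pairs renamed.
    """
    folder = Path(folder)
    done = []
    with open(table_path, newline="") as handle:
        for row in csv.DictReader(handle):
            src, dst = row["original"], row["renamed"]
            if revert:
                src, dst = dst, src
            source = folder / src
            target = folder / dst
            if not source.exists():
                continue  # already renamed, or not produced yet
            if target.exists():
                raise FileExistsError(f"{target} already exists")
            source.rename(target)
            done.append((src, dst))
    return done
```

A typical session would call `rename_from_table(folder, "names.csv")`, run the LIS calculation on the folder, and then call `rename_from_table(folder, "names.csv", revert=True)`. Keeping the mapping in a file rather than in memory means the original names can still be restored if the calculation is interrupted.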
mavericb commented 3 months ago

Hey there, thanks so much for your answer!! I renamed the files using the convention suggested.

target___candidate_1_rank_001.pdb  target___candidate_1_rank_003.pdb  target___candidate_1_rank_005.pdb
target___candidate_1_rank_002.pdb  target___candidate_1_rank_004.pdb

In particular, I created a script that ranks the models by the average pLDDT read from the files, but nothing is generated, even though I don't get any errors. I think the problem is that the naming convention is still not compatible. In fact, I get:

Debug: protein_2_temp: .pdb
Debug: protein_1: target, protein_2: candidate_1_rank_003, pae_file_name: renamed+target___candidate_1_rank_003_pae.png
Debug: Rank: Not Available

Do you have an example of a correct filename that I can transform my files into?

Thanks so much for your time and help

flyark commented 3 months ago

Try these.

protein_1___protein_2_unrelaxed_rank_001_alphafold2_multimer_v3_model_3_seed_000.pdb
protein_1___protein_2_scores_rank_001_alphafold2_multimer_v3_model_5_seed_000.json
protein_1___protein_3_unrelaxed_rank_001_alphafold2_multimer_v3_model_5_seed_000.pdb
protein_1___protein_3_scores_rank_001_alphafold2_multimer_v3_model_1_seed_000.json
protein_1___protein_4_unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb
protein_1___protein_4_scores_rank_001_alphafold2_multimer_v3_model_4_seed_000.json
protein_1___protein_5_unrelaxed_rank_001_alphafold2_multimer_v3_model_4_seed_000.pdb
protein_1___protein_5_scores_rank_001_alphafold2_multimer_v3_model_3_seed_000.json
protein_1___protein_6_unrelaxed_rank_001_alphafold2_multimer_v3_model_3_seed_000.pdb
protein_1___protein_6_scores_rank_001_alphafold2_multimer_v3_model_3_seed_000.json
protein_1___protein_7_unrelaxed_rank_001_alphafold2_multimer_v3_model_3_seed_000.pdb
protein_1___protein_7_scores_rank_001_alphafold2_multimer_v3_model_3_seed_000.json
protein_1___protein_8_unrelaxed_rank_001_alphafold2_multimer_v3_model_3_seed_000.pdb
protein_1___protein_8_scores_rank_001_alphafold2_multimer_v3_model_3_seed_000.json
protein_1___protein_9_unrelaxed_rank_001_alphafold2_multimer_v3_model_3_seed_000.pdb
protein_1___protein_9_scores_rank_001_alphafold2_multimer_v3_model_3_seed_000.json
protein_1___protein_10_unrelaxed_rank_001_alphafold2_multimer_v3_model_3_seed_000.pdb
protein_1___protein_10_scores_rank_001_alphafold2_multimer_v3_model_1_seed_000.json
protein_1___protein_11_unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb
protein_1___protein_11_scores_rank_001_alphafold2_multimer_v3_model_3_seed_000.json
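Judging from the checks in the quoted `calculate_pae` excerpt, a compatible PDB name has the shape `<protein_1>___<protein_2>_unrelaxed_rank_00N_alphafold2_..._model_M_seed_SSS.pdb`, with a matching `_scores_rank_00N_` JSON file. Below is a minimal sketch of building such names for unranked ColabFold outputs, ranking models by mean pLDDT as mavericb describes. The function `plan_ranked_names` is hypothetical, the `"plddt"` key in the scores JSON is an assumption about the file contents, and ranking by mean pLDDT is not necessarily how ColabFold itself ranks models.

```python
import json
import re
from pathlib import Path
from statistics import mean

# Unranked score file: "<job>_scores_alphafold2_..._model_N_seed_SSS.json"
SCORES_NAME = re.compile(r"^(?P<job>.+)_scores_(?P<tail>alphafold2_.+)\.json$")


def plan_ranked_names(folder, protein_1, protein_2, name_separator="___"):
    """Map unranked score/PDB file names to rank-style names by mean pLDDT.

    Only builds the mapping {old_name: new_name}; nothing is renamed, and the
    PDB entries are listed whether or not the PDB files exist.
    """
    scored = []
    for path in Path(folder).glob("*_scores_alphafold2_*.json"):
        match = SCORES_NAME.match(path.name)
        if match is None or "_rank_" in path.name:
            continue  # unexpected name, or already ranked
        with open(path) as handle:
            plddt = json.load(handle)["plddt"]  # assumed: per-residue pLDDT list
        scored.append((mean(plddt), match["job"], match["tail"]))

    pair = f"{protein_1}{name_separator}{protein_2}"
    plan = {}
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    for rank, (_, job, tail) in enumerate(ranked, start=1):
        plan[f"{job}_scores_{tail}.json"] = f"{pair}_scores_rank_{rank:03d}_{tail}.json"
        plan[f"{job}_unrelaxed_{tail}.pdb"] = f"{pair}_unrelaxed_rank_{rank:03d}_{tail}.pdb"
    return plan
```

The returned mapping can be written to a CSV and kept, so the original names can be restored after the LIS calculation.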

mavericb commented 2 months ago

Thank you for your answer. We ended up writing a custom filtering script based on pLDDT, pAE, and RMSD. We'll try again in the future. Thanks for your work!