dina-lab3D / CombFold

Apache License 2.0

Issue with FileNotFoundError When Predicting Complex with Repetitive Subunits #6

Closed genepearl closed 4 months ago

genepearl commented 4 months ago

Hi,

I hope this message finds you well. I'm currently trying to predict a homomeric complex made up of 26 copies of a single subunit. Unfortunately, the run fails with the error FileNotFoundError: [Errno 2] No such file or directory: '/content/tmp_assembled/assembled_results'.

As a workaround, I tried to differentiate the subunits by appending one extra G to the sequence of one subunit and an extra GG to the other, hoping to bypass the issue. My JSON file ended up looking like this:

{
  "A0": {
    "name": "A0",
    "chain_names": [ "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M" ],
    "start_res": 1,
    "sequence": "FTEEEIKKIRESLKLSVEALEVTPKDFEKALELLEEVAINLMEIFKDDPMKALKIAFKFTNAIAKLYVAHESKDVADAMAIMAEVTKYILEILEKVLEEG"
  },
  "G0": {
    "name": "G0",
    "chain_names": [ "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z" ],
    "start_res": 1,
    "sequence": "FTEEEIKKIRESLKLSVEALEVTPKDFEKALELLEEVAINLMEIFKDDPMKALKIAFKFTNAIAKLYVAHESKDVADAMAIMAEVTKYILEILEKVLEEGG"
  }
}

I generated PDB files for each of the subunits using AlphaFold-Multimer (AFM). However, this approach resulted in the same error.

Could you please provide guidance on how to resolve this? Any assistance would be greatly appreciated.

Thank you

ben-shor commented 4 months ago

Hi,

I assume you got this error because the assembly process failed. This usually means that the models were not diverse enough, so the assembly algorithm could not combine different interactions to join all 26 subunits.

A few questions that will help me better understand:

  1. What models did you supply? How many copies of the subunit are in each model?
  2. Why exactly did you create a new separate subunit with an extra G? What is this workaround trying to solve?
  3. Could you supply the file {output_path}/_unified_representation/assembly_output/output.log so that I can better identify the issue?

genepearl commented 4 months ago

Thank you for such a quick response.

  1. My original approach was to use a JSON file with a single subunit consisting of 26 copies of the same sequence. I used 15 models in total distributed as follows:

This did not work and led to this error

--- Searching for subunits in supplied PDB files
found full A0 in A0_A0_98374_unrelaxed_rank_004_alphafold2_ptm_model_5_seed_000.pdb chain A
found full A0 in A0_A0_A0_cb42a_unrelaxed_rank_002_alphafold2_ptm_model_4_seed_000.pdb chain A
found full A0 in A0_A0_A0_cb42a_unrelaxed_rank_003_alphafold2_ptm_model_3_seed_000.pdb chain A
found full A0 in A0_A0_A0_cb42a_unrelaxed_rank_005_alphafold2_ptm_model_5_seed_000.pdb chain A
found full A0 in A0_A0_A0_cb42a_unrelaxed_rank_004_alphafold2_ptm_model_2_seed_000.pdb chain A
found full A0 in A4_be333_unrelaxed_rank_005_alphafold2_ptm_model_2_seed_000.pdb chain A
found full A0 in A0_A0_98374_unrelaxed_rank_003_alphafold2_ptm_model_4_seed_000.pdb chain A
found full A0 in A0_A0_A0_cb42a_unrelaxed_rank_001_alphafold2_ptm_model_1_seed_000.pdb chain A
found full A0 in A0_A0_98374_unrelaxed_rank_001_alphafold2_ptm_model_1_seed_000.pdb chain A
found full A0 in A4_be333_unrelaxed_rank_003_alphafold2_ptm_model_5_seed_000.pdb chain A
found full A0 in A0_A0_98374_unrelaxed_rank_005_alphafold2_ptm_model_2_seed_000.pdb chain A
found full A0 in A4_be333_unrelaxed_rank_002_alphafold2_ptm_model_1_seed_000.pdb chain A
found full A0 in A4_be333_unrelaxed_rank_001_alphafold2_ptm_model_3_seed_000.pdb chain A
found full A0 in A0_A0_98374_unrelaxed_rank_002_alphafold2_ptm_model_3_seed_000.pdb chain A
found full A0 in A4_be333_unrelaxed_rank_004_alphafold2_ptm_model_4_seed_000.pdb chain A
--- Extracting representative subunits (for each subunit, its best scored model in the PDBs folder)
rep A0 has plddt score 61.587373737373746
--- Extracting pairwise transformations between subunits (from each PDB file with 2 or more subunits)
- Extracting pairwise transformations from file /content/CombFold-master/custom/pdbs/A0_A0_98374_unrelaxed_rank_004_alphafold2_ptm_model_5_seed_000.pdb
- Extracting pairwise transformations from file /content/CombFold-master/custom/pdbs/A0_A0_A0_cb42a_unrelaxed_rank_002_alphafold2_ptm_model_4_seed_000.pdb
- Extracting pairwise transformations from file /content/CombFold-master/custom/pdbs/A0_A0_A0_cb42a_unrelaxed_rank_003_alphafold2_ptm_model_3_seed_000.pdb
- Extracting pairwise transformations from file /content/CombFold-master/custom/pdbs/A0_A0_A0_cb42a_unrelaxed_rank_005_alphafold2_ptm_model_5_seed_000.pdb
- Extracting pairwise transformations from file /content/CombFold-master/custom/pdbs/A0_A0_A0_cb42a_unrelaxed_rank_004_alphafold2_ptm_model_2_seed_000.pdb
- Extracting pairwise transformations from file /content/CombFold-master/custom/pdbs/A4_be333_unrelaxed_rank_005_alphafold2_ptm_model_2_seed_000.pdb
- Extracting pairwise transformations from file /content/CombFold-master/custom/pdbs/A0_A0_98374_unrelaxed_rank_003_alphafold2_ptm_model_4_seed_000.pdb
- Extracting pairwise transformations from file /content/CombFold-master/custom/pdbs/A0_A0_A0_cb42a_unrelaxed_rank_001_alphafold2_ptm_model_1_seed_000.pdb
- Extracting pairwise transformations from file /content/CombFold-master/custom/pdbs/A0_A0_98374_unrelaxed_rank_001_alphafold2_ptm_model_1_seed_000.pdb
- Extracting pairwise transformations from file /content/CombFold-master/custom/pdbs/A4_be333_unrelaxed_rank_003_alphafold2_ptm_model_5_seed_000.pdb
- Extracting pairwise transformations from file /content/CombFold-master/custom/pdbs/A0_A0_98374_unrelaxed_rank_005_alphafold2_ptm_model_2_seed_000.pdb
- Extracting pairwise transformations from file /content/CombFold-master/custom/pdbs/A4_be333_unrelaxed_rank_002_alphafold2_ptm_model_1_seed_000.pdb
- Extracting pairwise transformations from file /content/CombFold-master/custom/pdbs/A4_be333_unrelaxed_rank_001_alphafold2_ptm_model_3_seed_000.pdb
- Extracting pairwise transformations from file /content/CombFold-master/custom/pdbs/A0_A0_98374_unrelaxed_rank_002_alphafold2_ptm_model_3_seed_000.pdb
- Extracting pairwise transformations from file /content/CombFold-master/custom/pdbs/A4_be333_unrelaxed_rank_004_alphafold2_ptm_model_4_seed_000.pdb
--- Finished building unified representation
--- Running combinatorial assembly algorithm, may take a while
--- Finished combinatorial assembly, writing output models
Could not assemble, exiting
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-8-c171dffdddcc> in <cell line: 40>()
     38                                max_results_number=int(max_results_number))
     39 
---> 40 shutil.copytree(os.path.join(tmp_assembled_folder, "assembled_results"),
     41                 assembled_folder)
     42 

/usr/lib/python3.10/shutil.py in copytree(src, dst, symlinks, ignore, copy_function, ignore_dangling_symlinks, dirs_exist_ok)
    555     """
    556     sys.audit("shutil.copytree", src, dst)
--> 557     with os.scandir(src) as itr:
    558         entries = list(itr)
    559     return _copytree(entries=entries, src=src, dst=dst, symlinks=symlinks,

FileNotFoundError: [Errno 2] No such file or directory: '/content/tmp_assembled/assembled_results'
  2. My unsuccessful attempt to use CombFold for a homomeric complex led me to believe it is better suited to heteromeric complexes. I therefore changed my strategy and created two distinct subunits from my original sequence: one with an added glycine and another with two added glycines, assuming these small changes wouldn't significantly alter the overall structure. I matched the number of models to those in the provided example, and the simulation is currently running. Is there an alternative approach that allows for the use of only a single subunit?
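
The traceback above is raised by the notebook's copy step, which runs even though the log already says "Could not assemble, exiting", so the `assembled_results` folder never exists. A minimal sketch of a guard around that copy, assuming the same `tmp_assembled_folder` and `assembled_folder` variables as the notebook cell; the helper name `copy_assembled_results` is hypothetical and not part of CombFold:

```python
import os
import shutil


def copy_assembled_results(tmp_assembled_folder, assembled_folder):
    """Copy the assembly output if it exists.

    Hypothetical helper (not a CombFold function). Returns True when the
    results were copied and False when the assembly produced no output.
    """
    src = os.path.join(tmp_assembled_folder, "assembled_results")
    if not os.path.isdir(src):
        # The assembly step failed earlier; report that instead of letting
        # shutil.copytree raise FileNotFoundError on the missing folder.
        print(f"No assembled results found at {src}. "
              "The assembly step did not produce any models; "
              "inspect the assembly output.log for the cause.")
        return False
    # dirs_exist_ok (Python 3.8+) lets the cell be re-run safely.
    shutil.copytree(src, assembled_folder, dirs_exist_ok=True)
    return True
```

This does not fix the failed assembly; it only turns the secondary FileNotFoundError into a message that points at the real problem.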
ben-shor commented 4 months ago

Hi. In your use case, it is much better to keep all subunits identical rather than force a heteromeric configuration. The workaround also cannot work unless at least one model contains the complete altered subunit (including the additional G). First of all, note that the logs you provided are not the same as the log at {output_path}/_unified_representation/assembly_output/output.log, so please supply that file as well.
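
For the identical-subunit setup, the subunit definition can be generated rather than written by hand. The sketch below assumes the JSON fields used earlier in this thread (`name`, `chain_names`, `start_res`, `sequence`) and single-letter chain names A to Z, which caps it at 26 copies; `homomer_subunits` is a hypothetical helper, not a CombFold function.

```python
import json
import string


def homomer_subunits(name, sequence, copies):
    """Build a single-subunit definition with one chain name per copy.

    Assumption: chain names are single uppercase letters, so at most 26
    copies can be described this way.
    """
    if not 1 <= copies <= len(string.ascii_uppercase):
        raise ValueError("copies must be between 1 and 26")
    return {
        name: {
            "name": name,
            "chain_names": list(string.ascii_uppercase[:copies]),
            "start_res": 1,
            "sequence": sequence,
        }
    }


if __name__ == "__main__":
    seq = ("FTEEEIKKIRESLKLSVEALEVTPKDFEKALELLEEVAINLMEIFKDDPMKALKIAFKFTNAIAKLY"
           "VAHESKDVADAMAIMAEVTKYILEILEKVLEE")
    # Print the definition for 13 copies as JSON text.
    print(json.dumps(homomer_subunits("A0", seq, 13), indent=2))
```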

Another issue I can see in the logs is that subunit A0 appears only once in each model. For example, after the line

found full A0 in A0_A0_98374_unrelaxed_rank_004_alphafold2_ptm_model_5_seed_000.pdb chain A

I would expect to see another line:

found full A0 in A0_A0_98374_unrelaxed_rank_004_alphafold2_ptm_model_5_seed_000.pdb chain B

Is there actually a chain named B (or anything other than A) in that model whose sequence is identical to A0?
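
One way to answer that question is to read the chain sequences straight out of the PDB file. The following is a rough sketch that relies only on the fixed-column ATOM record layout of the PDB format and on the 20 standard residue names; `chain_sequences` and `chains_matching` are hypothetical helpers, not part of CombFold or AFM.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequences(pdb_text):
    """Return {chain_id: one-letter sequence} built from CA atoms."""
    residues = {}
    seen = set()
    for line in pdb_text.splitlines():
        # Only ATOM records for alpha carbons: one per residue.
        if not line.startswith("ATOM") or line[12:16].strip() != "CA":
            continue
        chain = line[21]
        res_key = (chain, line[22:27])  # residue number + insertion code
        if res_key in seen:             # skip alternate locations
            continue
        seen.add(res_key)
        # Unknown residue names are recorded as "X".
        residues.setdefault(chain, []).append(THREE_TO_ONE.get(line[17:20], "X"))
    return {chain: "".join(seq) for chain, seq in residues.items()}


def chains_matching(pdb_text, subunit_sequence):
    """Return the sorted chain IDs whose sequence equals subunit_sequence."""
    return sorted(chain for chain, seq in chain_sequences(pdb_text).items()
                  if seq == subunit_sequence)
```

Calling `chains_matching(open(path).read(), sequence)` on a dimer model should return two chain IDs; a single ID would mean the file holds only one copy of the subunit.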

Two more tips that may yield better results:

  1. I can see in the logs that the representative structure has a low average pLDDT (61.587), which may mean that it contains many disordered amino acids. In that case it may be better to divide the subunit into two smaller subunits. For example, if the subunit has a length of 600 and the amino acids in positions 300-400 are disordered, you can define subunit A0 on amino acids 1-300 and A1 on amino acids 400-600. If the disordered amino acids are at the start or end of the chain, you can simply remove them.
  2. It may be beneficial to try to predict a structure of size 13 instead of 26. If the structure has dihedral symmetry with 2 rings of 13, AFM is unlikely to predict the secondary interaction (the one between the rings), so you are more likely to succeed in assembling a single ring.
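
The first tip can be sketched as a small helper that cuts a sequence around disordered stretches. It assumes that disordered ranges are supplied by hand as 1-based inclusive residue ranges, that `start_res` is the 1-based position of a subunit's first residue in the original chain, and that both pieces keep the same chain names; `split_subunit` is a hypothetical helper, and these field semantics should be checked against the CombFold documentation.

```python
def split_subunit(sequence, disordered_ranges, base_name="A", chain_names=("A",)):
    """Split a sequence into ordered subunits, dropping disordered ranges.

    disordered_ranges: iterable of (start, end), 1-based and inclusive.
    Returns a dict shaped like the subunit JSON used in this thread.
    """
    keep = [True] * len(sequence)
    for start, end in disordered_ranges:
        for i in range(max(start - 1, 0), min(end, len(sequence))):
            keep[i] = False

    subunits = {}
    seg_start = None
    # The trailing False is a sentinel that closes the last open segment.
    for i, kept in enumerate(keep + [False]):
        if kept and seg_start is None:
            seg_start = i
        elif not kept and seg_start is not None:
            name = f"{base_name}{len(subunits)}"
            subunits[name] = {
                "name": name,
                "chain_names": list(chain_names),
                "start_res": seg_start + 1,
                "sequence": sequence[seg_start:i],
            }
            seg_start = None
    return subunits
```

With a 600-residue chain and residues 301-399 marked as disordered, this yields A0 on 1-300 and A1 on 400-600, matching the example in the tip. A range at either end of the chain simply trims it and leaves a single subunit.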
genepearl commented 4 months ago

Hi,

I appreciate the guidance you've provided. Following your advice, I've opted to move away from using forced heteromeric configurations in favor of homomeric configurations. Consequently, there's no longer a need to review the log at {output_path}/_unified_representation/assembly_output/output.log

I'm now focusing on applying your tool to homomeric complexes, specifically with the sequence "FTEEEIKKIRESLKLSVEALEVTPKDFEKALELLEEVAINLMEIFKDDPMKALKIAFKFTNAIAKLYVAHESKDVADAMAIMAEVTKYILEILEKVLEE". I'd like to understand the process for predicting structures that comprise 13 (or 26) copies of this sequence. Could you clarify whether using a single subunit is the optimal strategy? Moreover, how can diversity among the models be maintained when using only one type of subunit? A detailed explanation of these aspects would be greatly appreciated.

ben-shor commented 4 months ago

Well, it seems that this subunit is pretty small, so I actually think a better approach would be to run AFM directly on either 13 or 26 copies of the subunit. It should be pretty accurate and shouldn't require many resources (you can probably even do this in Colab). For homomers with a small subunit, AFM is unlikely to accurately predict the dimer interaction that gives rise to the symmetry when it is given only a subcomplex, so CombFold is less likely to work.
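
As a rough illustration of preparing such a direct run, the sketch below builds a multi-copy query string and reports the total residue count. It assumes a ColabFold-style front end in which chains are separated by `:`; other AFM setups take their input differently, and `homomer_query` is a hypothetical helper.

```python
def homomer_query(sequence, copies):
    """Join `copies` identical chains with ':' (assumed chain separator)."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return ":".join([sequence] * copies)


def total_residues(sequence, copies):
    """Total number of residues the predictor would have to model."""
    return len(sequence) * copies


if __name__ == "__main__":
    seq = ("FTEEEIKKIRESLKLSVEALEVTPKDFEKALELLEEVAINLMEIFKDDPMKALKIAFKFTNAIAKLY"
           "VAHESKDVADAMAIMAEVTKYILEILEKVLEE")
    # Compare the size of the 13-copy and 26-copy predictions.
    for n in (13, 26):
        print(n, "copies:", total_residues(seq, n), "residues")
```

The residue total gives a quick sense of whether the 13-copy or the 26-copy run is more likely to fit in the available memory.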

If you are still looking for an assembly-based approach, you can use tools like SymmDock, which takes a single copy of the subunit structure and the number of copies and finds possible symmetric structures of that size.

Hope this helps!