dina-lab3D / CombFold

Apache License 2.0

How to use more than 26 chains for prediction #4

Closed CalvinKlein96 closed 6 months ago

CalvinKlein96 commented 6 months ago

Hi Ben,

I've been using CombFold for predictions with up to 21 chains and it worked well. Now I tested it on a prediction with 33 chains by naming the chains A1 to A11, B1 to B11, etc. run_on_pdbs.py then raises an assertion error about badly named chains. I also tried renaming the chains to AA, AB, etc., but got the same error.

From what I understand, the problem comes from PDBIO, which expects a single-character chain ID. Is there a workaround to increase the number of chains that can be assembled?

Cheers Calvin

ben-shor commented 6 months ago

Hi Calvin,

First of all, just yesterday I pushed a commit that fixes bugs when running CombFold with more than 31 subunits, so make sure to pull it and recompile the C++ code (make clean && make).

Regarding your issue: you can use both upper- and lower-case letters as well as digits as chain IDs, which allows up to 62 different chains in the model. I would recommend not using the same letter in both upper and lower case within the same subunit, as on some operating systems this can cause issues: we create files called _.pdb, and if the filesystem is not case-sensitive, files for different chains of the same subunit will overwrite each other.
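
If it helps, here is a tiny sanity check you could run over your chain names before starting (just a sketch, not part of CombFold; the subunit-to-chains mapping in the example is made up):

    import string

    # The 62 single-character chain IDs available: A-Z, a-z, 0-9.
    ALLOWED_CHAIN_IDS = set(string.ascii_uppercase + string.ascii_lowercase + string.digits)

    def check_chain_names(subunit_chains):
        # subunit_chains: dict mapping a subunit name to its list of chain IDs.
        for subunit, chains in subunit_chains.items():
            for chain in chains:
                if len(chain) != 1 or chain not in ALLOWED_CHAIN_IDS:
                    print(f"{subunit}: chain ID {chain!r} is not a single letter or digit")
            # On a case-insensitive filesystem, per-chain files whose names differ only by
            # case can overwrite each other, so avoid e.g. 'a' and 'A' in the same subunit.
            lowered = [c.lower() for c in chains]
            if len(set(lowered)) != len(lowered):
                print(f"{subunit}: chain IDs differ only by case and may clash on disk")

    # Hypothetical 33-chain layout: 11 copies each of three subunits.
    check_chain_names({
        "subunitA": list(string.ascii_uppercase[:11]),                # A-K
        "subunitB": list(string.ascii_uppercase[11:22]),              # L-V
        "subunitC": list("WXYZ") + list(string.ascii_lowercase[:7]),  # W-Z, a-g
    })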

Let me know if that works for you.

Best Ben

CalvinKlein96 commented 6 months ago

Hi Ben,

So I pulled the new version of CombFold and compiled it, and also changed my chain labels to lowercase letters. I then ran the updated JSON file and the new CombFold version on the previously predicted folds. However, I now get this error:

File "/ibmm_data/kleinc/software/CombFold/scripts/run_on_pdbs.py", line 402, in <module>
    run_on_pdbs_folder(os.path.abspath(sys.argv[1]), os.path.abspath(sys.argv[2]), os.path.abspath(sys.argv[3]))
  File "/ibmm_data/kleinc/software/CombFold/scripts/run_on_pdbs.py", line 380, in run_on_pdbs_folder
    assembled_files = create_complexes(clusters_path, first_result=0, last_result=max_results_number,
  File "/ibmm_data/kleinc/software/CombFold/scripts/libs/prepare_complex.py", line 142, in create_complexes
    create_transformation_pdb(assembly_path, transforms_strs[i], output_path=output_path, output_cif=output_cif)
  File "/ibmm_data/kleinc/software/CombFold/scripts/libs/prepare_complex.py", line 108, in create_transformation_pdb
    _merge_models(output_path, tmp_pdb_path, output_path, output_cif=output_cif)
  File "/ibmm_data/kleinc/software/CombFold/scripts/libs/prepare_complex.py", line 22, in _merge_models
    model_struct1 = read_model_path(model_path1)
  File "/ibmm_data/kleinc/software/CombFold/scripts/libs/prepare_complex.py", line 17, in read_model_path
    return Bio.PDB.PDBParser(QUIET=True).get_structure("s_pdb", pdb_path)
  File "/ibmm_data/kleinc/software/Vader/localcolabfold/colabfold-conda/lib/python3.10/site-packages/Bio/PDB/PDBParser.py", line 100, in get_structure
    self._parse(lines)
  File "/ibmm_data/kleinc/software/Vader/localcolabfold/colabfold-conda/lib/python3.10/site-packages/Bio/PDB/PDBParser.py", line 123, in _parse
    self.trailer = self._parse_coordinates(coords_trailer)
  File "/ibmm_data/kleinc/software/Vader/localcolabfold/colabfold-conda/lib/python3.10/site-packages/Bio/PDB/PDBParser.py", line 198, in _parse_coordinates
    resseq = int(line[22:26].split()[0])  # sequence identifier
ValueError: invalid literal for int() with base 10: 'E'

It fails after output_clustered_0.pdb has already been created. Does this stem from the previous naming issue?

ben-shor commented 6 months ago

It is unclear to me if this is related to the naming issue... Could you upload the "subunits.json" file and possibly also a zip of /_unified_representation/assembly_output of the failed run? If you have issues with including the entire folder, report how many of the subunits are present in output_clustered_0.pdb.
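
If it is easier than counting by hand, a couple of lines of Biopython can list the chains that made it into the file (just a sketch; adjust the path to wherever the output was written):

    from Bio.PDB import PDBParser

    # Example path; point this at the assembled model from the failed run.
    structure = PDBParser(QUIET=True).get_structure("clustered", "output_clustered_0.pdb")
    chain_ids = sorted({chain.id for model in structure for chain in model})
    print(len(chain_ids), "chains present:", " ".join(chain_ids))
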
CalvinKlein96 commented 6 months ago

Sure, no problem, here you go:

assembly.zip subunits.json

In output_clustered_0.pdb, 4 chains are missing.

ben-shor commented 6 months ago

Thanks! It seems the issue is that your structure is very big, and the PDB format only supports up to 99,999 atoms (the atom serial field is fixed-width). To handle this, the pipeline can output in CIF format; this is already possible in the Colab notebook, but it is not yet accessible via a flag locally. So you currently have 2 options to make it work locally:

  1. Rerun the pipeline, but change scripts/run_on_pdbs.py:326 so that output_cif=True.

  2. Without rerunning the entire pipeline (since the assembly itself worked fine and only the generation of result files failed): change scripts/libs/prepare_complex.py:125 to output_cif=True and then run: scripts/libs/prepare_complex.py <output_folder>/_unified_representation/assembly_output/output_clustered.res 1 10

This will create the results in the folder /_unified_representation/assembly_output. Let me know if that works.
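
Once you have the CIF results, you can sanity-check them with a few lines of Biopython (a sketch only; the filename output_clustered_0.cif is my assumption of what the script will write, so adjust it to the actual output):

    from Bio.PDB.MMCIFParser import MMCIFParser

    # Assumed output name; use whatever prepare_complex.py actually produced.
    structure = MMCIFParser(QUIET=True).get_structure("assembled", "output_clustered_0.cif")
    n_atoms = sum(1 for _ in structure.get_atoms())
    chain_ids = sorted({chain.id for model in structure for chain in model})
    print(f"{len(chain_ids)} chains, {n_atoms} atoms (mmCIF has no fixed-width atom limit)")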

CalvinKlein96 commented 6 months ago

That worked wonderfully for me, thanks!