AssertionError: missing rep subunits for {subunits}

mpm896 commented 1 year ago

Hello,

I've been using CombFold to try to model a large complex. I previously ran all the subunits through AlphaPulldown (AlphaFold-Multimer) to predict binding partners. I'm starting from "Stage 4 - Combinatorial Assembly", so I've assembled a subunits.json for all the predictions with good inter-PAE scores (instead of pTM+ipTM). The first step, "Searching for subunits in supplied PDB files", seems to be OK, but the next step, "Extracting representative subunits", fails with the error AssertionError: missing rep subunits for {'TGGT1_284620_1000-2000', 'TGGT1_284620_1801-2333', 'TgCGP_3500-4000'}. These are just 3 of 48 subunits, so I thought maybe something was messed up in subunits.json, but these 3 subunits seem OK. I'm attaching my subunits.json here (as a .txt) and would really appreciate any advice, thank you!

subunits.txt

Matt

mpm896 commented 1 year ago

I also tried this on the Google Colab notebook and got the same error

ben-shor commented 1 year ago

Hi,

This error usually means that for some of the subunits defined in your subunits.json, the script was not able to find a structure in any of the models provided (in your case, the 3 subunits {'TGGT1_284620_1000-2000', 'TGGT1_284620_1801-2333', 'TgCGP_3500-4000'}). So I would usually ask you to ensure that there is at least one structure model in the pdbs folder containing each of these subunits in full.
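A rough way to run this check by hand is sketched below. It is not CombFold's actual implementation: it assumes you have already extracted each model chain's amino-acid sequence as a plain string, the function name and toy sequences are hypothetical, and it counts a subunit as found only if its full sequence occurs contiguously in some chain.

```python
def find_missing_rep_subunits(subunit_seqs, model_chains):
    """Return names of subunits whose full sequence appears in no model chain.

    subunit_seqs: dict mapping subunit name -> amino-acid sequence (str)
    model_chains: iterable of chain sequences (str) taken from the models
    """
    chains = list(model_chains)
    return {
        name
        for name, seq in subunit_seqs.items()
        if not any(seq in chain for chain in chains)
    }


# Toy data: subunit "B" is only partially present in the models.
subunits = {"A": "MKTAYIAK", "B": "QRQISFVKSH"}
chains = ["GSMKTAYIAKQRQ", "QRQISFV"]
missing = find_missing_rep_subunits(subunits, chains)
print(missing)
```

Any name returned here would need at least one model in which the whole subunit is present as (part of) a single chain.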

However, in your case, I can see that you have defined overlapping subunits in your subunits.json; for example, TGGT1_284620_1000-2000 has 200 overlapping amino acids with TGGT1_284620_1801-2333. I believe this is what causes the issue. No overlaps are allowed between subunits in subunits.json. I would recommend removing the subunits that are derived from larger subunits (such as TGGT1_209200_1-500, TGGT1_209200_200-753, TGGT1_306350_1-400, TgAKMT_250-550, TgCGP_4001-4500, TgFRM1_4401-5009, TgICAP16_1-800, TgICAP16_500-1323, TgPCR4_200-400, TgPCR7_1-300, TgPCR7_1000-1404) and redividing the TgPCR2 and TgPCR3 subunits so they won't have any overlaps. I may have missed some other overlapping subunits, so after doing this, it is best to make sure that you don't have any other overlaps.

Two things to note when defining the subunits:

1. Even if some of your structure models have only structures for the "derived subunits", the code will be able to use them during the assembly, so it is recommended to set the subunits as large as you can for better performance.
2. That said, each subunit should be modeled as a whole in at least one structure model.

I would also add an assertion in the code to make sure that there are no overlaps between subunits to prevent these mixups.
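Until such an assertion exists, a check along these lines can be run on the file beforehand. This is only a sketch, not code from CombFold: it assumes each subunits.json entry carries `chain_names` (list of chain IDs), `start_res` (1-based first residue) and `sequence`, and the function name is hypothetical. Adjust the keys if your file differs.

```python
from collections import defaultdict


def find_overlaps(subunits):
    """Return (chain, name_a, name_b) for every pair of overlapping subunits.

    Assumes each entry has 'chain_names', 'start_res' and 'sequence' keys.
    """
    by_chain = defaultdict(list)
    for name, info in subunits.items():
        start = info["start_res"]
        end = start + len(info["sequence"]) - 1  # inclusive last residue
        for chain in info["chain_names"]:
            by_chain[chain].append((start, end, name))

    overlaps = []
    for chain, ranges in by_chain.items():
        ranges.sort()
        for i, (s1, e1, n1) in enumerate(ranges):
            for s2, e2, n2 in ranges[i + 1:]:
                if s2 > e1:
                    break  # sorted by start, so no later range can overlap
                overlaps.append((chain, n1, n2))
    return overlaps


# Placeholder sequences with the residue ranges discussed in this thread.
subunits = {
    "X_1000-2000": {"chain_names": ["A"], "start_res": 1000, "sequence": "A" * 1001},
    "X_1801-2333": {"chain_names": ["A"], "start_res": 1801, "sequence": "A" * 533},
    "Y_1-300": {"chain_names": ["B"], "start_res": 1, "sequence": "A" * 300},
}
overlaps = find_overlaps(subunits)
print(overlaps)
```

To run it on a real file, load the file with `json.load` and pass the resulting dict to `find_overlaps`; an empty list means no two subunits on the same chain share residues.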

Let me know if it worked out for you.

Ben

mpm896 commented 1 year ago

Hi Ben,

Thanks so much for this advice. I divided the subunits with overlaps because some of the proteins are quite large and, when assembling multimers, required too much GPU memory for what we have. To account for the overlaps, should I just define the subunit in subunits.json as the beginning and end of that overlap? For example, for TGGT1_284620_1000-2000 and TGGT1_284620_1801-2333 just define it as 1 subunit with the sequence from 1000-2333? Because then I have pdbs with 1000-2000 and with 1801-2333 (I guess this goes along with the first of your 2 points, I just want to clarify). For your second point, with each subunit modeled as a whole, is a monomer model ok for this? If not I can attempt to run some additional multimer models. Thanks!

Matt

ben-shor commented 1 year ago

Hi,

It is completely fine and expected to divide large chains into smaller subunits and generate models with overlaps.

I will start by clarifying that when I said "each subunit should be modeled as a whole", I meant that one of the chains in one of the AFM models should be the complete subunit (or contain the complete subunit sequence). It should also work with a model that is just a monomer of the subunit, but I have never tried it, so issues I am not aware of may arise. However, it does mean that having models of TGGT1_284620_1000-1800 & TGGT1_284620_1801-2333 will not suffice for defining a subunit TGGT1_284620_1000-2333.

So, you can define the two subunits as TGGT1_284620_1000-1800 & TGGT1_284620_1801-2333. In this case, models that contain as one of their chains a structure for TGGT1_284620_1000-2000 will be used to find transformations for both of those subunits during assembly (so no "data" is lost). Of course, you can alternatively define them as TGGT1_284620_1000-2000 & TGGT1_284620_2001-2333, and then models with TGGT1_284620_1801-2333 will be used to find transformations for both subunits.
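The two ways of dividing an overlapping pair can be written out as a small helper. This is an illustrative sketch only (the function and its `keep_whole` parameter are hypothetical, not part of CombFold), and it assumes inclusive residue ranges where the first range starts before the second and the second ends after the first.

```python
def split_overlap(first, second, keep_whole="second"):
    """Trim one of two overlapping (start, end) residue ranges (inclusive).

    keep_whole="second" shortens the first range so it ends right before
    the second begins; keep_whole="first" moves the start of the second
    range to right after the first ends.
    """
    (s1, e1), (s2, e2) = first, second
    if s2 > e1:
        return first, second  # already disjoint
    if keep_whole == "second":
        return (s1, s2 - 1), (s2, e2)
    if keep_whole == "first":
        return (s1, e1), (e1 + 1, e2)
    raise ValueError("keep_whole must be 'first' or 'second'")


option_a = split_overlap((1000, 2000), (1801, 2333), keep_whole="second")
option_b = split_overlap((1000, 2000), (1801, 2333), keep_whole="first")
print(option_a)  # ((1000, 1800), (1801, 2333))
print(option_b)  # ((1000, 2000), (2001, 2333))
```

In both options, each resulting subunit is still fully contained in one of the existing models (1000-2000 or 1801-2333), which is what the "modeled as a whole" requirement asks for.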

If you also have a model with a chain of TGGT1_284620_1000-2333 you can also use it as a subunit, but it is not mandatory, and everything should work fine with either of the options to divide it.

Let me know if something is unclear/doesn't work.

Ben

mpm896 commented 1 year ago

Hi Ben,

I was able to fix the subunits file. One of the subunits gave an error because I mislabeled the amino acid indexes, and I fixed the overlaps in the rest. It ran past the extraction of representative subunits, but in the end I got:

--- Finished building unified representation
--- Running combinatorial assembly algorithm, may take a while
--- Finished combinatorial assembly, writing output models
Could not assemble, exiting

I'm not sure why it just couldn't assemble. Let me know if there's anything I can provide to help solve this, thanks!

Matt

ben-shor commented 1 year ago

It is possible that no combination of transformations was found that can assemble a complex without excessive clashes. In this case, the best solution is to run more AlphaFold jobs, which will result in more transformations.

However, there may be an issue with the algorithm, so I can try to look at the logs and see if something seems wrong. It will be most helpful if you could send me the script's output (the one you got those prints from) + the file: {output_path}/_unified_representation/assembly_output/output.log

mpm896 commented 1 year ago

I've attached output.log, thanks! output.log

ben-shor commented 1 year ago

It seems the issue is that the subunits TgPCR4_O & TgPCR5_P only have models with each other, but not with other subunits, so there are not enough transformations to assemble the complex.

It could be because those models really are missing, and in that case, you should run both of them in models with other subunits (either all other subunits or a subset of subunits if you have prior knowledge of interactions).
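One way to spot this situation before running the assembly is to check whether all subunits are linked through shared models. The sketch below is not how CombFold diagnoses it; it assumes you can list, for each model, the subunit names it contains (the `models` input and function name are hypothetical). More than one group means no model links the groups, so they cannot be placed relative to each other. A single group is necessary but does not guarantee a successful assembly.

```python
from collections import defaultdict


def connected_groups(models):
    """Group subunits that are linked through shared models.

    models: iterable of collections, each listing the subunit names that
    appear together in one predicted structure.
    """
    neighbours = defaultdict(set)
    for members in models:
        members = list(members)
        for name in members:
            neighbours[name].update(members)

    groups, seen = [], set()
    for start in neighbours:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # depth-first walk over co-occurrence links
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups


# Hypothetical pairwise models; PCR4/PCR5 only ever appear with each other.
models = [("A", "B"), ("B", "C"), ("TgPCR4", "TgPCR5")]
groups = connected_groups(models)
print(groups)
```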

If you already have those kinds of models, it will be helpful if you can also attach the entire log outputted from the script.

mpm896 commented 1 year ago

That makes sense. I used AlphaPulldown to predict interactions between pairs of subunits but never ran predictions of those complexes together with additional subunits. PCR4 and PCR5 were only predicted to fold well with each other, but your suggestion to use this complex as input with others is a good one. The log file I attached is the full file. I'll try that out and run it again! Thanks so much for your advice.

ben-shor commented 1 year ago

Great! Just 2 side notes:

1. For best results, it is recommended to include all generated models as input to the CombFold assembly, not only those that were predicted well together. The algorithm scores and prioritizes transformations based on AlphaFold scores, but sometimes even low-scoring models can be correct and will be used if they fit the complete complex. The main benefit of using only a subset of models is that the assembly takes less time, so if that is not an issue I would recommend against it.
2. The output.log you attached is different from the output printed when running the script. However, it is not really important anymore, as it seems we figured it out without it.

Let me know if you still have issues.

dina-lab3D / CombFold

AssertionError: missing rep subunits for {subunits} #1