Inconsistent RMSD values with PDB2SQL

FarzanehParizi commented 2 years ago

Expected Behavior: The RMSD values should be correct when checked against PROFIT.

Current Behavior: For some randomly checked cases, the RMSDs (backbone L-RMSD) differ from the expected values. PDB: 1KLG

with PDB2SQL
- 'top_molpdf_BB_lRMSD_BM': 3.809,
- 'best_BB_lRMSD_BM': 3.179,
with PROFIT
- 'top_molpdf_BB_lRMSD_BM': 1.346,
- 'best_BB_lRMSD_BM': 1.234,

NicoRenaud commented 2 years ago

Hi @FarzanehParizi thanks for reporting it. Would you have the pdb2sql script you used or is it done deep inside pandora ?

DarioMarzella commented 2 years ago

Hi @NicoRenaud , maybe Farzaneh can confirm, but the function in PANDORA that takes care of it (by using pdb2sql) is this one.

NicoRenaud commented 2 years ago

@DarioMarzella Thanks ! @FarzanehParizi is it possible to share the pdbs you are using ?

FarzanehParizi commented 2 years ago

Hi @NicoRenaud, Many thanks for taking care of the issue. Here are the PDBs for the 1KLG case. top_molpdf: 1KLG.BL00050001.pdb best_BB_lRMSD: 1KLG.BL00070001.pdb

https://drive.google.com/file/d/1FGRmohC0rI_SxR_FwfJCThw_6LM40rMp/view?usp=sharing https://drive.google.com/file/d/1AzirU0O46zEIl18CiPwEzNt9UVLQVs_D/view?usp=sharing Here is the target structure from IMGT: https://drive.google.com/file/d/1lONoI6J6OIgc9sCWQ_SA7rBptDgw5WVu/view?usp=sharing

As @DarioMarzella mentioned PANDORA codes are used for the RMSD calculations.

FarzanehParizi commented 2 years ago

Another case: 1LO5

with PDB2SQL
- 'top_molpdf_BB_lRMSD_BM': 1.87,
- 'best_BB_lRMSD_BM': 1.864,
with PROFIT
- 'top_molpdf_BB_lRMSD_BM': 0.829,
- 'best_BB_lRMSD_BM': 0.729,

top_molpdf: 1LO5.BL00160001.pdb
best_rmsd (profit): 1LO5.BL00130001.pdb
best_rmsd (pdb2sql): 1LO5.BL00070001.pdb

https://drive.google.com/file/d/1Wu6Nem3r_AINbdDLrXuTIrIktgR5ufsE/view?usp=sharing https://drive.google.com/file/d/1dU_vpVEbb25pxr3Q2Yt0Wopdb6TY1dXi/view?usp=sharing https://drive.google.com/file/d/1UDBLD8GBMnAJLqckEfW4zVNA0df3-exw/view?usp=sharing

target_structure: https://drive.google.com/file/d/1qTgfPmchzChBgbT9PrfAE_sq-P5j6t0e/view?usp=sharing

NicoRenaud commented 2 years ago

Thanks @FarzanehParizi! I'll try to take a look at it this week, but it is a busy week. When we started pdb2sql we had a benchmark set to compare irmsd values with PROFIT. I think we should use and further extend that benchmark so that we can periodically check irmsd values. I'll try to dig that up and upload it to a separate repo.

DarioMarzella commented 2 years ago

Related to issue #99 (single PDBs are not reported there though)

NicoRenaud commented 2 years ago

@FarzanehParizi it seems that there are 3 chains in the PDBs. At the moment pdb2sql can't handle that (we have a PR in progress to fix that). Which chains are you using to compute the irmsd and lrmsd?

DarioMarzella commented 2 years ago

The two chains get cut and merged here using Biopython, and they are written to a "_decoy" and a "_ref" file. @FarzanehParizi can you please provide those two files and compare them with the chopped ones provided to PROFIT? We use them to calculate the L-RMSD only, no i-RMSD.

FarzanehParizi commented 2 years ago

The folders containing the "_ref" and "_decoy" files Dario mentioned are as follows. 1LO5: https://drive.google.com/drive/folders/1260d2NnC-EPaxnsZIDauM_6_9dtXDC0e?usp=sharing 1KLG: https://drive.google.com/drive/folders/1s9RSMvHFHbgaq33-Q1KnCdZYaHYrw5fv?usp=sharing

NicoRenaud commented 2 years ago

great thanks ! I'll take a look at it this week

NicoRenaud commented 2 years ago

I've uploaded the use cases here and there . You should be able to pull pdb2sql_benchmark and execute the scripts in these directories.

I'm unable to replicate the large values of the l-RMSD that you are reporting at the start of this issue. The value of the l-RMSD obtained for BL00130001 is actually identical to the one you obtained with PROFIT (0.729). The other values are slightly different but not that much either.

Could you try to run the script and see if you are still seeing large differences between PROFIT and pdb2sql ?

FarzanehParizi commented 2 years ago

Thanks @NicoRenaud, for checking the cases. I need to clarify that these are the Ligand-RMSD values (Superimposing chain M and calculating chain P RMSD), not Interface-RMSD. Is the value you reported which is similar to PROFIT an L-RMSD or I-RMSD value?
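To make the L-RMSD definition above concrete, here is a minimal NumPy sketch, not the PANDORA or pdb2sql implementation. It assumes the backbone coordinates of chain M (the superposition set) and chain P (the ligand) have already been extracted as matched N x 3 arrays in the same atom order for decoy and reference. The function names `kabsch` and `ligand_rmsd` are made up for this illustration.

```python
import numpy as np


def kabsch(mobile, target):
    """Return rotation R and translation t so that mobile @ R.T + t best fits target."""
    mob_center = mobile.mean(axis=0)
    tgt_center = target.mean(axis=0)
    # Covariance matrix between the two centered coordinate sets
    H = (mobile - mob_center).T @ (target - tgt_center)
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_center - R @ mob_center
    return R, t


def ligand_rmsd(decoy_rec, decoy_lig, ref_rec, ref_lig):
    """Superimpose on the receptor (chain M) atoms, then measure RMSD on the ligand (chain P) atoms."""
    R, t = kabsch(decoy_rec, ref_rec)
    moved_lig = decoy_lig @ R.T + t
    return float(np.sqrt(((moved_lig - ref_lig) ** 2).sum(axis=1).mean()))
```

The point of the sketch is that the fit and the measurement use different atom sets. If a tool fits on the peptide, or on all atoms, the number it reports is a different quantity even though it is also called an RMSD.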

NicoRenaud commented 2 years ago

Oh, I see the confusion with the capitalization of L and i :) It is the L-RMSD that I checked. However, I didn't specify which chain to superimpose, and I actually don't know how pdb2sql decides that ... I'll double check and try to make sure that I superimpose the M chains.

DarioMarzella commented 2 years ago

@NicoRenaud Would it help if we give you a full output folder, including the Pandora objects, so you can directly use the calc_LRMSD function so you don't have to superpose, renumber etc? Or do you prefer testing it separately in pdb2sql only?

NicoRenaud commented 2 years ago

@DarioMarzella Thanks that would help I think :) I didn't do any renumbering or anything fancy in pdb2sql atm. So it might be that this also changes the result. So yes please :) If you can directly upload all the files in the pdb2sql_benchmark repo in the corresponding use cases that's great :)

FarzanehParizi commented 2 years ago

@NicoRenaud I have committed the files to my local copy of the pdb2sql_benchmark repo, but it seems I do not have access rights to push to the repository.

NicoRenaud commented 2 years ago

I thought you would have access to it as a member of DeepRank ... apparently not. I just sent you an invite so once you accept it you should be able to push :)

FarzanehParizi commented 2 years ago

@NicoRenaud I have added a python script (_rmsd_calcpandora.py) so that you can replicate the results we get with PANDORA functions. The usage of the script is written in the file. (Sorry for the messy code as I just copied some part of our previously written code)

NicoRenaud commented 2 years ago

Thanks! I'll take a look asap

NicoRenaud commented 2 years ago

I just noticed that you are using compute_lrmsd_pdb2sql in PANDORA, whereas I was testing compute_lrmsd_fast. I would advise using compute_lrmsd_fast as it's faster and I have tested it more. Plus, with all the checks in place now, it should be OK.

However, the two methods give different results on your test case and I don't know why. I'll investigate, but in the meantime, if you could try swapping the method and see if that helps, that would be great.
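For reference, a sketch of how the two methods could be compared side by side on one decoy/reference pair. It assumes pdb2sql is installed and that `StructureSimilarity` exposes `compute_lrmsd_fast` and `compute_lrmsd_pdb2sql` with a `method="svd"` argument, as in the pdb2sql documentation. The helper names `values_agree` and `compare_lrmsd_methods` and the tolerance are made up for this illustration.

```python
def values_agree(a, b, tol=1e-3):
    """True if two RMSD values differ by no more than tol (in Angstrom)."""
    return abs(a - b) <= tol


def compare_lrmsd_methods(decoy_pdb, ref_pdb, tol=1e-3):
    """Run both pdb2sql L-RMSD methods on the same pair of files and report agreement."""
    # Imported lazily so values_agree can be used without pdb2sql installed
    from pdb2sql.StructureSimilarity import StructureSimilarity

    sim = StructureSimilarity(decoy_pdb, ref_pdb)
    fast = sim.compute_lrmsd_fast(method="svd")
    slow = sim.compute_lrmsd_pdb2sql(method="svd")
    return fast, slow, values_agree(fast, slow, tol)
```

Calling `compare_lrmsd_methods` on the "_decoy" and "_ref" files of a case would show right away whether the two code paths disagree for that case.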

DarioMarzella commented 2 years ago

The reason why we used compute_lrmsd_pdb2sql had to do with the lzone file. I cannot remember now if it was because pdb2sql would not take the type of lzone file we were giving it or something similar. We even have a function to calculate the lzone (get_Gdomain_lzone) but in the end we opted for just removing the extra domain (with remove_C_like_domain) so compute_lrmsd_pdb2sql could not get the lzone file wrong.

NicoRenaud commented 2 years ago

I found and fixed one issue in the way we ensure that we have the same atoms in compute_lrmsd_pdb2sql. Now compute_lrmsd_pdb2sql and compute_lrmsd_fast return the same values, which is better. Could you fetch and check out the fix_lrmsd branch of pdb2sql and see if that solves your problem?

FarzanehParizi commented 2 years ago

@NicoRenaud After cloning the fix_lrmsd branch, I still get the same RMSD values as before with compute_lrmsd_pdb2sql.

NicoRenaud commented 2 years ago

hmmm that's strange. Are you sure you are using the local install of pdb2sql ?
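A quick way to check which copy of a package the active environment would pick up is to ask Python where it would load the module from. This is a generic standard-library sketch, and `module_location` is a made-up helper name.

```python
import importlib.util


def module_location(name):
    """Return the file Python would load the top-level module `name` from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # Should point into the local clone if pdb2sql was installed from the fix_lrmsd checkout
    print(module_location("pdb2sql"))
```

If the printed path points into site-packages of another environment rather than the local clone, the old version is still the one being used.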

FarzanehParizi commented 2 years ago

Sorry, I was busy with meetings and I realized I forgot to activate the new environment I had with the new fix_lrmsd branch. I ran into an error in PANDORA; I will fix it and report the RMSD again.

FarzanehParizi commented 2 years ago

With the new fix_lrmsd branch of pdb2sql, there is now an error message saying that ref and decoy do not have the same residues. Previously no such error was raised, so we thought the two files had identical numbering and residues, and we retrieved an RMSD value that we considered to be right. I have found where our models become inconsistent: it happens after we remove the C-like domain and merge the two chains. I am currently working on a fix in PANDORA that trims the models to remove the inconsistencies (usually only one residue causes this problem).
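A minimal sketch of how such a residue mismatch could be located before the files reach pdb2sql. This is not the PANDORA fix itself. It assumes standard fixed-column PDB ATOM records, and the helper names `residue_keys` and `residue_mismatches` are made up for this illustration.

```python
def residue_keys(pdb_lines):
    """Collect (chain, residue number, residue name) for every ATOM record."""
    keys = set()
    for line in pdb_lines:
        if line.startswith("ATOM"):
            chain = line[21]                # column 22: chain identifier
            resseq = int(line[22:26])       # columns 23-26: residue sequence number
            resname = line[17:20].strip()   # columns 18-20: residue name
            keys.add((chain, resseq, resname))
    return keys


def residue_mismatches(ref_lines, decoy_lines):
    """Return residues present only in the reference and only in the decoy."""
    ref = residue_keys(ref_lines)
    decoy = residue_keys(decoy_lines)
    return sorted(ref - decoy), sorted(decoy - ref)
```

Reading the "_ref" and "_decoy" files with `open(path).readlines()` and passing both to `residue_mismatches` would list the residues that need trimming on either side.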

X-lab-3D / PANDORA

Inconsistent RMSD values with PDB2SQL #166