Open russell-taylor opened 3 years ago
Repository https://github.com/ReliaSolve/suitename_regression has the initial commit for this that can pull down all mmCIF and validation XML records and extract the RNAsuiteness value from those that have them.
See https://github.com/ReliaSolve/Molprobity2/issues/132 for info on how to get the info from each source.
(done) Must handle bogus unit cells on the input.
Dorothee taught me how to fix that in my code by inserting the following snippet between getting the model and interpreting the model:
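from cctbx.maptbx.box import shift_and_box_model  # needed once, at the top of the script
# Fix up bogus unit cell when it occurs by checking crystal symmetry.
cs = model.crystal_symmetry()
if (cs is None) or (cs.unit_cell() is None):
  # Box the model so that it gets a valid unit cell.
  model = shift_and_box_model(model = model)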
The exception in mp_geo comes inside the call to pdb_interpretation.process(), and at that point there is not yet a model to adjust.
Asked Christopher for advice on how to proceed and will try replacing the process() call with a series of calls if I don't hear back.
Christopher pointed out that dangle and mp_geo are the only ways that existed to get these numbers, that he was pretty sure that the PDB was using mp_geo, and that there were changes made to CCTBX since March 2020 that have made it more restrictive, so that they are probably using an older version of mp_geo to generate their suites.
Attempting to build phenix from source using a late-2019 cctbx_project did not result in a good build (and it pulled the latest version of cctbx_project, so we may be getting a Franken-build). Will try looking for installers for old versions of Phenix.
Installing phenix-1.17.1-3660 (10/16/2019) did not get rid of the unit cell errors. This was before the version that did extra cleanup testing, but may not have been as early as the version the PDB is using (we don't know which version they are using).
Another approach would be to clean up the model and write a new CIF file with the correct unit cell and then feed that model into mp_geo. We can do this using iotbx.pdb.box_around_molecule on all input files, which slows us down even more but appears to solve the bogus unit cell problem.
The current script displays the two values, one to three significant digits. There is often a large difference between the two, so we'll need to take a closer look before we automate the comparison.
The zero scores we were seeing were the result of mp_geo failing with unit cell issues.
Many of the others round to the same value with two significant digits, but others are off by up to 0.07 when rounded (2awq) or 0.15 (2awe).
The printf we're using for rounding does Banker's rounding (round half to even), so it sometimes gets the wrong answer when the digit being dropped is exactly 0.5. We'll want to switch to a bc -l call that compares the two numbers directly and checks whether the magnitude of their difference is larger than 0.5.
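To make the rounding issue concrete, here is a minimal Python sketch (the pipeline itself is a shell script, so this is an illustration rather than the code in use). It contrasts round-half-even with round-half-up and then compares two scores by the magnitude of their difference instead of by their rounded text. The helper names and the tolerance value in the usage note are assumptions, not values taken from compare.sh.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_decimal(text, places, mode):
    """Round a decimal string to `places` digits using the given rounding mode."""
    quantum = Decimal(1).scaleb(-places)  # e.g. places=2 gives 0.01
    return Decimal(text).quantize(quantum, rounding=mode)


def scores_differ(a, b, tolerance):
    """Return True when |a - b| exceeds `tolerance` (all given as decimal strings)."""
    return abs(Decimal(a) - Decimal(b)) > Decimal(tolerance)


# Banker's rounding sends an exact half to the even neighbour.
half_even = round_decimal("0.125", 2, ROUND_HALF_EVEN)  # Decimal("0.12")
half_up = round_decimal("0.125", 2, ROUND_HALF_UP)      # Decimal("0.13")
```

With a hypothetical tolerance of half a unit in the second decimal place, `scores_differ("0.08", "0.229", "0.005")` flags the 6do9 pair from this thread as different, while two values such as 0.88 and 0.884 are treated as matching.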
@todo: Different kind of error:
Traceback (most recent call last):
File "/home/taylorr/rlab/phenix/build/../modules/cctbx_project/mmtbx/command_line/mp_geo.py", line 7, in <module>
mp_geo.run(sys.argv[1:])
File "/home/taylorr/rlab/phenix/modules/cctbx_project/mmtbx/validation/molprobity/mp_geo.py", line 79, in run
pdb_file_def="mp_geo.pdb")
File "/home/taylorr/rlab/phenix/modules/cctbx_project/iotbx/phil.py", line 158, in __init__
custom_processor=self)
File "/home/taylorr/rlab/phenix/modules/cctbx_project/libtbx/phil/command_line.py", line 165, in process_and_fetch
sources = self.process(args=args, custom_processor=custom_processor)
File "/home/taylorr/rlab/phenix/modules/cctbx_project/libtbx/phil/command_line.py", line 154, in process
return self.process_args(args=args, custom_processor=custom_processor)
File "/home/taylorr/rlab/phenix/modules/cctbx_project/libtbx/phil/command_line.py", line 137, in process_args
libtbx.phil.parse(file_name=arg) # exception expected
File "/home/taylorr/rlab/phenix/modules/cctbx_project/libtbx/phil/__init__.py", line 2181, in parse
primary_parent_scope=result)
File "/home/taylorr/rlab/phenix/modules/cctbx_project/libtbx/phil/parser.py", line 131, in collect_objects
word_iterator.pop_unquoted().assert_expected("=")
File "/home/taylorr/rlab/phenix/modules/cctbx_project/libtbx/phil/tokenizer.py", line 144, in assert_expected
O.raise_syntax_error('expected "%s", found ' % value)
File "/home/taylorr/rlab/phenix/modules/cctbx_project/libtbx/phil/tokenizer.py", line 140, in raise_syntax_error
'Syntax error: %s"%s"%s' % (message, O.value, O.where_str()))
RuntimeError: Syntax error: expected "=", found "_entry.id" (file "./tmp.cif", line 3)
Error running mp_geo on 1jo7 (58 failures out of 214)
When I run iotbx.cif_as_pdb on this file, it reports an error that "Invalid unit cell parameters are given", so maybe the root cause of this is the same and the fix to the unit cell will repair this problem as well.
@todo: Look into numeric differences found in files that run properly.
There is not an alternate field from the SuiteName output that has a value that matches the one from the PDB.
Given the close numbers, it seems plausible that for many of the files the PDB is using a different number of suites and thus getting a different average score. However, some are very different (0.08 vs. 0.229 for 6do9).
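As a small arithmetic illustration of how the suite count alone can move the average, the sketch below averages a made-up list of per-suite suiteness values twice: once over every suite and once skipping suites that scored zero. The values and the "skip zero-scoring suites" rule are hypothetical; we do not know which suites the PDB actually includes.

```python
def mean_suiteness(values, skip_zero=False):
    """Average per-suite suiteness, optionally ignoring suites that scored 0."""
    kept = [v for v in values if not (skip_zero and v == 0.0)]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)


# Hypothetical per-suite values, not taken from any real structure.
suites = [0.9, 0.8, 0.0, 0.7]
average_all = mean_suiteness(suites)
average_nonzero = mean_suiteness(suites, skip_zero=True)
```

Here the same four suites average to roughly 0.6 when all are counted and roughly 0.8 when the zero is dropped, which is the size of discrepancy we see on some files.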
@todo: Different error: Sorry: number of groups of duplicate atom labels: 22331 in 6n1d
We can run iotbx.cif_as_pdb on this file and it produces a PDB file. We get the same error as above when we run mmtbx.mp_geo on the PDB file. Running iotbx.pdb.box_around_molecule does not complain, but it also does not fix the problem.
Christopher has a different script at https://github.com/rlabduke/rna_precis/blob/main/rna_scripts/rna_suite_parameters.py that may be more robust. It is also possible that the PDB is still using dangle to generate its suites, so we can give that a try as well.
@todo: When I convert 6gc0 to a PDB file using iotbx.cif_as_pdb and then run it through a version of Christopher's program that has Dorothee's fix added to it, it does produce suites. It takes a LONG TIME to run. However, cif_as_pdb fails when the unit cell is not correct, so this is not a path we can take to get there.
There is a dangle.jar file in Molprobity/lib that seems to contain a list of Java classes, one of which is dangle, so it looks like it was a Java program. Running dangle does not get matching values for many files; almost all of them report no suites and so get a value of 0. Some rare cases had nonzero values. It doesn't look like dangle is the solution.
@todo: Another kind of Sorry: "Conflicting angle restraints"
The PDB validation report lists the version of Molprobity being used as 4.02b-467 and lists its properties as "num_bonds_rmsz,angle-outlier,RNAsuite,RNAsuiteness,RestypesNotcheckedForBondAngleGeometry,num_angles_rmsz,bonds_rmsz,RNAscore,angles_rmsz,bond-outlier,clashscore,num-H-reduce,clash". There is a molbprobity_4.2 branch that can be checked out; its latest commit was in 2015. That version runs the following command to get the output:
exec("mmtbx.mp_geo rna_backbone=True pdb=$infile | phenix.suitename -report -pointIDfields 7 -altIDfield 6 > $outfile");
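For reference, the same two-stage pipeline can be sketched in Python so that a failure in mmtbx.mp_geo is reported separately instead of silently producing an empty suite list (which is how the zero scores arose earlier in this thread). The command names and flags are copied from the exec line above; the function names are hypothetical, and this assumes both programs are on the PATH.

```python
import subprocess


def build_commands(infile):
    """Return the mp_geo and suitename argument lists used by Molprobity 4.2."""
    mp_geo = ["mmtbx.mp_geo", "rna_backbone=True", "pdb=" + infile]
    suitename = ["phenix.suitename", "-report",
                 "-pointIDfields", "7", "-altIDfield", "6"]
    return mp_geo, suitename


def run_pipeline(infile, outfile):
    """Pipe mp_geo into suitename and return both exit codes."""
    mp_geo, suitename = build_commands(infile)
    with open(outfile, "w") as out:
        producer = subprocess.Popen(mp_geo, stdout=subprocess.PIPE)
        consumer = subprocess.Popen(suitename, stdin=producer.stdout, stdout=out)
        # Close our copy of the pipe so suitename sees end-of-file
        # as soon as mp_geo exits.
        producer.stdout.close()
        consumer_rc = consumer.wait()
        producer_rc = producer.wait()
    return producer_rc, consumer_rc
```

A caller can treat a nonzero first return code as "mp_geo failed" and skip the comparison for that file rather than recording a suiteness of 0.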
The Molprobity 4.2 commit date was Dec 18, 2015. The closest Phenix to that date is 1.10.1-2155 on 2015/10/02. Downloading that version to see if it is able to parse all of the files and gets the same answers.
Installed Phenix 1-10. When running mmtbx.mp_geo rna_backbone=True pdb=1jo7.cif we get an invalid unit cell error. It is definitely running the 1-10 version of the script. When we adjust the compare.sh script to use the iotbx.pdb.box_around_molecule from that version and run it, we get no differences or errors in the first 1000 files and we get the same 2 of 16 differences that we get when running the current version of Phenix. It then went on to behave slightly differently, getting 21 different and 4 failed of 92/4879 as opposed to 18 different and 7 failed for the current version.
@todo: The XML validation report lists mogul_rmsz_numangles as 8, which may match the number of analyzed suites in the PDB table, and mogul_rmsz_numbonds as 9, which may match the number of analyzed nucleotides? Nope: these are 14 and 15 in 2vnu XML but 8 and 10 in its PDB
Trying Ken's myangle.py script to see if we could get it to generate good angles for us, but it is crashing. It is in ~/rlab/cctbx_reliasolve on my home server.
He fixed the crash and I modified it to read directly from the CIF file to avoid another problem with it not reading from standard input. It is getting some failures (presumably due to it not yet handling alternate conformations) and some differences, but some are the same.
It gets around half of the values correct to two significant digits.
@todo: Once the CCTBX version can handle dangle input from standard input, harness it into the existing pipeline and see if it gets the same answers as the C code for the ones that work.
Christopher was able to run 1hhx.pdb through mmtbx.mp_geo without any errors. When I run the PDB file (as opposed to the CIF file), I also get no errors. Okay, it looks like the problem is with CIF files being somehow different than the corresponding PDB file in a way that breaks things.
(nope) Try reading the PDB files (which are not always available?) instead of the CIF files for each molecule and see what happens in that case.
@todo: Running iotbx.cif_as_pdb to convert from CIF to PDB produces a 1hhx.pdb that also works, and seems to produce the same output from mmtbx.mp_geo that we got from the original PDB file. This got no errors through at least the first 2000 files, though we are still seeing differences. (We do get an error on 4qln, on both the cif_as_pdb and the original pdb files.)
Running Phenix 1-10 using the cif_as_pdb approach produces more differing files (and larger differences on the first two files that both versions differ on) than running the current version of Phenix. It also gets more failures than the current version.
Should I try CIF files on Suitename? I haven't yet. (Replying to Russell Taylor's comment of 6/16/2021 1:52:49 PM.)
@Joymaker You can leave it up to me to test these.
@todo: See if the source code for the PDB processes is published, which would be a way for us to find out how they are computing the values in the validation reports.
Putting this on the back burner until we hear back from the PDB and until we've completed porting SuiteName, Probe, and Reduce.
Find a way to check the results, either programmatically or by hand-sampling, against the "RNA backbone" summary numbers and the list of outliers in the text report obtained by clicking on the summary image.
If we download the validation data in XML format from the PDB, then we can look for the tag RNAsuiteness="0.88" (for file 402d; not all files have this entry?). To mirror the .xml.gz files:
rsync -rlpt -v -z --delete --port=33444 --include "*/" --include "*.xml.gz" --exclude "*" rsync.rcsb.org::ftp/validation_reports/ ./validation_reports
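A minimal Python sketch of pulling that value out of one mirrored report follows. It assumes only what is stated above, namely that RNAsuiteness appears as an XML attribute; it does not assume which element carries it, so it checks every element. The function name is hypothetical, and returning None for non-numeric values is a defensive guess rather than documented behavior.

```python
import gzip
import xml.etree.ElementTree as ET


def rna_suiteness(path):
    """Return the first RNAsuiteness attribute in a gzipped XML file, or None."""
    with gzip.open(path, "rb") as handle:
        # Attributes are already available on "start" events, so we can
        # stop reading as soon as the attribute is found.
        for _event, element in ET.iterparse(handle, events=("start",)):
            value = element.get("RNAsuiteness")
            if value is not None:
                try:
                    return float(value)
                except ValueError:
                    return None  # present but not numeric
    return None
```

Files without the attribute return None, which lets the caller skip structures that have no RNA instead of comparing against a missing score.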
If we download the model files in CIF format, we can run on all of them. To mirror the .gz files:
rsync -rlpt -v -z --delete --port=33444 rsync.rcsb.org::ftp_data/structures/divided/mmCIF/ ./mmCIF