crest-lab / crest

CREST - A program for the automated exploration of low-energy molecular chemical space.
https://crest-lab.github.io/crest-docs/
GNU Lesser General Public License v3.0
182 stars 42 forks source link

Cregen eliminates valid conformers as "clashes" #242

Open goxeq opened 7 months ago

goxeq commented 7 months ago

Hello,

While performing conformational search in crest (version 2.12, downloaded form github), it removes all structures as "clashes" even though they are completely fine by visual inspection.

Command which fails: crest struc.xyz --cregen crest_rotamers_2.xyz

Files (you must rename them from .xyz.txt to just .xyz, github did not allow me to upload xyz) crest_rotamers_2.xyz.txt struc.xyz.txt

The output is like this:

Command line input:
 > crest struc.xyz --cregen crest_rotamers_2.xyz

  --cregen : CREGEN standalone usage. Sorting file <crest_rotamers_2.xyz>
 Using only the cregen sorting routine.
 input  file name : crest_rotamers_2.xyz
 output file name : crest_rotamers_2.xyz.sorted
 number of atoms                :   21
 number of points on xyz files  :   4
 RMSD threshold                 :   0.1250
 Bconst threshold               :   0.0100
 population threshold           :   0.0500
 conformer energy window  /kcal :   6.0000
 # fragment in coord            :     1
 number of removed clashes      :     4
 # bonds in reference structure :    19
 number of reliable points      :     0
 reference state Etot :  6.939238432338885E-310
 running RMSDs...
 done.
 number of doubles removed by rot/RMSD         :           0
 total number unique points considered further :           0
       Erel/kcal        Etot weight/tot  conformer     set   degen     origin
T /K                                  :   298.15
E lowest                              :     0.00000
ensemble average energy (kcal)        :    0.000
ensemble entropy (J/mol K, cal/mol K) :    0.000    0.000
ensemble free energy (kcal/mol)       :    0.000
population of lowest in %             :    0.000
 number of unique conformers for further calc            0
 list of relative energies saved as "crest.energies"

 -----------------
 Wall Time Summary
 -----------------
              CREGEN wall time :         0h : 0m : 0s
--------------------
Overall wall time  : 0h : 0m : 0s

 CREST terminated normally.

I would like to note that this happened originally as a part of the whole conformational search, but I just narrowed it down to this particular step. The molecule was optimized with the following constraints, has charge -2, multiplicity 1 and was optimized with --alpb water with xtb with --tight optimization thresholds.

$constrain
force constant=1.0
angle: 2,1,4,90.0
angle: 2,1,6,90.0
angle: 2,1,8,180.0
angle: 4,1,6,180.0
angle: 4,1,8,90.0
angle: 6,1,8,90.0
$end
pprcht commented 7 months ago

Yeah, that's a bit odd. I was able to reproduce it... initially. After repeated tries, which included renaming the ensemble, it started to work, and has been since, without me being able to break it again.

I suspect it has something to do with an uninitialized reference energy. If you look at your output it says 6.939238432338885E-310 for this value, which is something that happens only if there is random data (i.e. the variable is uninitialized) in memory. I'm not 100% sure this is what was causing trouble, but just in case I'll look at it in the code.

goxeq commented 5 months ago

First of all, I think your later attempts to reproduce failed, because you may have inadvertently overwritten the struc.xyz file, since crest overwrites file with that name when it is run.

I looked at the problem more, and noticed, that the "clashes" are not eliminated when you change the reference structure (when I took the first structure from the rotamers file, it went. I tried to actually print variables in discardbroken subroutine by adding this to newcregen.f:612: write (*,*) "__debug:",distok,dissoc,frag,frag0 and here they are:

unchanged_struc: number of removed clashes      :     4
__debug: T T           2           1
__debug: T T           2           1
__debug: T T           2           1
__debug: T T           2           1
other_geometry_struc:no clashes
__debug: T F           2           2
__debug: T F           2           2
__debug: T F           2           2
__debug: T F           2           2

It means, that crest somehow thinks that the "new" structures are dissociated. If I undestand it correctly, once dissoc is set to .true. the structure is removed by the next if statement. I think this is not an easy problem to solve, metal-ligand bonds in general tend to be quite diverse.

I am quite surprised, that these "dissociation" checks are not just switched off when you put --notopo option on the command line (but it has not effect). I can see a better way to check topology changes might be to create a bond order matrix base on Pauling's formula (10.1021/ja01195a024) bond_order = exp((R_single_bond - R)/0.3) and then take difference between two bond matrices to see, if any differene is larger than one.

However, a solution not requiring to change the general procedure, which I suppose, is working well in general, since nobody else complains* about this besides me, would be just to make sure that discardbroken subrouting would ignore the atoms excluded by --notopo option.

(* the problem could be more widespread, since I only noticed it because I got zero structures. But if, by chance, some structures passed those check, it would be a "silent" error discarding correct structures)

Here is my "testing" script:

#!/usr/bin/env python3
from pathlib import Path
import shutil
import subprocess
d1=Path("unchanged_struc")
xyz1=Path("struc.xyz").read_text()

d2=Path("other_geometry_struc")
xyz2_lines=Path("crest_rotamers_2.xyz").read_text().split("\n  21")[0].split("\n")
xyz2_lines[1]="" #remove energy
xyz2="\n".join(xyz2_lines) 

import re

for d,xt in ((d1,xyz1),(d2,xyz2)):
    d.mkdir(exist_ok=True)
    Path(d/"struc1.xyz").write_text(xt)
    shutil.copy("crest_rotamers_2.xyz", d/"crest_rotamers_2.xyz")
    print(f"{d}:",end="")
    p=subprocess.run(["../crest/_build/crest","struc1.xyz","--notopo","--cregen","crest_rotamers_2.xyz","--notopo"],capture_output=True,cwd=d)
    out_tx=p.stdout.decode()
    m=re.search(".*clashes.*",out_tx)
    print(m.group() if m else "no clashes")#out_tx.split("number of")[-1].split("\n")[0])
    for i in re.findall("__debug:.*",out_tx):
        print(i)
    Path(d/"crest.out").write_text(out_tx)
pprcht commented 5 months ago

The topology check is separate from checking for dissociation and already employs a connectivity matrix. With the other reference (first structure from the ensemble) similar issues may still persist and one structure will still be kicked out by the topology. The dissociation check employs an older version of the "topology" based on covalent coordination numbers, which might cause additional issues here. Turning it off with --notopo is a good idea and I will adapt this asap, but what really should be done is use a topology that can more consistently identify the Zn-C bonds. Unfortunately, metal centers often cause this trouble in the topology, and I don't think there is a 100% fail proof way to do that. Turning off the topology is usually the way to go in this case.

goxeq commented 5 months ago

I attempted to fix it by using the mechanism to ignore some atoms what were already there for subRMSD. Could something like this work? It seems to work for my structures. But I cannot program in Fortran, so I may have missed something.

diff --git a/src/newcregen.f90 b/src/newcregen.f90
index b391f3f..6a00599 100644
--- a/src/newcregen.f90
+++ b/src/newcregen.f90
@@ -550,10 +550,19 @@ subroutine discardbroken(ch,env,nat,nall,at,xyz,comments,newnall)
     logical :: distok,distcheck
     real(wp) :: cnorm
     logical :: dissoc
+    logical,allocatable :: incl_atoms(:)
+    integer,allocatable :: incl_atoms_mask(:)

     !--- if we don't wish to include all atoms:
     substruc = (nat.ne.env%rednat .and.  env%subRMSD)

+    if (allocated(env%excludeTOPO))then
+        substruc=substruc .or. any(env%excludeTOPO)
+    endif
+    if (substruc)then
+        incl_atoms=(env%includeRMSD>0).and.(.not.env%excludeTOPO)
+        incl_atoms_mask=merge(1,0,incl_atoms)
+    endif
     !--- settings
     allocate(rcov(94))
     call setrcov(rcov)
@@ -569,9 +578,10 @@ subroutine discardbroken(ch,env,nat,nall,at,xyz,comments,newnall)
     write(ch,'('' # fragment in coord            :'',i6)')frag0
     deallocate(atdum)
     if(substruc)then
-        nat0=env%rednat
+        nat0=count(incl_atoms)
+        write (*,*) "__debug Using these atoms in discardbroken:",incl_atoms
         allocate(c0(3,nat0),at0(nat0),c1(3,nat0),atdum(nat0))
-        call maskedxyz(nat,nat0,cref,c0,at,at0,env%includeRMSD)
+        call maskedxyz(nat,nat0,cref,c0,at,at0,incl_atoms_mask)
     else
         allocate(c0(3,nat),at0(nat),c1(3,nat),atdum(nat))
         c0=cref
@@ -591,7 +601,7 @@ subroutine discardbroken(ch,env,nat,nall,at,xyz,comments,newnall)
        if(.not.substruc)then
            c1(:,:)=xyz(:,:,j)/bohr
        else
-           call maskedxyz(nat,nat0,xyz(:,:,j),c1,at,at0,env%includeRMSD)
+          call maskedxyz(nat,nat0,xyz(:,:,j),c1,at,at0,incl_atoms_mask)
            c1=c1/bohr
        endif
        distok=distcheck(nat0,c1) ! distance check
@@ -607,6 +617,7 @@ subroutine discardbroken(ch,env,nat,nall,at,xyz,comments,newnall)
           endif
        endif

+       write (*,*) "__debug Number of fragments/reference:",frag,"/",frag0
 !      if(dissoc .or. (abs(erj).lt.1.0d-6) .or. &
       if(dissoc .or. &
       &   (cnorm.lt.1.0d-6) .or. (.not.distok)    )then
@@ -619,6 +630,11 @@ subroutine discardbroken(ch,env,nat,nall,at,xyz,comments,newnall)
          orderref(j)=newnall    
        endif
     enddo
+    if(allocated(incl_atoms)) deallocate(incl_atoms_mask)
+    if(allocated(incl_atoms_mask)) deallocate(incl_atoms_mask)
+    if (newnall.eq.0)then
+        error stop "Error: function discardbroken eliminated all structures"
+    endif

     !--- sort the xyz array (only if structures have been discarded)
     if(newnall.lt.nall)then
pprcht commented 5 months ago

The dissociation check should be turned off entirely if --notopo is specified, as you suggested earlier. As of #250 this is done

andreacornia commented 1 month ago

I am facing a similar problem with a metal-organic system: a few good conformers are discarded as "clashes".

More. I tried to run cregen (standalone and with default parameters except adding ---notopo) on an ensemble of unique conformers produced by CREST itself and..... some conformers are discarded as "clashes"! This is meaningless. Isn't CREST using exactly the same filtering procedure?

Even more. Runs with exactly the same parameters can give different results and sometimes no conformers are discarded at all! This is puzzling. I attach the output files of 20 consecutive runs. Some explanation:

"crest_conformersall.xyz" contains the 23 conformers to be sorted; they were obtained by conformational sampling using CREST and starting from "CoLpysorted_preopt.xyz", which is also used as the reference structure in cregen (by the way, why is this needed in combination with --notopo?)

"cregen_fpnew.sh" is the shell file used to launch the 20 identical runs.

You will see that sometimes 3 conformers are discarded as clashes, sometimes they are not.... What's wrong?

cregen.zip

andreacornia commented 1 month ago

This is another example on a much bigger ensemble (1575 conformers). Again, two different outputs are obtained in consecutive runs:

cregen1.out, with 99 unique conformers and 358 removed "clashes"; cregen2.out, with 125 unique conformers and no "clashes".

This example demonstrates that the underlying problem is either "on" or "off", as it affects a very large number of conformers at the same time.

cregen_b.zip

pprcht commented 1 month ago

@andreacornia CREGEN is deterministic. This means if your input is the same, something else is going on that shouldn't. Checking your cregen_b.zip shows what this is: If you look at cregen1.out, right after repeating the command, CREST says it recognized the line: -notopo <x> : ignoring topology on 1 atoms (Co) while this is missing in cregen2.out. Obviously you didn't specify Co in the shell script, but for some (bash-related, maybe?) reason the program thinks so. Consequently, CREGEN then thinks it has topology except for Co in the first calculation, while it ignores all topology in the second. This is what you are observing. I don't understand why the argument parser is trying to interpret seemingly random trailing bits as "Co" or 1, there is obviously nothing there in the command line. An easy "fix" for now would be to move -notopo in front of the --cregen command in your command line. I will investigate if it can be caught in-code.

by the way, why is this needed in combination with --notopo?

Because CREST needs initialization of its environment (system size etc.), which happens formally before parsing the --cregen command.

andreacornia commented 1 month ago

Great! I moved -notopo as you suggested and this solved the problem. Conformers are no longer removed as clashes! Many thanks!