danieleongari / CURATED-COFs

Clean, Uniform and Refined with Automatic Tracking from Experimental Database (CURATED) COFs
MIT License

Fix duplicate structures #40

Closed danieleongari closed 1 year ago

danieleongari commented 1 year ago

In the latest PR some structures were identified as duplicates:

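from ase.io import read

# `couples` holds the pairs of COF IDs flagged as duplicates
for couple in couples:
    atoms1 = read(f"cifs/{couple[0]}.cif")
    atoms2 = read(f"cifs/{couple[1]}.cif")
    # cell parameters (a, b, c, alpha, beta, gamma), rounded to 2 decimals
    cell1 = [round(x, 2) for x in atoms1.cell.cellpar()]
    cell2 = [round(x, 2) for x in atoms2.cell.cellpar()]
    same_formula = atoms1.get_chemical_formula() == atoms2.get_chemical_formula()
    print(couple, same_formula, cell1 == cell2, cell1, cell2)

The output below comes from a run in which cell2 was mistakenly computed from atoms1, so the cell1==cell2 column is always True and both printed cells belong to the first structure of each couple: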

('20690N2', '12000N2') True True [22.56, 22.56, 7.0, 90.0, 90.0, 120.0] [22.56, 22.56, 7.0, 90.0, 90.0, 120.0]
('21810N2', '15040N2') True True [18.55, 18.55, 7.2, 90.0, 90.0, 120.0] [18.55, 18.55, 7.2, 90.0, 90.0, 120.0]
('22102N2', '16140N2') True True [38.99, 38.99, 10.12, 90.0, 90.0, 120.0] [38.99, 38.99, 10.12, 90.0, 90.0, 120.0]
('22107N2', '14020N2') True True [38.64, 38.64, 9.96, 90.0, 90.0, 120.0] [38.64, 38.64, 9.96, 90.0, 90.0, 120.0]
('22109N2', '13100N2') True True [33.92, 32.78, 13.35, 90.0, 90.0, 90.0] [33.92, 32.78, 13.35, 90.0, 90.0, 90.0]
('22260N2', '19413N2') True True [38.97, 38.97, 6.99, 90.0, 90.0, 120.0] [38.97, 38.97, 6.99, 90.0, 90.0, 120.0]
('22261N2', '21600N2') True True [38.97, 38.97, 6.99, 90.0, 90.0, 120.0] [38.97, 38.97, 6.99, 90.0, 90.0, 120.0]
('22301N2', '16271N2') True True [45.82, 45.82, 9.26, 90.0, 90.0, 120.0] [45.82, 45.82, 9.26, 90.0, 90.0, 120.0]
('22310N2', '15040N2') True True [18.83, 18.72, 7.22, 90.0, 90.0, 120.0] [18.83, 18.72, 7.22, 90.0, 90.0, 120.0]
('22330N2', '20320N2') True True [42.33, 42.33, 7.77, 90.0, 90.0, 120.0] [42.33, 42.33, 7.77, 90.0, 90.0, 120.0]
('22351N2', '22300N2') True True [58.45, 59.02, 9.49, 90.0, 90.0, 120.0] [58.45, 59.02, 9.49, 90.0, 90.0, 120.0]
('22410N2', '22181N2') True True [23.11, 23.11, 7.07, 90.0, 90.0, 120.0] [23.11, 23.11, 7.07, 90.0, 90.0, 120.0]
('224510N2', '17070N2') True True [38.9, 38.9, 7.58, 90.0, 90.0, 120.0] [38.9, 38.9, 7.58, 90.0, 90.0, 120.0]
('224511N2', '14073N2') False True [24.56, 24.31, 7.58, 90.0, 90.0, 93.14] [24.56, 24.31, 7.58, 90.0, 90.0, 93.14]
('224515N2', '21340N2') True True [45.14, 44.61, 7.24, 89.97, 89.99, 120.01] [45.14, 44.61, 7.24, 89.97, 89.99, 120.01]
('224516N2', '20320N2') True True [44.91, 44.14, 7.96, 83.4, 96.99, 120.21] [44.91, 44.14, 7.96, 83.4, 96.99, 120.21]
('224520N2', '11020N2') True True [21.87, 21.87, 7.46, 90.0, 90.0, 120.0] [21.87, 21.87, 7.46, 90.0, 90.0, 120.0]
('224521N2', '16330N2') True True [32.0, 32.0, 7.39, 90.0, 90.0, 120.0] [32.0, 32.0, 7.39, 90.0, 90.0, 120.0]
('224523N2', '14000N2') True True [15.28, 15.28, 7.04, 90.0, 90.0, 120.0] [15.28, 15.28, 7.04, 90.0, 90.0, 120.0]
('224525N2', '16480N2') True True [25.62, 25.62, 7.06, 90.0, 90.0, 120.0] [25.62, 25.62, 7.06, 90.0, 90.0, 120.0]
('224529N2', '15253N2') True True [29.64, 29.64, 7.06, 90.0, 90.0, 120.0] [29.64, 29.64, 7.06, 90.0, 90.0, 120.0]
('224533N2', '17191N2') True True [21.13, 21.13, 9.08, 90.0, 90.0, 120.0] [21.13, 21.13, 9.08, 90.0, 90.0, 120.0]
('224534N2', '15162N2') True True [19.67, 19.67, 7.17, 90.0, 90.0, 120.0] [19.67, 19.67, 7.17, 90.0, 90.0, 120.0]
('224535N2', '16230N2') True True [18.5, 18.5, 7.09, 90.0, 90.0, 120.0] [18.5, 18.5, 7.09, 90.0, 90.0, 120.0]
('224536N2', '12000N2') True True [23.11, 23.11, 6.92, 90.0, 90.0, 120.0] [23.11, 23.11, 6.92, 90.0, 90.0, 120.0]
('224537N2', '13150N2') True True [30.47, 30.47, 7.01, 90.0, 90.0, 120.0] [30.47, 30.47, 7.01, 90.0, 90.0, 120.0]
('224538N2', '15050N2') True True [14.97, 14.97, 6.48, 90.0, 90.0, 120.0] [14.97, 14.97, 6.48, 90.0, 90.0, 120.0]
('224539N2', '12001N2') True True [23.09, 23.09, 7.65, 90.0, 90.0, 120.0] [23.09, 23.09, 7.65, 90.0, 90.0, 120.0]
('22454N2', '19360N2') True True [36.11, 36.11, 6.94, 90.0, 90.0, 120.0] [36.11, 36.11, 6.94, 90.0, 90.0, 120.0]
('22455N2', '20610N2') True True [35.75, 35.75, 7.34, 90.0, 90.0, 120.0] [35.75, 35.75, 7.34, 90.0, 90.0, 120.0]
('22456N2', '22451N2') True True [19.65, 19.89, 12.51, 90.0, 90.0, 89.37] [19.65, 19.89, 12.51, 90.0, 90.0, 89.37]
('22459N2', '15071N2') True True [37.62, 37.62, 7.3, 90.0, 90.0, 120.0] [37.62, 37.62, 7.3, 90.0, 90.0, 120.0]
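
Exact equality of rounded values can miss cells that are nearly identical but fall on either side of a rounding boundary. A minimal sketch of a tolerance-based comparison, assuming each cell is available as a six-element list like the ones printed above. The function name `cells_match` and the default tolerances are illustrative assumptions, not part of the CURATED-COFs tooling:

```python
import math

def cells_match(cell1, cell2, len_tol=0.05, ang_tol=0.5):
    """Return True if two (a, b, c, alpha, beta, gamma) cells agree within tolerances.

    len_tol (angstrom) and ang_tol (degrees) are hypothetical defaults.
    """
    if len(cell1) != 6 or len(cell2) != 6:
        raise ValueError("expected six cell parameters per structure")
    lengths_ok = all(
        math.isclose(x, y, abs_tol=len_tol) for x, y in zip(cell1[:3], cell2[:3])
    )
    angles_ok = all(
        math.isclose(x, y, abs_tol=ang_tol) for x, y in zip(cell1[3:], cell2[3:])
    )
    return lengths_ok and angles_ok

# Cells copied from the output above
same = cells_match([22.56, 22.56, 7.0, 90.0, 90.0, 120.0],
                   [22.56, 22.56, 7.0, 90.0, 90.0, 120.0])
different = cells_match([18.55, 18.55, 7.2, 90.0, 90.0, 120.0],
                        [18.83, 18.72, 7.22, 90.0, 90.0, 120.0])
print(same, different)
```

A matching cell and formula only make a pair a candidate duplicate; the same framework can also be described with a different cell arrangement, so a visual check is still needed.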

danieleongari commented 1 year ago

...continuing with the non-p2245 COFs.

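Visually inspecting the others with:

from ase.io import read
from ase.visualize import view
from IPython.display import display

# show both structures of each couple in a Jupyter notebook (the 'ngl' viewer needs nglview)
for i in range(0, 4):
    print(i)
    couple = couples[i]
    display(view(read(f"cifs/{couple[0]}.cif"), viewer='ngl'))
    display(view(read(f"cifs/{couple[1]}.cif"), viewer='ngl'))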

Results:

 ['20690N2', '12000N2'], identical
 ['21810N2', '15040N2'], identical
 ['22102N2', '16140N2'], identical, with slightly rotated linkers
 ['22107N2', '14020N2'], identical
 ['22109N2', '13100N2'], identical, shifted
 ['22260N2', '19413N2'], identical
 ['22261N2', '21600N2'], identical
 ['22301N2', '16271N2'], identical
 ['22310N2', '15040N2'], identical
 ['22330N2', '20320N2'], identical
 ['22351N2', '22300N2'], identical but twisted
 ['22410N2', '22181N2'], identical but with a different cell arrangement

Therefore, all of them are flagged as duplicates and will be moved from cof-frameworks.csv to cof-discarded.csv.
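
The move between the two files can be sketched as below. This is only an illustration under stated assumptions: rows are plain dicts, the `id` key is a hypothetical column name (the real CSV headers are not shown in this thread), and `move_to_discarded` is a made-up helper, not a script from the repository.

```python
def move_to_discarded(frameworks, discarded, duplicate_ids):
    """Move rows whose ID is in `duplicate_ids` from `frameworks` to `discarded`.

    Rows are dicts; the "id" key is a hypothetical column name.
    Returns new (frameworks, discarded) lists and leaves the inputs unchanged.
    """
    duplicate_ids = set(duplicate_ids)
    kept = [row for row in frameworks if row["id"] not in duplicate_ids]
    moved = [row for row in frameworks if row["id"] in duplicate_ids]
    return kept, discarded + moved

frameworks = [{"id": "12000N2"}, {"id": "20690N2"}, {"id": "15040N2"}, {"id": "21810N2"}]
discarded = []
# Assumption for this example: the first ID of each couple is the one discarded
kept, discarded = move_to_discarded(frameworks, discarded, ["20690N2", "21810N2"])
print([row["id"] for row in kept])
print([row["id"] for row in discarded])
```

With real files, the rows could be loaded with `csv.DictReader` and written back with `csv.DictWriter`, keeping each file's own header.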

mpougin commented 1 year ago

Thanks for double-checking, @danieleongari. I don't know why the duplicates weren't flagged when I ran the tests locally.

danieleongari commented 1 year ago

@mpougin everything should be OK now. I'm waiting for the uniqueness CI to confirm it, and then I'll merge (hopefully) this evening.

Before, there were 648 COFs (+ 80 discarded); now there are 874 COFs (+ 104 discarded), i.e., 226 (+ 24) new.
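
As a quick check of the counts quoted above (the variable names are only illustrative):

```python
before_cofs, before_discarded = 648, 80
after_cofs, after_discarded = 874, 104

# Difference between the database before and after this update
new_cofs = after_cofs - before_cofs
new_discarded = after_discarded - before_discarded
print(new_cofs, new_discarded)  # 226 24
```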