duartegroup / mlp-train

MLP training for molecular systems
MIT License

Julia error running DA explicit water training #88

Open tanoury1 opened 5 months ago

tanoury1 commented 5 months ago

Hi, I updated mlptrain and ran through the DA explicit solvent example. Everything was going fine until I got to Julia. Which version of Julia should I be using? I'm currently using 1.6.7.

```
Progress: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| Time: 0:00:20
┌─────────────┬───────┬───────┬───────┬───────┬───────┐
│ config_type │ #cfgs │ #envs │  #E   │  #F   │  #V   │
│   String    │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────────────┼───────┼───────┼───────┼───────┼───────┤
│   nothing   │  10   │ 1186  │  10   │ 3558  │   0   │
├─────────────┼───────┼───────┼───────┼───────┼───────┤
│    total    │  10   │ 1186  │  10   │ 3558  │   0   │
│   missing   │   0   │   0   │   0   │   0   │  90   │
└─────────────┴───────┴───────┴───────┴───────┴───────┘
The total number of basis functions is length(B) = 6424
Assemble LSQ blocks in serial
ERROR: LoadError: z = <6> not found in ZList AtomicNumber[<1>, <8>]
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:33
  [2] z2i
    @ ~/.julia/packages/JuLIP/KNi0Z/src/potentials_base.jl:156 [inlined]
  [3] z2i
    @ ~/.julia/packages/JuLIP/KNi0Z/src/potentials_base.jl:161 [inlined]
  [4] _Bidx0(pB::PolyPairBasis{ACE.OrthPolys.TransformedPolys{Float64, PolyTransform{Int64, Float64}, ACE.OrthPolys.OrthPolyBasis{Float64}}, 2}, zi::AtomicNumber, zj::AtomicNumber)
    @ ACE.PairPotentials ~/.julia/packages/ACE/OVgdR/src/pairpots/pair_basis.jl:91
  [5] energy(pB::PolyPairBasis{ACE.OrthPolys.TransformedPolys{Float64, PolyTransform{Int64, Float64}, ACE.OrthPolys.OrthPolyBasis{Float64}}, 2}, at::Atoms{Float64})
    @ ACE.PairPotentials ~/.julia/packages/ACE/OVgdR/src/pairpots/pair_basis.jl:100
  [6] (::JuLIP.MLIPs.var"#13#14"{Atoms{Float64}})(B::PolyPairBasis{ACE.OrthPolys.TransformedPolys{Float64, PolyTransform{Int64, Float64}, ACE.OrthPolys.OrthPolyBasis{Float64}}, 2})
    @ JuLIP.MLIPs ./none:0
  [7] iterate
    @ ./generator.jl:47 [inlined]
  [8] collect(itr::Base.Generator{Vector{JuLIP.MLIPs.IPBasis}, JuLIP.MLIPs.var"#13#14"{Atoms{Float64}}})
    @ Base ./array.jl:681
  [9] energy(superB::JuLIP.MLIPs.IPSuperBasis{JuLIP.MLIPs.IPBasis}, at::Atoms{Float64})
    @ JuLIP.MLIPs ~/.julia/packages/JuLIP/KNi0Z/src/mlips.jl:141
 [10] eval_obs(#unused#::Val{:E}, B::JuLIP.MLIPs.IPSuperBasis{JuLIP.MLIPs.IPBasis}, dat::Dat)
    @ IPFitting.DataTypes ~/.julia/packages/IPFitting/Ypo4v/src/datatypes.jl:28
 [11] eval_obs(::String, ::JuLIP.MLIPs.IPSuperBasis{JuLIP.MLIPs.IPBasis}, ::Dat)
    @ IPFitting.DataTypes ~/.julia/packages/IPFitting/Ypo4v/src/datatypes.jl:13
 [12] safe_append!(db::LsqDB, db_lock::Base.Threads.SpinLock, cfg::Dat, okey::String)
    @ IPFitting.DB ~/.julia/packages/IPFitting/Ypo4v/src/lsq_db.jl:270
 [13] #9
    @ ~/.julia/packages/IPFitting/Ypo4v/src/lsq_db.jl:182 [inlined]
 [14] #7
    @ ~/.julia/packages/IPFitting/Ypo4v/src/obsiter.jl:98 [inlined]
 [15] tfor(f::IPFitting.var"#7#9"{Vector{Dat}, IPFitting.DB.var"#9#10"{LsqDB}, Base.Threads.SpinLock, Vector{String}, Vector{Int64}}, rg::UnitRange{Int64}; verbose::Bool, msg::String, costs::Vector{Int64}, maxnthreads::Int64)
    @ IPFitting.Tools ~/.julia/packages/IPFitting/Ypo4v/src/tools.jl:22
 [16] tfor_observations(configs::Vector{Dat}, callback::IPFitting.DB.var"#9#10"{LsqDB}; verbose::Bool, msg::String, maxnthreads::Int64)
    @ IPFitting ~/.julia/packages/IPFitting/Ypo4v/src/obsiter.jl:98
 [17] LsqDB(dbpath::String, basis::JuLIP.MLIPs.IPSuperBasis{JuLIP.MLIPs.IPBasis}, configs::Vector{Dat}; verbose::Bool, maxnthreads::Int64)
    @ IPFitting.DB ~/.julia/packages/IPFitting/Ypo4v/src/lsq_db.jl:181
 [18] LsqDB(dbpath::String, basis::JuLIP.MLIPs.IPSuperBasis{JuLIP.MLIPs.IPBasis}, configs::Vector{Dat})
    @ IPFitting.DB ~/.julia/packages/IPFitting/Ypo4v/src/lsq_db.jl:177
 [19] top-level scope
    @ ~/Duarte-codes/mlp-train/examples/DA_paper/training/explicit/water_sys.jl:59
in expression starting at /cluster/home/tanoury/Duarte-codes/mlp-train/examples/DA_paper/training/explicit/water_sys.jl:59
```
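For readers unfamiliar with this error, here is a minimal Python sketch of what the failing `z2i` lookup amounts to (assumed semantics for illustration, not the actual JuLIP code): the pair basis was assembled for the pure-water system, i.e. only H and O, so any configuration containing carbon (atomic number 6) has no index in the species list.

```python
# Hypothetical sketch: the basis species list, as in ZList AtomicNumber[<1>, <8>]
ZLIST = {1: 0, 8: 1}  # atomic number -> basis index (H and O only)

def z2i(z):
    """Map an atomic number to its basis index, failing like the error above."""
    if z not in ZLIST:
        raise KeyError(f"z = <{z}> not found in ZList")
    return ZLIST[z]

print(z2i(8))  # O is known, index 1
# z2i(6) would raise, because carbon never entered the basis
```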

All the best, Jerry

physorgchem commented 5 months ago

I think this is because `TS` is used both as a `ConfigurationSet` and as a logic (boolean) variable. I had the same problem with the previous script and the current one. I fixed it by renaming the two keyword arguments: `generate_init_configs(n, bulk_water_logic=True, TS_logic=True)`.

I can also see that the generated `water_sys` contains the TS for the pure-water system: `ERROR: LoadError: z = <6> not found in ZList AtomicNumber[<1>, <8>]`

Hanwen1018 commented 5 months ago

Dear all, just to clarify: at which stage of the training, and for which machine-learned potential, did you experience this issue?

physorgchem commented 5 months ago

@Hanwen1018 I believe this is related to DA_paper/training/explicit/endo_ace_ex.py. My own testing and reading of the code led me to think the error might come from this part:

```python
def generate_init_configs(n, bulk_water=True, TS=True):
    """generate initial configuration to train potential
    it can generate three sets (pure water, TS immersed in water and
    TS bounded two water molecules) of initial configuration by modify
    the boolean variables
    n: number of init_configs
    bulk_water: whether to include a solution
    TS: whether to include the TS of the reaction in the system"""
    init_configs = mlt.ConfigurationSet()
    TS = mlt.ConfigurationSet()
    TS.load_xyz(filename='cis_endo_TS_wB97M.xyz', charge=0, mult=1)
    TS = TS[0]
    TS.box = Box([11, 11, 11])
    TS.charge = 0
    TS.mult = 1
```
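The suspected bug can be seen in miniature: the boolean parameter `TS` is immediately shadowed by a local object of the same name, so any later truthiness check on `TS` no longer sees the caller's flag. A stripped-down sketch with plain Python stand-ins (not the real `mlt` objects):

```python
def generate(n, bulk_water=True, TS=True):
    # The flag TS is shadowed here, just like in the snippet above, where
    # TS is rebound to a ConfigurationSet and then to a Configuration.
    TS = ["loaded-configuration"]  # stand-in for mlt.ConfigurationSet() + load_xyz
    TS = TS[0]                     # stand-in for TS = TS[0]
    if TS:                         # always truthy: it is an object, not the flag
        return "TS branch"
    return "pure-water branch"

print(generate(10, TS=False))  # "TS branch", even though the caller asked for pure water
```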

Hanwen1018 commented 5 months ago

Would you mind testing the following code:

```python
ts_in_water_init = generate_init_configs(n=10, bulk_water=True, TS=True)
ts_in_water_init.save_xyz(filename='init_config.xyz')
```

This is to check whether the configurations can be generated and saved.

physorgchem commented 5 months ago

thanks @Hanwen1018!

I tested the following code, following your suggestion:

```python
if __name__ == '__main__':
    water_mol = mlt.Molecule(name='h2o.xyz')
    ts_mol = mlt.Molecule(name='cis_endo_TS_wB97M.xyz')

    ts_in_water_init = generate_init_configs(n=10, bulk_water=True, TS=True)
    ts_in_water_init.save_xyz(filename='ts_in_water_init_config.xyz')

    water_sys_init = generate_init_configs(n=10, bulk_water=True, TS=False)
    water_sys_init.save_xyz(filename='water_sys_init_config.xyz')
```

ts_in_water_init_config.xyz looks perfect, but there are problems with water_sys_init_config.xyz. I believe that is the error @tanoury1 and I both hit (I had the same issue in the earlier version).

Even though it is supposed to be a pure-water system, water_sys_init_config.xyz contains the TS (see below). When Julia runs to fit the potential, the input *.jl file only declares two elements for the water system, and I think that is what the error is complaining about.

Hopefully I understood the code correctly: I renamed bulk_water and TS in def generate_init_configs(n, bulk_water=True, TS=True), since within generate_init_configs the name TS is also used for a ConfigurationSet.

```
109
Lattice="100.000000 0.000000 0.000000 0.000000 100.000000 0.000000 0.000000 0.000000 100.000000"
C 5.68346 7.18986 6.64891
C 4.87191 6.17740 7.41838
C 5.76285 5.17128 7.80558
H 3.99704 6.44672 7.99201
C 6.91477 5.25013 7.02144
H 5.53927 4.37712 8.50069
C 6.78320 6.30828 6.13985
H 7.70367 4.51517 7.00683
H 7.53838 6.63622 5.44209
H 6.11653 7.89680 7.36627
H 5.14232 7.75235 5.89481
C 5.10394 5.32733 4.80340
C 4.05770 5.36098 5.71525
H 3.29633 6.12216 5.61504
H 3.72848 4.41837 6.12557
C 5.82878 4.07143 4.59485
H 5.23046 6.11091 4.06977
O 5.66512 3.10913 5.32338
C 6.80456 4.02397 3.44137
H 7.47298 4.88474 3.46615
H 6.25537 4.06411 2.49931
H 7.37889 3.10320 3.48301
O 1.03627 7.68908 1.96254
H 1.43769 7.72623 2.86703
H 0.30603 8.35745 1.98754
O 2.24636 9.66622 7.89843
H 3.18021 9.97683 8.00821
```
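A quick way to confirm the mismatch is to list the element symbols present in each generated frame; the pure-water set should contain only H and O. A small sketch (the 3-atom frame below is a hypothetical water molecule, not data from the thread):

```python
def elements_in_xyz_frame(lines):
    """Return the set of element symbols in one (ext)xyz frame:
    first line = atom count, second line = comment/lattice, then atom lines."""
    n_atoms = int(lines[0])
    return {line.split()[0] for line in lines[2:2 + n_atoms]}

frame = [
    "3",
    'Lattice="100.0 0.0 0.0 0.0 100.0 0.0 0.0 0.0 100.0"',
    "O 0.000 0.000 0.000",
    "H 0.957 0.000 0.000",
    "H -0.240 0.927 0.000",
]
print(elements_in_xyz_frame(frame))  # only H and O; a frame like the one above would also show 'C'
```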

physorgchem commented 5 months ago

It is identical to your example under DA_paper (endo_ace_ex.py), but I changed the main part to the following for testing:

```python
if __name__ == '__main__':
    water_mol = mlt.Molecule(name='h2o.xyz')
    ts_mol = mlt.Molecule(name='cis_endo_TS_wB97M.xyz')

    ts_in_water_init = generate_init_configs(n=10, bulk_water=True, TS=True)
    ts_in_water_init.save_xyz(filename='ts_in_water_init_config.xyz')

    water_sys_init = generate_init_configs(n=10, bulk_water=True, TS=False)
    water_sys_init.save_xyz(filename='water_sys_init_config.xyz')
```
Hanwen1018 commented 5 months ago

I saw your modification. It is correct. Now, the function should be

def generate_init_configs(n, bulk_water_logic=True, TS_logic=True):
    """Generate initial configurations to train the potential.
    It can generate three sets of initial configurations (pure water,
    TS immersed in water, and TS bound to two water molecules) by
    modifying the boolean variables.
    n: number of init_configs
    bulk_water_logic: whether to include a solution (bulk water)
    TS_logic: whether to include the TS of the reaction in the system"""
    init_configs = mlt.ConfigurationSet()
    TS = mlt.ConfigurationSet()
    TS.load_xyz(filename='cis_endo_TS_wB97M.xyz', charge=0, mult=1)
    TS = TS[0]
    TS.box = Box([11, 11, 11])
    TS.charge = 0
    TS.mult = 1

    if bulk_water_logic:
        # TS immersed in a water box
        if TS_logic:
            water_mol = mlt.Molecule(name='h2o.xyz')
            water_system = mlt.System(water_mol, box=Box([11, 11, 11]))
            water_system.add_molecules(water_mol, num=43)
            for i in range(n):
                solvated = solvation(
                    solute_config=TS,
                    solvent_config=water_system.random_configuration(),
                    apm=3,
                    radius=1.7,
                )
                init_configs.append(solvated)

        # pure water box
        else:
            water_mol = mlt.Molecule(name='h2o.xyz')
            water_system = mlt.System(water_mol, box=Box([9.32, 9.32, 9.32]))
            water_system.add_molecules(water_mol, num=26)

            for i in range(n):
                pure_water = water_system.random_configuration()
                init_configs.append(pure_water)

    # TS bounded with two water molecules at carbonyl group to form hydrogen bond
    else:
        assert TS_logic is True, 'cannot generate initial configuration'
        for i in range(n):
            TS_with_water = add_water(solute=TS, n=2)
            init_configs.append(TS_with_water)

    # Change the box of the system to be extremely large to imitate a cluster system
    # the box is needed for ACE potential
    for config in init_configs:
        config.box = Box([100, 100, 100])
    return init_configs

Would you mind testing this function for pure water configurations?

physorgchem commented 5 months ago

Yes! That is what I did, and it worked. There are a few places in the script (in the main section) which need the same fix.

Thanks for your help! We are really interested in using this package.

Hanwen1018 commented 5 months ago

Thank you for testing and letting us know about the bugs. We will fix them. If you have any questions, please contact us.

tanoury1 commented 5 months ago

OK. That worked after I edited the file. I needed to make an additional edit (unless I missed something): the call sites also had to be updated to bulk_water_logic and TS_logic as you noted, otherwise:

```
Traceback (most recent call last):
  File "/cluster/home/tanoury/Duarte-codes/mlp-train/examples/DA_paper/training/explicit/endo_ace_ex.py", line 279, in <module>
    water_init = generate_init_configs(n=10, bulk_water=True, TS=False)
TypeError: generate_init_configs() got an unexpected keyword argument 'bulk_water'
```

Don't know if you can help me with my next issue: IPFitting is not precompiling when running install_ace.py. I tried to do it manually in Julia, but still no luck. The ACE folks seem to be moving away from IPFitting and replacing it with ACEfitting (or something like that). Did you experience the same issues with IPFitting? Which version of Julia are you using?

Thanks, Jerry

physorgchem commented 5 months ago

I am using Julia 1.7.1. It always complains about IPFitting, but it seems to run fine afterwards.

tanoury1 commented 5 months ago

OK. Then it must be something else. Below is the full, and lengthy, error output:

```
2024-03-15 18:20:04 hpchead mlptrain.log[313412] WARNING Save called without defining what energy and forces to print. Had true energies to using those
2024-03-15 18:20:04 hpchead mlptrain.log[313412] INFO Training an ACE potential on 10 training data
2024-03-15 18:20:07 hpchead mlptrain.log[313412] INFO ACE training ran in 0.1 m
Traceback (most recent call last):
  File "/cluster/home/tanoury/Duarte-codes/mlp-train/examples/DA_paper/training/explicit/endo_ace_ex.py", line 280, in <module>
    Water_mlp.al_train(
  File "/cluster/home/tanoury/Duarte-codes/mlp-train/mlptrain/potentials/_base.py", line 196, in al_train
    al_train(self, method_name=method_name, **kwargs)
  File "/cluster/home/tanoury/Duarte-codes/mlp-train/mlptrain/training/active.py", line 165, in train
    mlp.train()
  File "/cluster/home/tanoury/Duarte-codes/mlp-train/mlptrain/potentials/_base.py", line 66, in train
    self._train()
  File "/cluster/home/tanoury/Duarte-codes/mlp-train/mlptrain/potentials/ace/ace.py", line 58, in _train
    raise RuntimeError(f'ACE train errored with:\n{err.decode()}\n')
RuntimeError: ACE train errored with:
ERROR: LoadError: Creating a new global in closed module __toplevel__ (##meta#58) breaks incremental compilation because the side effects will not be permanent.
Stacktrace:
 [1] top-level scope
   @ none:1
 [2] eval
   @ ./boot.jl:385 [inlined]
 [3] initmeta(m::Module)
   @ Base.Docs ./docs/Docs.jl:85
 [4] doc!(module::Module, b::Base.Docs.Binding, str::Base.Docs.DocStr, sig::Any)
   @ Base.Docs ./docs/Docs.jl:235
 [5] top-level scope
   @ ~/.julia/packages/IPFitting/Ypo4v/src/IPFitting.jl:1
 [6] include
   @ ./Base.jl:495 [inlined]
 [7] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt128}}, source::String)
   @ Base ./loading.jl:2222
 [8] top-level scope
   @ stdin:3
in expression starting at /cluster/home/tanoury/.julia/packages/IPFitting/Ypo4v/src/IPFitting.jl:1
in expression starting at stdin:3
ERROR: LoadError: Failed to precompile IPFitting [3002bd4c-79e4-52ce-b924-91256dde4e52] to "/cluster/home/tanoury/.julia/compiled/v1.10/IPFitting/jl_FPpKSS".
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, keep_loaded_modules::Bool)
    @ Base ./loading.jl:2468
  [3] compilecache
    @ ./loading.jl:2340 [inlined]
  [4] (::Base.var"#968#969"{Base.PkgId})()
    @ Base ./loading.jl:1974
  [5] mkpidlock(f::Base.var"#968#969"{Base.PkgId}, at::String, pid::Int32; kwopts::@Kwargs{stale_age::Int64, wait::Bool})
    @ FileWatching.Pidfile ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/FileWatching/src/pidfile.jl:93
  [6] #mkpidlock#6
    @ ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/FileWatching/src/pidfile.jl:88 [inlined]
  [7] trymkpidlock(::Function, ::Vararg{Any}; kwargs::@Kwargs{stale_age::Int64})
    @ FileWatching.Pidfile ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/FileWatching/src/pidfile.jl:111
  [8] #invokelatest#2
    @ ./essentials.jl:894 [inlined]
  [9] invokelatest
    @ ./essentials.jl:889 [inlined]
 [10] maybe_cachefile_lock(f::Base.var"#968#969"{Base.PkgId}, pkg::Base.PkgId, srcpath::String; stale_age::Int64)
    @ Base ./loading.jl:2983
 [11] maybe_cachefile_lock
    @ ./loading.jl:2980 [inlined]
 [12] _require(pkg::Base.PkgId, env::String)
    @ Base ./loading.jl:1970
 [13] __require_prelocked(uuidkey::Base.PkgId, env::String)
    @ Base ./loading.jl:1812
 [14] #invoke_in_world#3
    @ ./essentials.jl:926 [inlined]
 [15] invoke_in_world
    @ ./essentials.jl:923 [inlined]
 [16] _require_prelocked(uuidkey::Base.PkgId, env::String)
    @ Base ./loading.jl:1803
 [17] macro expansion
    @ ./loading.jl:1790 [inlined]
 [18] macro expansion
    @ ./lock.jl:267 [inlined]
 [19] __require(into::Module, mod::Symbol)
    @ Base ./loading.jl:1753
 [20] #invoke_in_world#3
    @ ./essentials.jl:926 [inlined]
 [21] invoke_in_world
    @ ./essentials.jl:923 [inlined]
 [22] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:1746
in expression starting at /cluster/home/tanoury/Duarte-codes/mlp-train/examples/DA_paper/training/explicit/water_sys.jl:1
```

Jerry

Hanwen1018 commented 5 months ago

Hi, please try to install "JuLIP", version="0.10.1"; "ACE", version="0.8.4"; and also "IPFitting", version="0.5.0". Also, would you mind sharing your water_sys.jl with us?

tanoury1 commented 5 months ago

Yep. I've got those exact versions installed. I'm running Julia 1.10.

water_sys.jl attached.

water_sys.jl.txt

Also, perhaps it may be an ACE error. Near the top of the full error output, there is:

RuntimeError: ACE train errored with: ERROR: LoadError: Creating a new global in closed module __toplevel__ (##meta#58) breaks incremental compilation because the side effects will not be permanent.

tanoury1 commented 5 months ago

This might have something to do with my Julia version. I installed v1.7.1 using `juliaup add 1.7.1`, then added the packages again for this version of Julia.

I got further along running water_sys.jl directly from the command line, but got `ERROR: LoadError: MethodError: no method matching pretty_table`. I'm going to do a fresh conda install of mlptrain-ace with Julia 1.7.1 as my default version and see how things go.

Hanwen1018 commented 5 months ago

Hi, how is your reinstallation going? If it still doesn't work, you can try training with the MACE potential first, and we can prepare an installation script for you in the meantime.

tanoury1 commented 5 months ago

Hi.
endo_ace_ex.py is working. Hooray!! The training has been running for about 18 hr. After making the edits to the script (bulk_water_logic and TS_logic), I had to rebuild the ACE conda environment with Julia 1.7.1. That was the secret: it seems this only works with v1.7.1.

It did not work for me with v1.10.0.

Jerry

tanoury1 commented 5 months ago

I am getting an error, but I don't know if it means anything. The job continues to run:

```
2024-03-20 13:37:23 hpchead autode.log.log[297226] INFO Getting gradients from tmp_orca.out
2024-03-20 13:37:23 hpchead autode.log.log[297226] INFO Getting energy from tmp_orca.out
2024-03-20 13:37:23 hpchead autode.log.log[297226] INFO Checking for tmp_orca.out normal termination
2024-03-20 13:37:23 hpchead autode.log.log[297226] INFO orca terminated normally
2024-03-20 13:37:23 hpchead autode.log.log[297226] INFO Getting atomic charges from tmp_orca.out
2024-03-20 13:37:24 hpchead mlptrain.log[104027] INFO PLUMED coordinates not defined - returning None
2024-03-20 13:37:24 hpchead mlptrain.log[104027] ERROR predicted forces not defined - returning None
2024-03-20 13:37:24 hpchead mlptrain.log[104027] WARNING Not adding basis functions on H
2024-03-20 13:37:24 hpchead mlptrain.log[104027] WARNING Save called without defining what energy and forces to print. Had true energies to using those
```

physorgchem commented 5 months ago

If this is related to the TS_in_water system: I noticed another potential problem in the script. The generated (10) configs don't all have exactly the same number of atoms, since water is packed into a box. I hard-coded the script to keep only the configs with the same number of waters and discard the rest (see https://github.com/duartegroup/mlp-train/issues/57#issuecomment-1718806122).
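One way to implement the hard-coded filter described above (a sketch under my own assumptions, not the code actually used) is to keep only the configurations matching the most common atom count:

```python
from collections import Counter

def keep_most_common_size(configs):
    """Drop configurations whose atom count differs from the modal count."""
    counts = Counter(len(c) for c in configs)
    target = counts.most_common(1)[0][0]
    return [c for c in configs if len(c) == target]

# stand-in "configurations": lists whose length plays the role of the atom count
configs = [[0] * 129, [0] * 126, [0] * 129, [0] * 129]
kept = keep_most_common_size(configs)
print(len(kept))  # 3 of the 4 configs share the modal size of 129
```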

Hanwen1018 commented 5 months ago

Hi, the potential can be trained on systems with different numbers of atoms. I want to double check: at which stage of the training, and for which machine-learned potential, did you experience this issue? Did you obtain trained potentials?

tanoury1 commented 5 months ago

I don't have a trained potential yet. The training is still running. I have 49 dataset files in the datasets directory. Not sure how to determine how close I am to convergence.

physorgchem commented 5 months ago

Hi, the potential can be trained on systems with different numbers of atoms. I want to double check: at which stage of the training, and for which machine-learned potential, did you experience this issue? Did you obtain trained potentials?

That is correct. I had to separate the training (pure water, TS in water, etc.) and combine later due to the HPC queue time. This works, which means they can be trained on datasets with different numbers of atoms.

But weirdly, for the training on TS_in_water only, I had this issue, and I hard-coded it to select only configs with the same number of waters. Sorry for injecting my query into the discussion.

Hanwen1018 commented 5 months ago

I don't have a trained potential yet. The training is still running. I have 49 dataset files in the datasets directory. Not sure how to determine how close I am to convergence.

Do you have any _al.xyz or _al.npz files? I just want to clarify whether the issues you have come from the start of training another potential.

Hanwen1018 commented 5 months ago

But weirdly, for the training on TS_in_water only, I had this issue, and I hard-coded it to select only configs with the same number of waters. Sorry for injecting my query into the discussion.

Oh, this is because of the recent implementation of metadynamics (MTD) in AL. Even in cases where the MTD bias doesn't apply, the configurations need to pass the MTD bias check first, which requires configurations of the same size. I will update the example code.
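The constraint described above amounts to requiring a uniform configuration size before any metadynamics bias comparison. A minimal sketch of such a guard (hypothetical, not mlp-train's actual code):

```python
def assert_uniform_size(configs):
    """Raise if configurations differ in atom count, mimicking the
    same-size requirement of the MTD bias check described above."""
    sizes = {len(c) for c in configs}
    if len(sizes) > 1:
        raise ValueError(f"configurations have mixed sizes: {sorted(sizes)}")

assert_uniform_size([[0] * 81, [0] * 81])          # passes silently
# assert_uniform_size([[0] * 81, [0] * 84])        # would raise ValueError
```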

tanoury1 commented 5 months ago

I have water_sys_al.xyz and water_sys_al.npz files. So, things seem to be going as expected.

tanoury1 commented 5 months ago

Hi, looks like I may have run into a numpy error. I am using numpy version 1.26.4.

I've attached the entire output from the training. endo.output.txt

To make sure my endo_ace_ex.py is correct (after making the edits we discussed above), here is the portion where the code errored out (I think....):

# generate sub training set of pure water system by AL training

water_system = mlt.System(water_mol, box=Box([100, 100, 100]))
water_system.add_molecules(water_mol, num=26)
Water_mlp = mlt.potentials.ACE('water_sys', water_system)
water_init = generate_init_configs(n=10, bulk_water_logic=True, TS_logic=False)
Water_mlp.al_train(
    method_name='orca',
    selection_method=AtomicEnvSimilarity(),
    fix_init_config=True,
    init_configs=water_init,
    max_active_time=5000,
)

# generate sub training set of TS in water system by AL training
ts_in_water = mlt.System(ts_mol, box=Box([100, 100, 100]))
ts_in_water.add_molecules(water_mol, num=40)
ts_in_water_mlp = mlt.potentials.ACE('TS_in_water', ts_in_water)
ts_in_water_init = generate_init_configs(n=10, bulk_water_logic=True, TS_logic=True)
ts_in_water_mlp.al_train(
    method_name='orca',
    selection_method=AtomicEnvSimilarity(),
    fix_init_config=True,
    init_configs=ts_in_water_init,
    max_active_time=5000,
)

# generate sub training set of TS with two water system by AL training
ts_2water = mlt.System(ts_mol, box=Box([100, 100, 100]))
ts_2water.add_molecules(water_mol, num=2)
ts_2water_mlp = mlt.potentials.ACE('TS_2water', ts_2water)
ts_2water_init = generate_init_configs(n=10, bulk_water_logic=False, TS_logic=True)
ts_2water_mlp.al_train(
    method_name='orca',
    selection_method=AtomicEnvSimilarity(),
    fix_init_config=True,
    init_configs=ts_2water_init,
    max_active_time=5000,
)
physorgchem commented 5 months ago

But weirdly for the training on TS_in_water only, I had this issue and I harded code to select only those with the same number of waters. Sorry to injecting my query into the discussions.

Oh, this is because of the recently implementation of Metadynamics in AL. Even in this case, the MTD doesn't apply; the configurations need to pass the MTD bias check first, which requires the same size of configurations. I will update the example code.

My error is identical to @tanoury1's. You can see from their output that pure water worked, and the error appeared after training the TS_in_water MLP based on 10 snapshots. The problem goes away if I hard-code generate_init_configs to include only configs with a fixed number of waters (the most probable count, found by sampling a large number of configs).

https://github.com/duartegroup/mlp-train/issues/88#issuecomment-2016536682

physorgchem commented 5 months ago

Hi, looks like I may have run into a numpy error. I am using numpy version 1.26.4.

I've attached the entire output from the training. endo.output.txt

To make sure my endo_ace_ex.py is correct (after making the edits we discussed above), here is the portion where the code errored out (I think....):

# generate sub training set of pure water system by AL training

water_system = mlt.System(water_mol, box=Box([100, 100, 100]))
water_system.add_molecules(water_mol, num=26)
Water_mlp = mlt.potentials.ACE('water_sys', water_system)
water_init = generate_init_configs(n=10, bulk_water_logic=True, TS_logic=False)
Water_mlp.al_train(
    method_name='orca',
    selection_method=AtomicEnvSimilarity(),
    fix_init_config=True,
    init_configs=water_init,
    max_active_time=5000,
)

# generate sub training set of TS in water system by AL training
ts_in_water = mlt.System(ts_mol, box=Box([100, 100, 100]))
ts_in_water.add_molecules(water_mol, num=40)
ts_in_water_mlp = mlt.potentials.ACE('TS_in_water', ts_in_water)
ts_in_water_init = generate_init_configs(n=10, bulk_water_logic=True, TS_logic=True)
ts_in_water_mlp.al_train(
    method_name='orca',
    selection_method=AtomicEnvSimilarity(),
    fix_init_config=True,
    init_configs=ts_in_water_init,
    max_active_time=5000,
)

# generate sub training set of TS with two water system by AL training
ts_2water = mlt.System(ts_mol, box=Box([100, 100, 100]))
ts_2water.add_molecules(water_mol, num=2)
ts_2water_mlp = mlt.potentials.ACE('TS_2water', ts_2water)
ts_2water_init = generate_init_configs(n=10, bulk_water_logic=False, TS_logic=True)
ts_2water_mlp.al_train(
    method_name='orca',
    selection_method=AtomicEnvSimilarity(),
    fix_init_config=True,
    init_configs=ts_2water_init,
    max_active_time=5000,
)

Your MLP for pure water worked though! So I doubt that is because of numpy.

juraskov commented 5 months ago

Hi, I think the issue

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (20,) + inhomogeneous part.

is indeed coming from numpy, but it is triggered only when the training set contains structures with different numbers of molecules. I have corrected it in PR #89; hopefully that will be enough. Please test the update, or alternatively try downgrading numpy. As my PR currently fails the tests, I will need to look in a bit more detail at what is happening in the arrays.

tanoury1 commented 5 months ago

I was able to complete the training with the updated code, using numpy 1.26.4. To confirm, here is the list of files in my explicit directory:

```
cis_endo_TS_wB97M.xyz
datasets
endo_ace_ex.py
endo_in_water_ace_wB97M.json
h2o.xyz
TS_2water_al.npz
TS_2water_al.xyz
TS_2water.json
TS_gasphase_al.npz
TS_gasphase_al.xyz
TS_gasphase.json
TS_in_water_al.npz
TS_in_water_al.xyz
TS_in_water.json
water_sys_al.npz
water_sys_al.xyz
water_sys.json
```

It took 5.5 days to get all the training done. Does that timing sound correct? I ran on 60 cores.

Jerry