
MolSSI / MQCAS

History information for the data in the MolSSI QCArchive Server (MQCAS)

https://qcarchive.molssi.org

BSD 3-Clause "New" or "Revised" License


ASCDB #1

Closed: dgasmith closed this issue 4 years ago

dgasmith commented 4 years ago

Describe the data you'd like: Upload the ACCDB dataset found here.

Describe ways to obtain the data: Download data from the ACCDB repository.

Willing to contribute: Pier Morgante should be able to help with the ingestion. MolSSI will help with some compute.

Additional context

Upload Example

```python
import qcportal as ptl
from qcfractal import FractalSnowflake
import pandas as pd

# Use a temporary local server (snowflake) for testing the upload.
SNOWFLAKE = True
if SNOWFLAKE:
    snowflake = FractalSnowflake()
    client = snowflake.client()
else:
    client = None
print(client)

ds = ptl.collections.ReactionDataset("ASCDB", client=client)

# Each CSV row: reaction name, molecule names, then one coefficient per molecule.
with open("ASCDB.csv", "r") as handle:
    rxns = [x.split(",") for x in handle.read().splitlines()]

gpath = "ASCDB_Geometries"
contrib_name = []
contrib_value = []

# Only the first five reactions are processed in this example.
for row in rxns[:5]:
    name = row[0]
    rxn = row[1:]
    half = len(rxn) // 2
    molecules = rxn[:half]
    coefs = rxn[half:]

    rxn_data = []
    for mol_name, coef in zip(molecules, coefs):
        mol = ptl.Molecule.from_file(gpath + "/" + mol_name + ".xyz")
        coef = float(coef)
        rxn_data.append((mol, coef))

    rxn = {"default": rxn_data}
    ds.add_rxn(name, rxn)

    contrib_name.append(name)
    contrib_value.append(5)  # constant stand-in value, not a reference energy

ds.save()

contrib = {
    "name": "Benchmark",
    "theory_level": "CCSD(T)",
    "values": contrib_value,
    "index": contrib_name,
    "theory_level_details": {"driver": "energy"},
    "units": "hartree",
}
ds.add_contributed_values(contrib)
#ds.save()
```
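The row layout the example relies on (reaction name first, then molecule names, then an equal number of coefficients) can be checked without a server or geometry files. The following is a minimal sketch of just that parsing step; the helper name `parse_reaction_row` and the sample row are hypothetical and are not taken from ASCDB.csv.

```python
from typing import List, Tuple


def parse_reaction_row(row: List[str]) -> Tuple[str, List[Tuple[str, float]]]:
    """Split one CSV row into (reaction name, [(molecule name, coefficient), ...]).

    Assumes the layout used in the upload example: the first field is the
    reaction name, the first half of the remaining fields are molecule names,
    and the second half are the matching stoichiometric coefficients.
    """
    name, rest = row[0], row[1:]
    if len(rest) % 2 != 0:
        raise ValueError(f"Row {name!r} has an unequal number of molecules and coefficients")
    half = len(rest) // 2
    molecules = rest[:half]
    coefs = [float(c) for c in rest[half:]]
    return name, list(zip(molecules, coefs))


# Hypothetical row, for illustration only.
example_row = "rxn_1,A,B,AB,-1,-1,1".split(",")
name, stoichiometry = parse_reaction_row(example_row)
print(name, stoichiometry)
```

Raising on an odd field count is a design choice made here: the original loop would silently drop the last field in that case, because `zip` stops at the shorter list.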

PierMorgante commented 4 years ago

Hi Daniel, I attached the new ASCDB.csv file and a new commented version of the script (add_database). They should work on your end too; then we can think about extending the script to all the other databases in ACCDB. Let me know.

add_database.txt ASCDB.txt

dgasmith commented 4 years ago

Thanks! Do you feel like this is ready to be applied to the main QCArchive repository?

PierMorgante commented 4 years ago

I think it is ready to go, unless you want me to be more specific about the level of theory for each datapoint. I can add a keyword after the reference energies in the .csv file, rewrite the script to handle that change, and attach it here. What do you think?

mattwelborn commented 4 years ago

A couple questions:

1. Do different data points have different levels of theory?
2. Do you have any/all of the following info about the level of theory: method, basis set, program, program keywords?

PierMorgante commented 4 years ago

Matt, yes, the points have different levels of theory depending on the original database they come from (none of them come from my/our calculations). I have the method and basis set for sure, but I need to check the original articles for the programs and their keywords. What kind of keywords (besides CP, or whether they have stability analysis or not) do you need?

mattwelborn commented 4 years ago

Are all of the benchmark data "gold standard"? Are the differences between the benchmark methods employed much smaller than the difference to a test method (e.g. DFT)? Basically, I'm trying to ascertain if having the specific level of theory data would be worth the effort (or even useful to the user), or if we can just label all of the values as "benchmark" and move on.

PierMorgante commented 4 years ago

The main goal of our database was to have all datapoints at a level of theory higher than DFT. To answer your question, yes, they are all gold standard. I asked if you wanted to incorporate the details because I saw you did it for the other databases in the collection. In my opinion it is not necessary, so you can incorporate the data I sent you as "benchmark" and I would be fine with it.

mattwelborn commented 4 years ago

Okay, that sounds like a plan. I see that contrib["theory_level"] is CCSD(T). Are all of the benchmark data CCSD(T)-based?

PierMorgante commented 4 years ago

I double-checked, and 160 datapoints (out of 200) are CCSD(T)-based (most of them using Wn protocols), 20 are CAS-SCF/CASPT2, and 20 (transition metal compounds) come from corrected experimental values. If you need me to be more detailed, I'll attach a new script and a new .csv file that take care of this issue.
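As a rough sketch of what a more detailed version could look like, the datapoints could be grouped by a per-row theory-level label, producing one contributed-values dictionary per level. The dictionary keys mirror the `contrib` dictionary in the upload example above. The function name `build_contributions`, the row tuples, the labels, and the values below are hypothetical placeholders, and whether the dataset accepts several such dictionaries has not been checked here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_contributions(
    rows: List[Tuple[str, float, str]], units: str = "hartree"
) -> List[dict]:
    """Group (reaction name, reference value, theory level) rows by theory level.

    Returns one dictionary per theory level, using the same keys as the
    `contrib` dictionary in the upload example.
    """
    grouped: Dict[str, Dict[str, list]] = defaultdict(lambda: {"index": [], "values": []})
    for name, value, theory_level in rows:
        grouped[theory_level]["index"].append(name)
        grouped[theory_level]["values"].append(value)

    return [
        {
            "name": f"Benchmark ({level})",
            "theory_level": level,
            "values": data["values"],
            "index": data["index"],
            "theory_level_details": {"driver": "energy"},
            "units": units,  # assumption: same units as the upload example
        }
        for level, data in grouped.items()
    ]


# Placeholder rows for illustration; these are not ASCDB values.
rows = [
    ("rxn_1", 1.0, "CCSD(T)"),
    ("rxn_2", 2.0, "CASPT2"),
    ("rxn_3", 3.0, "CCSD(T)"),
]
for contrib in build_contributions(rows):
    print(contrib["theory_level"], contrib["index"], contrib["values"])
```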

mattwelborn commented 4 years ago

Do you have a paper for ASCDB?

PierMorgante commented 4 years ago

Yes, it's P. Morgante, R. Peverati, "Statistically representative databases for density functional theory via data science", Phys. Chem. Chem. Phys. 2019, 21(35), 19092–19103. DOI:10.1039/C9CP03211H.