Dataset request: AIMD-Chig

jvita commented 11 months ago

Name

Josh Vita

Email

vita1@llnl.gov

Dataset name

AIMD-Chig

Authors

Tong Wang, Xinheng He, Mingyu Li, Bin Shao, Tie-Yan Liu

Links

Dataset description

This dataset covers the conformational space of Chignolin with DFT-level precision. We sequentially applied replica exchange molecular dynamics (REMD), conventional MD, and ab initio MD (AIMD) simulations to a 10-amino-acid protein, Chignolin, and collected 2 million biomolecule structures with quantum-level energy and force records.

File details

The data repo includes a README specifying the folder contents and structure; it reports that the data is stored in XYZ format and grouped by "anchor".

In total, the data looks to be ~15 GB (zipped).

Method

DFT

Method (other)

No response

Software

ORCA

Software (other)

No response

Software version(s)

4.2.1

Additional details

The M06-2X functional in conjunction with the 6-31G* basis set was employed for the calculations.

Property types

Atomic forces, Potential energy

Other/additional property

No response

Property details

potential-energy: units=Hartree, per-atom=False
atomic-forces: units=Hartree/Ang
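
If the ingest pipeline expects eV and eV/Ang rather than Hartree-based units (an assumption; the request does not say which units the database stores), the conversion could be sketched as below. The function names are hypothetical, and the constant is the CODATA 2018 value of the Hartree energy in eV.

```python
# CODATA 2018: 1 Hartree expressed in electronvolts.
HARTREE_TO_EV = 27.211386245988


def energy_to_ev(energy_hartree: float) -> float:
    """Convert a total potential energy from Hartree to eV."""
    return energy_hartree * HARTREE_TO_EV


def forces_to_ev_per_ang(forces):
    """Convert per-atom force vectors from Hartree/Ang to eV/Ang.

    The length unit is already Angstrom, so only the energy factor applies.
    """
    return [[component * HARTREE_TO_EV for component in atom] for atom in forces]
```

Because the energy is recorded with per-atom=False, the converted value is still a total energy for the whole structure.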

Elements

Chignolin

Number of Configurations

2,000,000

Naming convention

Names can likely be generated by //snapshot.xyz.

Configuration sets

No response

Configuration labels

No response

Distribution license

CC BY 4.0

Permissions

[ ] I confirm that I have the necessary permissions to submit this dataset

jvita commented 11 months ago

Some notes for improving the upload template:

The template should enable selecting molecule type, instead of only elements. I'm sure there's a tool out there somewhere for generating element lists from molecule names.
Configuration set (CS) construction should support pre-built groups better. For example, the data here is grouped as /; these should be automatically converted to CSs somehow. I didn't add CS definitions here because I didn't want to have to add 100 lines of regexes, one for each group.
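
As a rough alternative to looking elements up from a molecule name, the element list could be read from the data itself. This is a minimal sketch that assumes the files follow the plain XYZ layout (atom count, comment line, then one line per atom starting with the element symbol); the dataset's files may carry extra columns or metadata, and `elements_in_xyz` is a hypothetical helper, not part of any existing tool.

```python
def elements_in_xyz(text: str) -> list[str]:
    """Return the sorted element symbols found in a (possibly multi-frame) XYZ string."""
    lines = text.splitlines()
    symbols = set()
    i = 0
    while i < len(lines):
        if not lines[i].strip():
            i += 1  # tolerate blank lines between frames
            continue
        n_atoms = int(lines[i].split()[0])
        # Skip the count line and the comment line, then read n_atoms atom lines.
        for line in lines[i + 2 : i + 2 + n_atoms]:
            symbols.add(line.split()[0])
        i += 2 + n_atoms
    return sorted(symbols)


# Made-up frame for illustration only; it is not taken from the dataset.
example = """3
water, hypothetical frame
O 0.000 0.000 0.000
H 0.757 0.586 0.000
H -0.757 0.586 0.000
"""
print(elements_in_xyz(example))  # ['H', 'O']
```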

gpwolfe commented 10 months ago

It looks like the subdirectories [0 - 99] in this case are just a way of dividing the 10K initial structures into manageable, 100-structure chunks. The actual divisions (by initial structure, or 'anchors', as defined in the text) are separated into the 10K individual files. Would that be an unreasonable number of configuration sets?

jvita commented 10 months ago

In that case, I'd say that it should be left as a single CS. Perhaps the anchors could be given as labels.
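
A minimal sketch of how anchors might be turned into labels while keeping everything in one CS. The path layout used here (`<chunk>/<anchor id>.xyz`) is a guess for illustration, and both helper names are hypothetical; the real layout is whatever the dataset README specifies.

```python
from pathlib import PurePosixPath


def anchor_label(path: str) -> str:
    """Build a label from the file stem, assumed here to identify the anchor."""
    return f"anchor_{PurePosixPath(path).stem}"


def label_configurations(paths):
    """Map each path to its list of labels; all paths stay in a single CS."""
    return {p: [anchor_label(p)] for p in paths}


# Hypothetical paths following the assumed "<chunk 0-99>/<anchor id>.xyz" layout.
paths = ["0/17.xyz", "42/4217.xyz"]
print(label_configurations(paths))
```

Deriving the label from the path avoids writing one regex per group, since the chunk subdirectory is ignored and only the anchor identifier is kept.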

gpwolfe commented 10 months ago

Staged for ingest after next database update

gpwolfe commented 8 months ago

The new database is now live.
