echemdb / metadata-schema

Metadata schema describing electrochemical data
GNU General Public License v3.0

Open discussion on metadata standard #2

Open DunklesArchipel opened 3 years ago

DunklesArchipel commented 3 years ago

Points to discuss:

Original top-level message: The YAML files for the test cases, e.g., xy.yaml, are still in the old format. They should be adapted to work properly with the future CV module.

DunklesArchipel commented 3 years ago

Based on the changes in echemdb/svgdigitizer#51, the content of the YAML file should be reconsidered.

The figure description part now requires fewer keys, since most values are extracted directly from the SVG file.

nicohoermann commented 3 years ago

We should discuss the YAML structure once more. For the electrolyte we now have {name, sum formula, ...}. Why don't we use the same doubling of name and formula for electrode materials? Doubled identifiers can also be problematic: e.g., EtOH, as used in the YAML, is not a standardized sum formula. I think we should have only one classifier, e.g., the name, and create the sum formula from a lookup table, or the other way round. Also, showing "sodium chloride" instead of "NaCl" on the website, for example, is not ideal. In short, I think we should homogenize the names to a single form and then have a function that creates the "common name", which is what gets displayed on the website. In the discussed case this would be "NaCl" and "Ethanol", rather than always the name or always the sum formula.

DunklesArchipel commented 3 years ago

EtOH is indeed some sort of trivial name. Besides, I agree that a single identifier is sufficient.

In principle, the name can be a full name, a sum formula, or a trivial name. It could be checked against a list of dicts:

chemicals = [{'name': 'ethanol',
              'sum formula': 'C2H6O',
              'display name': 'Ethanol',
              'alternative names': ['ethanol', 'EtOH', 'C2H6O', 'CH3CH2OH', 'C2H5OH']},
             {'name': 'sulfuric acid',
              'sum formula': 'H2SO4',
              'display name': 'H$_2$SO$_4$',
              'alternative names': ['sulfuric acid', 'H2SO4']}]
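
A minimal sketch of how such a list could be used to map any accepted identifier to its display name. The helper names `build_index` and `display_name` are hypothetical, and the case-insensitive matching is an assumption made for illustration, not something agreed on in this thread.

```python
# Lookup table following the structure proposed above.
CHEMICALS = [
    {
        "name": "ethanol",
        "sum formula": "C2H6O",
        "display name": "Ethanol",
        "alternative names": ["ethanol", "EtOH", "C2H6O", "CH3CH2OH", "C2H5OH"],
    },
    {
        "name": "sulfuric acid",
        "sum formula": "H2SO4",
        "display name": "H$_2$SO$_4$",
        "alternative names": ["sulfuric acid", "H2SO4"],
    },
]


def build_index(chemicals):
    """Map every alternative name (stripped, lower-cased) to its entry."""
    index = {}
    for entry in chemicals:
        for alias in entry["alternative names"]:
            index[alias.strip().lower()] = entry
    return index


def display_name(identifier, index):
    """Return the display name for a known identifier; raise KeyError otherwise."""
    key = identifier.strip().lower()
    if key not in index:
        raise KeyError(f"Unknown chemical identifier: {identifier!r}")
    return index[key]["display name"]


INDEX = build_index(CHEMICALS)
print(display_name("EtOH", INDEX))   # Ethanol
print(display_name("h2so4", INDEX))  # H$_2$SO$_4$
```

Lower-casing keeps the lookup forgiving for names, but it would conflate formulas that differ only in case (e.g., Co and CO), so a real implementation would probably need to treat formulas separately.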

We could also try to use existing packages for chemical formulas:

nicohoermann commented 3 years ago

Yes, I think we should do this to be on the safe side. After some looking around, I found this solution, which queries PubChem for the synonyms of a compound and assigns a unique CID (compound ID); see the code below.

We could then also query the InChI: https://iupac.org/who-we-are/divisions/division-details/inchi/

What is nice is that you can write either EtOH or ethanol, and it removes the need to build the dict above by hand. Alternatively, we build the dict above from the code pasted below. Then we can easily translate into chemical formulas, names, display names, etc., and it serves as an automatic test that people have written actual chemical compounds into the YAML:

## Sanitize names
import pubchempy as pcp
from pubchempy import Compound
from pymatgen.core.composition import Composition


def lookup(name):
    """Return the first PubChem compound found for a name, formula, or abbreviation."""
    cids = pcp.get_cids(name, 'name')
    if not cids:
        raise ValueError(f'No PubChem entry found for {name!r}')
    return Compound.from_cid(cids[0])


# Resolve each spelling through PubChem's name search (requires network access).
for name in ['NaCl', 'sodium chloride', 'Sodium Chloride', 'ethanol', 'EtOH']:
    compound = lookup(name)
    # Reduce the formula reported by PubChem with pymatgen.
    reduced = Composition(compound.molecular_formula).reduced_formula
    print(compound.synonyms[0], compound.molecular_formula, reduced)

DunklesArchipel commented 2 years ago

Great stuff. It finds random typos.
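
One way the lookup could be turned into the automatic check discussed in this thread is sketched below. The function names `pubchem_resolver`, `offline_resolver`, and `find_unresolved` are hypothetical. The resolver is passed in as a parameter so that the check can be tried without network access; the small `KNOWN` table is only an illustrative stand-in for PubChem.

```python
def pubchem_resolver(name):
    """Return the first PubChem CID for a name, or None (needs network access)."""
    import pubchempy as pcp  # imported lazily so other resolvers work offline
    cids = pcp.get_cids(name, "name")
    return cids[0] if cids else None


def find_unresolved(names, resolver=pubchem_resolver):
    """Return the names for which the resolver finds no compound."""
    return [name for name in names if resolver(name) is None]


# Offline stand-in for PubChem, for illustration only.
KNOWN = {"nacl": 5234, "ethanol": 702, "etoh": 702}


def offline_resolver(name):
    """Look a name up in the illustrative KNOWN table, ignoring case."""
    return KNOWN.get(name.strip().lower())


print(find_unresolved(["NaCl", "EtOH", "Ethanole"], resolver=offline_resolver))
# ['Ethanole']
```

With the default `pubchem_resolver`, whether a misspelled name is rejected depends on PubChem's own synonym list, so such a check would catch only the names PubChem does not know.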

DunklesArchipel commented 2 years ago

Add a subsection for the gases used to the "electrochemical system" section.