bjodah / chempy

⚗ A package useful for chemistry written in Python
BSD 2-Clause "Simplified" License

Non-integer subscripts in formulas #176

Closed spizwhiz closed 2 years ago

spizwhiz commented 4 years ago

Hi,

Is it possible in ChemPy to parse formulas for substances that have non-integer subscripts?

e.g.

'Ca2.832Fe0.6285Mg2.5395(CO3)6'

If so, how would you format the formula above to achieve the desired result?

Right now, I get the following:

cp.Substance.from_formula('Ca2.832Fe0.6285Mg2.5395(CO3)6')

Ca2⋅832Fe0⋅6285Mg2⋅5395(CO3)6

Formulas like this are a common need when working with natural minerals that behave as a solid solution. I thought about just scaling up the subscripts until they are all integers, but the resulting molar mass would then be incorrect.

jeremyagray commented 4 years ago

Not right now. The relevant bits of code are in chempy/util/parser.py in _get_formula_parser():

Suppress, Word, nums = _p.Suppress, _p.Word, _p.nums

LPAR, RPAR = map(Suppress, "()")
integer = Word(nums)

# add parse action to convert integers to ints, to support doing addition
# and multiplication at parse time

integer.setParseAction(lambda t: int(t[0]))

In pyparsing, nums is simply the string of digits 0-9, so Word(nums) matches only runs of digits. I don't know your usage details, but you could always scale your subscripts to avoid decimals and then use the scale factor to readjust calculated results. This bit of the code could be changed to look for integer or decimal subscripts, but you would have to be careful not to interfere with the parsing of hydrated formulas, which are sometimes written like BaCl2.2H2O (and which don't appear to be currently implemented). I'm also not sure how this might interact with the rest of the package.

I have been working on the parser and have some limited support for complexes like K4[Fe(CN)6] (with complex anions, cations, and both), and I believe I have implemented hydrates as well. I may be able to add decimal subscripts too, but that will be harder, as you would need to look ahead and determine whether you're parsing a subscript or a hydrate. However, none of this work has been tested outside the parsing module, and I have no idea how well it will interact with the rest of the package.
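To make the subscript-versus-hydrate ambiguity concrete, here is a minimal sketch using only the standard library. The TOKEN pattern and the tokenize function are hypothetical illustrations, not chempy's pyparsing grammar. The sketch shows what happens if a count is naively allowed to be an integer or a decimal.

```python
import re

# Hypothetical token pattern (NOT chempy's grammar): an element symbol
# followed by an optional integer-or-decimal count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def tokenize(formula):
    """Return (symbol, count) pairs; a missing count defaults to 1."""
    return [(sym, float(num) if num else 1.0)
            for sym, num in TOKEN.findall(formula)]


# Decimal subscripts are read as intended.
print(tokenize("Ca2.832Fe0.6285"))
# The hydrate BaCl2.2H2O is misread: "2.2" is taken as one subscript on Cl.
print(tokenize("BaCl2.2H2O"))
```

With this naive pattern the hydrate is read as if chlorine had the subscript 2.2, which is why the two notations cannot share the same "." separator without extra rules.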

spizwhiz commented 4 years ago

Thanks, I think I understand the issues.

Would requiring the user to use some kind of special notation for decimal subscripts potentially help with the hydrates issue?

E.g. putting the subscript in square brackets or curly braces?

'Ca{2.832}Fe{0.6285}Mg{2.5395}(CO3)6'
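As a rough sketch of how such a brace notation could be read, the following standard-library example parses the formula above into a composition dictionary. The function parse_braced and both regular expressions are hypothetical, and the sketch assumes at most one level of parentheses; it is not part of chempy.

```python
import re

# Hypothetical syntax: "{...}" wraps a decimal subscript, plain digits are
# integer subscripts, and only one level of parentheses is supported.
_COUNT = r"(?:\{(\d+(?:\.\d+)?)\}|(\d+))?"
_TOKEN = re.compile(r"([A-Z][a-z]?)" + _COUNT)
_GROUP = re.compile(r"\(([^()]*)\)" + _COUNT)


def _count(braced, plain):
    """Convert the captured subscript text to a number (default 1)."""
    if braced:
        return float(braced)
    if plain:
        return float(plain)
    return 1.0


def parse_braced(formula):
    """Return a dict mapping element symbol to its total count."""
    comp = {}

    def add(fragment, factor):
        for sym, braced, plain in _TOKEN.findall(fragment):
            comp[sym] = comp.get(sym, 0.0) + factor * _count(braced, plain)

    # Parenthesised groups first, each multiplied by its own subscript.
    for inner, braced, plain in _GROUP.findall(formula):
        add(inner, _count(braced, plain))
    # Then whatever is left outside the parentheses.
    add(_GROUP.sub("", formula), 1.0)
    return comp


print(parse_braced("Ca{2.832}Fe{0.6285}Mg{2.5395}(CO3)6"))
```

Because the braces mark decimal subscripts explicitly, a "." outside braces would remain free for the hydrate notation.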

jeremyagray commented 4 years ago

I think something like that could work for ASCII. In Unicode, though, there are separate subscript characters, periods, and raised dots that would disambiguate subscripts and hydrates, so that could be an option. My preference would be to stay with ASCII and parse it correctly, if possible, with minimal differences from how we would normally write a formula.

I will tinker with it some soon and see if I can’t at least prototype parsing decimal subscripts.

spizwhiz commented 4 years ago

Awesome, thanks!

I wish I could help more, but I don't think I would be that much help to you at this point.

spizwhiz commented 4 years ago

For anyone else needing to work with decimal subscripts, here is a function to quickly calculate the smallest integer subscripts, and the scaling factor.

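import decimal

import numpy as np

def integer_subs(frac_subs):
    # Decimal exponent of each subscript (e.g. -4 for 0.6285)
    dec = [decimal.Decimal(str(sub)).as_tuple().exponent for sub in frac_subs]
    decmin = min(dec)
    # Multiplier that turns every subscript into an integer
    mf = 10**-decmin
    subs = frac_subs*mf
    subs = subs.round()
    subs = subs.astype(int)

    # Divide out the greatest common divisor to get the smallest integers
    cd = np.gcd.reduce(subs)

    subsf = subs // cd
    print('Integer Subscripts: ', subsf)

    # Divide results computed from the integer formula by this factor
    sf = mf/cd
    print('Scaling Factor:', sf)

subs1 = np.array([2.832,0.6285,2.5395,6])

integer_subs(subs1)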

Integer Subscripts:  [1888  419 1693 4000]
Scaling Factor: 666.6666666666666
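
As a quick check of the workaround, the sketch below compares the molar mass computed directly from the decimal subscripts with the one computed from the integer subscripts printed above and divided by the scaling factor. The MASS table (rounded standard atomic masses in g/mol) and the molar_mass helper are assumptions made for this illustration; they do not use chempy.

```python
# Rounded standard atomic masses in g/mol (assumed values for illustration).
MASS = {"Ca": 40.078, "Fe": 55.845, "Mg": 24.305, "C": 12.011, "O": 15.999}


def molar_mass(composition):
    """Sum atomic mass times count over all elements."""
    return sum(MASS[el] * n for el, n in composition.items())


# Ca2.832 Fe0.6285 Mg2.5395 (CO3)6
decimal_formula = {"Ca": 2.832, "Fe": 0.6285, "Mg": 2.5395, "C": 6, "O": 18}
# Ca1888 Fe419 Mg1693 (CO3)4000, i.e. C4000 and O12000
integer_formula = {"Ca": 1888, "Fe": 419, "Mg": 1693, "C": 4000, "O": 12000}
scaling_factor = 10000 / 15  # the 666.67 printed above

direct = molar_mass(decimal_formula)
rescaled = molar_mass(integer_formula) / scaling_factor
print(direct, rescaled)  # the two values agree up to floating-point rounding
```

The same division by the scaling factor applies to any quantity that is proportional to the subscripts.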

bjodah commented 4 years ago

Hi, and thank you both for looking into this.

I'm open to changing the syntax for crystal water to allow parsing non-integer stoichiometric coefficients.

What about "Na2SO4:10H2O", does that look "natural" enough in your eyes? (just a spontaneous suggestion of mine). I personally won't be able to code up a prototype anytime soon I'm afraid, but if you want to go ahead I'll definitely make time for code review and publishing an updated release etc.

Changing the syntax will be a breaking change, so we'll need to bump the version number. Ideally, the parser would accept both the old and the new syntax for one intermediate release, issuing a warning that instructs the user to migrate to the new syntax. But if that is not possible, I'm open to skipping the deprecation cycle.
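
One way such a transitional release could behave is sketched below. The helper normalize_hydrate_separator is hypothetical, and the sketch assumes that the old syntax separates crystal water with "." and the new one with ":". During the transition a "." is still read as a hydrate separator, so decimal subscripts would only become available once the old syntax is removed.

```python
import warnings


def normalize_hydrate_separator(formula):
    """Hypothetical helper: accept the old 'X.nH2O' form, warn, and
    return the formula in the proposed 'X:nH2O' form."""
    if "." in formula and ":" not in formula:
        warnings.warn(
            "'.' as crystal-water separator is deprecated; use ':' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Only the first '.' is treated as the separator in this sketch.
        return formula.replace(".", ":", 1)
    return formula


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(normalize_hydrate_separator("Na2SO4.10H2O"))  # old syntax, warns
    print(normalize_hydrate_separator("Na2SO4:10H2O"))  # new syntax, silent
    print(len(caught), "deprecation warning(s) issued")
```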