cgohlke / molmass

Molecular mass calculations
https://pypi.org/project/molmass
BSD 3-Clause "New" or "Revised" License
56 stars 13 forks source link

molmass.split_charge (and thus all downstream formula-parsing) fails with trailing `]]` in charged formulae #11

Closed zpincus closed 1 year ago

zpincus commented 1 year ago

Actual behavior:

>>> molmass.split_charge('[CHNOP[13C]]2-')
('[CHNOP[13C]]2-', -2)

Expected behavior:

>>> molmass.split_charge('[CHNOP[13C]]2-')
('CHNOP[13C]', -2)

Problem: molmass.py:1813:

    m = re.search(r'([\]_]{1,})([0-9]{1,})([+-]{1,})$', formula)

where the first {1,} allows one or more of ] or _ to act as the delimiter. Changing this line to:

    m = re.search(r'([\]_])([0-9]{1,})([+-]{1,})$', formula)

resolves this issue (at the cost of disallowing multiple underscores to act as a delimiter; if this is desired, the regexp could of course be expanded). molmass --test and molmass --doctest both pass with this change.

Another related issue is the inability to parse charges of the form [CHNOP]+2 instead of [CHNOP]2+, which could also be remedied (if desired) by adding a second regexp similar to above, a la:

    m = re.search(r'([\]_])([0-9]{1,})([+-]{1,})$', formula)
    if m:
        m_delim, m_count, m_sign = m.groups()
    else:
        m = re.search(r'([\]_])([+-]{1,})([0-9]{1,})$', formula)
        if m:
            m_delim, m_sign, m_count = m.groups()
    if m:
        if m_count == '':
            charge = int(f'{m_sign}1')
        else:
            charge = int(f'{m_sign}{m_count}')
    else:
    [...]

This latter change also does not cause any current tests to fail, but perhaps there are other dark corner-cases...

Version is the current release, btw:

> molmass --version
molmass 2022.10.18
cgohlke commented 1 year ago

Thank you. I applied your fix in v2022.12.9.

cgohlke commented 1 year ago

parse charges of the form [CHNOP]+2

Is that format widely used? I think for now the IUPAC order is good enough.

zpincus commented 1 year ago

Is that format widely used? I think for now the IUPAC order is good enough.

Hmm, rdkit.Chem.rdMolDescriptors.CalcMolFormula produces CHNOP+2 which is a horror but there you go. Not sure that I've actually seen the bracketed form in the wild now that you mention it. (I forgot that rdkit made the un-bracketed version.)

cgohlke commented 1 year ago

I found this.

cgohlke commented 1 year ago

rdkit.Chem.rdMolDescriptors.CalcMolFormula produces CHNOP+2

Molmass v2023.4.10 supports rdkit-style ionic charges.

>>> Formula('CHNOP+2').charge
2