bjodah / chempy

⚗ A package useful for chemistry written in Python
BSD 2-Clause "Simplified" License
548 stars 79 forks source link

Zn(NO3)2.6H2O fails to parse in 0.8.3, regression? #223

Closed ibressler closed 5 months ago

ibressler commented 1 year ago

With version 0.8.2, the formula Zn(NO3)2.6H2O was parsed just fine:

>>> chempy.Substance.from_formula('Zn(NO3)2.6H2O')
<Substance(name=Zn(NO3)2.6H2O, ...)>

With version 0.8.3, I get an error:

>>> chempy.Substance.from_formula('Zn(NO3)2.6H2O')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/.pyvenv/lib/python3.10/site-packages/chempy/chemistry.py", line 190, in from_formula
    unicode_name=formula_to_unicode(formula),
  File "~/.pyvenv/lib/python3.10/site-packages/chempy/util/parsing.py", line 649, in formula_to_unicode
    return _formula_to_format(
  File "~/.pyvenv/lib/python3.10/site-packages/chempy/util/parsing.py", line 545, in _formula_to_format
    string += re.sub(r"([0-9]+\.[0-9]+|[0-9]+)", lambda m: sub(m.group(1)), stoich)
  File "/usr/local/Cellar/python@3.10/3.10.12_1/Frameworks/Python.framework/Versions/3.10/lib/python3.10/re.py", line 209, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "~/.pyvenv/lib/python3.10/site-packages/chempy/util/parsing.py", line 545, in <lambda>
    string += re.sub(r"([0-9]+\.[0-9]+|[0-9]+)", lambda m: sub(m.group(1)), stoich)
  File "~/.pyvenv/lib/python3.10/site-packages/chempy/util/parsing.py", line 650, in <lambda>
    lambda x: "".join(_unicode_sub[str(_)] for _ in x),
  File "~/.pyvenv/lib/python3.10/site-packages/chempy/util/parsing.py", line 650, in <genexpr>
    lambda x: "".join(_unicode_sub[str(_)] for _ in x),
KeyError: '.'

Is this intended? Thanks for more info on that!

Best wishes Ingo

jeremyagray commented 1 year ago

I think this is the known issue noted in #207, #211, and here. I assume just the to unicode parsing is broken but the rest are working?

I haven't done further research on the unicode alternatives, so I don't know if they have added subscript/superscript decimal separators or if a better option has arisen. I know I have a bad solution ready in a branch somewhere, but I would really like unicode to add the decimal separators.

bjodah commented 1 year ago

@ibressler , thank you for reporting this. To me this looks like a regression. And I think we need to take regressions seriously (i.e. avoid breaking existing code whenever feasible).

@jeremyagray For a subscript dot in unicode this might work (it looks alright in my terminal but I know unicode support is iffy on other platforms):

>>> print("H₂\u0323₃")
H₂̣₃

it doesn't look that great in my webbrowser though. But I guess most solutions are better than the current state. Would you agree?

The superscript dot I tried doesn't look great, but I think it might be acceptable?

>>> print("Ca²\u02d9³⁺")
Ca²˙³⁺

This is what it looks like on a chromebook (my travelling laptop): image

jeremyagray commented 1 year ago

The aesthetics are horrible, but that was the original problem with the fix. Let me find that original fix and I'll use these characters as stand-in decimal separators, hopefully in a day or two.

bjodah commented 1 year ago

Thank you @jeremyagray ! I'm on vacation at the moment, so no need to stress since I won't be able to upload to PyPI for another couple of days either way.

bjodah commented 1 year ago

Even after we fix the unicode printing issue we will be facing another problem, i.e. the ambiguity between floating point stoichiometries and crystal water. I do know that we discussed this. But it escaped my mind that the change is now between 0.8.2 and 0.8.3:

$ git describe; python3 -c "import chempy; print(chempy.Substance.from_formula('Zn(NO3)2.6H2O').composition)"
v0.8.2
{30: 1, 7: 2, 8: 12, 1: 12}
$ git describe; python3 -c "import chempy; print(chempy.Substance.from_formula('Zn(NO3)2.6H2O').composition)"
v0.8.3-1-ge32e4e5
{30: 1, 7: 2.6, 8: 8.8, 1: 2}

I'm leaning towards creating a 0.8.x branch, reverting the parser changes there, bumping version on master to be upcoming 0.9.x and start thinking of a migration path for "crystal water dot". Perhaps requiring interpunct to get "crystal water behavior"? Any thoughts regarding this @ibressler?

ibressler commented 1 year ago

Thanks for your feedback and fixes for this, it's very much appreciated! Regarding your question @bjodah, since I am not a chemist, I asked my colleagues for their opinion on this:

A middle dot would be typographically more correct but harder to enter by users using the standard available character set. However, fractional occupancies are also possible, and they’d have to be denoted with a floating point number, meaning the standard period (.) cannot serve both functions. So it probably has to be middle dot for crystalline water so we can have float occupancies of other elements.

So, the interpunct should be fine.

bjodah commented 1 year ago

Thank you @ibressler. I'm leaning toward accepting asterisk (*) as an alternative to interpunct too. I envision a version 0.8.4 of chempy where this is supported, and any . in the parsed string will issue a deprecation warning informing users to use e.g. * to indicate crystal water (or any adduct really). Then in chempy 0.9+ we interpret the dot as floating point number. Any thoughts on this @jeremyagray ?

jeremyagray commented 1 year ago

I can't remember if this is a regression. We (I?) addressed the hydrate/decimal subscript issue in the parser by using . for decimal subscripts and .. for hydrates. It's been a while since the parser was updated but it appears the update may be in some of my branches and not in others. I've pushed up a branch with this fix for the hydrate problem and the decimal problem, if anyone is interested in further testing.

bjodah commented 5 months ago

So I have released both:

Let me know if there are any further issues.