EnzymeML / PyEnzyme

🧬 - Data management and modeling framework based on EnzymeML.
BSD 2-Clause "Simplified" License
23 stars 9 forks source link

Units ending in "s" not parsed correctly #9

Closed jmrohwer closed 3 years ago

jmrohwer commented 3 years ago

Currently, because "s" is also part of the regex for unit parsing, any unit ending in "s" (celsius, minutes, seconds, hours) is not parsed correctly: https://github.com/EnzymeML/PyEnzyme/blob/32011fc22ee36bece374dedd2a30454ff384eb77/pyenzyme/enzymeml/tools/unitparser.py#L104-L106 The consequence is that "celsius" would be matched, the prefix would be "celsiu" and the unit "s" which gives an error further on.

Furthermore, the dimensionless units are not included in the regex.

The above can be fixed by the following, i.e. matching for the "s" separately if the first regex does not find a match (although there may be a more elegant way):

diff --git a/pyenzyme/enzymeml/tools/unitparser.py b/pyenzyme/enzymeml/tools/unitparser.py
index a6138bb..7c63b9c 100644
--- a/pyenzyme/enzymeml/tools/unitparser.py
+++ b/pyenzyme/enzymeml/tools/unitparser.py
@@ -102,7 +102,7 @@ class UnitParser(object):
         )

     def __getPrefix(self, string, exponent):
-        regex = "^([a-zA-Z]*)(C|celsius|K|kelvin|M|molar|mole|g|gram|l|L|litre|liter|s|sec|seconds|second|min|mins|minutes|h|hour|hours)$"
+        regex = "^([a-zA-Z]*)(C|celsius|K|kelvin|M|molar|mole|g|gram|l|L|litre|liter|sec|seconds|second|min|mins|minutes|h|hour|hours|abs|absorption|dimensionless)$"
         string = string.lower()[0:-1] + string[-1]

         try:
@@ -120,10 +120,26 @@ class UnitParser(object):
             )

         except IndexError:
-            unit = re.findall(regex, string)[0][0]
-            return (
-                "NONE",
-                unit,
-                exponent
-            )
+            try:
+                regex = "^([a-zA-Z]*)(s)$"
+                prefix = re.findall(regex, string)[0][0]
+
+                if len(prefix) > 1:
+                    prefix = self.__prefixDict[prefix.lower()]
+
+                unit = re.findall(regex, string)[0][1]
+
+                return (
+                    prefix,
+                    unit,
+                    exponent
+                )
+
+            except IndexError:
+                unit = re.findall(regex, string)[0][0]
+                return (
+                    "NONE",
+                    unit,
+                    exponent
+                )
JR-1991 commented 3 years ago

I fixed the bug by changing the regex for "s" to conditionally match only if there is a single "s". Hence, all other expressions with a trailing "s" are now parsed as intended.

Dimensionless units such as "abs" or purely "dimensionless" are already handled before the unitParser is even called in the unitCreator module. See the following code:

https://github.com/EnzymeML/PyEnzyme/blob/127735c150b57ac1883b176365d0b00f6468f9ab/pyenzyme/enzymeml/tools/unitcreator.py#L68-L86

The bug was fixed in b8b6cff and dimensionless unit handling improved in 127735c.

JR-1991 commented 3 years ago

Dimensionless units "abs" and "absorbance" were dropped (see issue #8 ). The given issue is now resolved and closed.