IHTSDO / snomed-scg-parser

An Open Source Java library for parsing SNOMED Post-Coordinated expressions written in SNOMED Compositional Grammar.

Common UTF-8 characters in terms break parsing #10

Open johngrimes opened 1 year ago

johngrimes commented 1 year ago

The grammar at parser-generation/SCG.txt (and the specification) seems to define the subset of UTF-8 characters that are valid within a "term".

Unfortunately, this seems to exclude many common characters. For example, this expression is rejected because of the é:

<<< 45815001|Béclard's hernia|:{116676008|Associated morphology|=414403008|Herniated structure|,363698007|Finding site|=818993005|Structure of organ within abdominopelvic cavity|},{116676008|Associated morphology|=414402003|Hernial opening|,363698007|Finding site|=79908009|Saphenous opening structure|}

Parsing this produces the following error:

line 1:15 mismatched input 'c' expecting {'\u0080', '\u0081', '\u0082', '\u0083', '\u0084', '\u0085', '\u0086', '\u0087', '\u0088', '\u0089', '\u008A', '\u008B', '\u008C', '\u008D', '\u008E', '\u008F', '\u0090', '\u0091', '\u0092', '\u0093', '\u0094', '\u0095', '\u0096', '\u0097', '\u0098', '\u0099', '\u009A', '\u009B', '\u009C', '\u009D', '\u009E', '\u009F', '\u00A0', '\u00A1', '\u00A2', '\u00A3', '\u00A4', '\u00A5', '\u00A6', '\u00A7', '\u00A8', '\u00A9', '\u00AA', '\u00AB', '\u00AC', '\u00AD', '\u00AE', '\u00AF', '\u00B0', '\u00B1', '\u00B2', '\u00B3', '\u00B4', '\u00B5', '\u00B6', '\u00B7', '\u00B8', '\u00B9', '\u00BA', '\u00BB', '\u00BC', '\u00BD', '\u00BE', '\u00BF'}
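
One possible reading of this error, offered as a guess rather than a confirmed diagnosis: the expected set is exactly the range \u0080 to \u00BF, which matches a UTF-8 continuation byte, and zero-based column 15 is the 'c' immediately after the é. That would be consistent with the grammar's byte-level UTF-8 ranges being applied to Java chars (UTF-16 code units), so that é (U+00E9) is taken for the lead byte of a two-byte sequence. The sketch below only illustrates the overlap; the class and helper names are hypothetical and are not part of this library, and the ranges C2-DF and 80-BF are the standard two-byte UTF-8 lead and continuation ranges, assumed to be what the grammar uses.

```java
import java.nio.charset.StandardCharsets;

public class TermCharDemo {

    // Hypothetical helper: standard UTF-8 two-byte lead range (0xC2-0xDF).
    static boolean inUtf8TwoByteLeadRange(int value) {
        return value >= 0xC2 && value <= 0xDF;
    }

    // Hypothetical helper: standard UTF-8 continuation ("tail") range (0x80-0xBF).
    static boolean inUtf8TailRange(int value) {
        return value >= 0x80 && value <= 0xBF;
    }

    public static void main(String[] args) {
        String term = "B\u00E9clard";   // "Béclard", escaped to avoid source-encoding issues
        char accented = term.charAt(1); // é as a single UTF-16 code unit
        char next = term.charAt(2);     // the 'c' reported in the error

        byte[] bytes = String.valueOf(accented).getBytes(StandardCharsets.UTF_8);

        // prints "code unit: U+00E9"
        System.out.printf("code unit: U+%04X%n", (int) accented);
        // prints "UTF-8 bytes: C3 A9"
        System.out.printf("UTF-8 bytes: %02X %02X%n", bytes[0] & 0xFF, bytes[1] & 0xFF);
        // prints "code unit in two-byte lead range: true"
        System.out.printf("code unit in two-byte lead range: %b%n", inUtf8TwoByteLeadRange(accented));
        // prints "'c' in tail range: false"
        System.out.printf("'c' in tail range: %b%n", inUtf8TailRange(next));
    }
}
```

If this reading is right, the byte ranges only make sense when the input is matched as raw UTF-8 bytes (C3 A9 for é), not as decoded characters.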

Is this intended? Could the set of characters permitted within a term in SCG be aligned with the set permitted within a description?