The parser IniConfigFile runs into exceptions when the ini file contains Chinese or Japanese strings

shzhou12 commented 2 years ago

For the example ini file:

[Global]
secret-name = "vsphere-creds"
secret-namespace = "kube-system"
insecure-flag = "1"

[Workspace]
server = "xxxxxx"
datacenter = "1-测试部"
default-datastore = "xxxxxxxx"
folder = "/1-测试部/xxxxxxxxx"

[VirtualCenter "xxxxxx"]
datacenters = "1-测试部"

The parser IniConfigFile runs into the following exceptions when parsing the above ini file:

  File "/Users/shan/work/ccx/lib/python3.7/site-packages/insights/core/__init__.py", line 1441, in parse_content
    super(IniConfigFile, self).parse_content(content)
  File "/Users/shan/work/ccx/lib/python3.7/site-packages/insights/core/__init__.py", line 349, in parse_content
    self.doc = self.parse_doc(content)
  File "/Users/shan/work/ccx/lib/python3.7/site-packages/insights/core/__init__.py", line 1438, in parse_doc
    return iniparser.parse_doc("\n".join(content), self, return_defaults=True, return_booleans=False)
  File "/Users/shan/work/ccx/lib/python3.7/site-packages/insights/parsr/iniparser.py", line 100, in parse_doc
    res = Entry(children=Top(content), src=ctx)
  File "/Users/shan/work/ccx/lib/python3.7/site-packages/insights/parsr/__init__.py", line 356, in __call__
    raise Exception(err.read())
Exception: At line 10 column 17:
KVPair -> EOL
    Expected EOL. Got '测'.
KVPair -> EOF
    Expected end of input. Got '测'.
KVPair -> any whitespace
    Expected any whitespace. Got '测'.
KVPair -> Sep
    Expected Sep. Got '测'.
any whitespace
    Expected any whitespace. Got '测'.
Literal'#'
    Expected '#'. Got '测'.
Literal';'
    Expected ';'. Got '测'.
KVPair -> any whitespace
    Expected any whitespace. Got '测'.
KVPair -> any whitespace
    Expected any whitespace. Got '测'.
KVPair
    Expected 1 of [' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ';', '<', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '\\', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~']. Got '测'.
any whitespace
    Expected any whitespace. Got '测'.
Literal'#'
    Expected '#'. Got '测'.
Literal';'
    Expected ';'. Got '测'.
Header -> any whitespace
    Expected any whitespace. Got '测'.
Header
    Expected [. Got '测'.
any whitespace
    Expected any whitespace. Got '测'.
EOF
    Expected end of input. Got '测'.

xiangce commented 2 years ago

This is caused by the following basic definition in the parsr module: https://github.com/RedHatInsights/insights-core/blob/bbc2186e17d039e0f6a3e4a8d3eabcfaec4670f7/insights/parsr/__init__.py#L1252-L1254

string.printable includes only the following ASCII characters, and no Chinese or Japanese characters:

>>> import string
>>> string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
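
As a quick check of the point above, the following sketch lists the characters of the failing value that fall outside string.printable. The variable names are only illustrative; nothing here calls into parsr itself.

```python
import string

# The value from the example ini file that triggers the exception.
value = "1-测试部"

# Collect the characters that string.printable does not contain.
outside = [ch for ch in value if ch not in string.printable]
print(outside)

# Every character in string.printable is plain ASCII.
only_ascii = all(ord(ch) < 128 for ch in string.printable)
print(only_ascii)
```

The first non-matching character is '测', the same character reported in the parser error.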

Since IniConfigFile allows other languages, such as Chinese, Japanese, and Korean, in string values, I think it's necessary to add the characters from these languages to QuotedString. But these languages have too many characters to list them all in the String set. We may need to find a feasible way to redefine QuotedString instead of enumerating every character.

What do you think, @bfahr, @ryan-blakley?
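
One way to picture "redefining instead of enumerating" is to describe a quoted string by what it must not contain, rather than by a whitelist of allowed characters. The sketch below uses the standard re module purely as an illustration; QUOTED and read_quoted are hypothetical names, this is not the parsr API, and how such a rule would be expressed in parsr is left open.

```python
import re

# Hypothetical pattern, not part of parsr: an opening quote, then any mix of
# backslash escapes and characters that are neither a quote nor a backslash,
# then a closing quote. No character set is enumerated.
QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')


def read_quoted(text):
    """Return the body of a leading quoted string, or None if there is none."""
    match = QUOTED.match(text)
    return match.group(1) if match else None


print(read_quoted('"1-测试部"'))
print(read_quoted('"/1-测试部/xxxxxxxxx"'))
```

Because the rule is negative, any Unicode character is accepted inside the quotes without being listed.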

xiangce commented 2 years ago

I like the idea that @koalakangaroo shared with me: before parsing, we could replace these "invalid" characters/words with suitable placeholder words formed from characters in the valid pool.

This is quite a good approach for fixing this issue quickly. I think it's feasible, much like the IP obfuscation that we do during collection.
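
A minimal sketch of that replacement idea, under some assumptions: every run of non-ASCII characters is treated as "invalid", the placeholder format `__U<n>__` is made up for illustration, and mask/unmask are hypothetical helpers rather than anything in insights-core. A mapping is kept so the original words can be restored after parsing.

```python
import re

# Assumption: any run of non-ASCII characters counts as "invalid" for the parser.
NON_ASCII = re.compile(r"[^\x00-\x7f]+")


def mask(text):
    """Replace each non-ASCII run with an ASCII placeholder; return text and mapping."""
    by_word = {}   # original word -> placeholder
    mapping = {}   # placeholder -> original word

    def repl(match):
        word = match.group(0)
        if word not in by_word:
            token = "__U{0}__".format(len(by_word))  # hypothetical token format
            by_word[word] = token
            mapping[token] = word
        return by_word[word]

    return NON_ASCII.sub(repl, text), mapping


def unmask(text, mapping):
    """Put the original words back in place of their placeholders."""
    for token, word in mapping.items():
        text = text.replace(token, word)
    return text


line = 'folder = "/1-测试部/xxxxxxxxx"'
masked, mapping = mask(line)
print(masked)
print(unmask(masked, mapping) == line)
```

Unlike a one-way substitution, the same word always maps to the same placeholder, so values such as `datacenter` and `datacenters` in the example file would still compare equal after masking.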

@bfahr, @ryan-blakley, Thoughts?

ryan-blakley commented 2 years ago

@xiangce after playing around with the symbols, I found that the unidecode Python module can convert the Unicode characters to ASCII characters. But it seems that module isn't available in RHEL. You mentioned replacing the characters; do you know of an easier way to replace Unicode characters?

xiangce commented 2 years ago

@ryan-blakley - Nope, I have no idea about this either.

I'm also not sure unidecode is suitable for this case. As the following example shows, the conversion adds a blank space to the proper noun: "北京" -> "Bei Jing". From the very beginning, I was thinking about "replacing", not "translating".

>>> from unidecode import unidecode
>>> city = "北京"
>>> print(unidecode(city))
Bei Jing

ryan-blakley commented 2 years ago

@xiangce - Yeah, I noticed the space; that was another reason I figured it wouldn't work. If you're good with replacing, I think the following may work.

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'/1-测试部/xxxxxxxxx').encode('ascii', 'replace')
'/1-???/xxxxxxxxx'
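
On Python 3 the `encode` call returns `bytes` rather than `str`, so a version that works there needs a final decode. The sketch below wraps the same normalize-and-replace step in a hypothetical helper named to_ascii; it also shows that the replacement is lossy, since different words of the same length collapse to the same string.

```python
import unicodedata


def to_ascii(text):
    """Replace every character with no ASCII form by '?', returning a str."""
    normalized = unicodedata.normalize("NFKD", text)
    # encode() yields bytes on Python 3; decode back so callers get text.
    return normalized.encode("ascii", "replace").decode("ascii")


print(to_ascii("/1-测试部/xxxxxxxxx"))

# Two different three-character words become indistinguishable.
print(to_ascii("测试部") == to_ascii("北京市"))
```

If the parsed values need to be told apart or restored later, this one-way replacement would not be enough on its own.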

RedHatInsights / insights-core

The parser IniConfigFile runs into exceptions when the ini file contains Chinese or Japanese strings #3450