Closed shzhou12 closed 2 years ago
This is caused by the following basic definition in the parsr module:
https://github.com/RedHatInsights/insights-core/blob/bbc2186e17d039e0f6a3e4a8d3eabcfaec4670f7/insights/parsr/__init__.py#L1252-L1254
string.printable only includes the following ASCII characters, and no Chinese or Japanese characters:
>>> string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
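As a minimal sketch, the check below confirms that every character in string.printable is ASCII and that a CJK value falls outside the set. It assumes Python 3; the sample value is borrowed from the path used later in this thread, and the variable names are only illustrative.

```python
import string

allowed = set(string.printable)

# Every character in string.printable is plain ASCII.
all_ascii = all(ord(ch) < 128 for ch in allowed)

# Sample value; any CJK text behaves the same way.
value = "测试部"
rejected = [ch for ch in value if ch not in allowed]

print(all_ascii)
print(rejected)
```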
Since IniConfigFile allows other languages, such as Chinese, Japanese, and Korean, in string values, I think it's necessary to add the characters from these languages to QuotedString. But these languages have too many characters to list all of them in the String set. We may need to find a feasible way to redefine QuotedString instead of enumerating all the characters.
What's your idea, @bfahr, @ryan-blakley?
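One possible direction, sketched here under assumptions, is to decide membership with a predicate rather than an enumerated set. The helper is_allowed below is hypothetical and is not part of parsr; how such a predicate would be wired into QuotedString is not shown, and str.isprintable exists only on Python 3.

```python
import string

# Whitespace that string.printable accepts but str.isprintable() rejects.
_EXTRA_WHITESPACE = "\t\n\r\x0b\x0c"


def is_allowed(ch):
    """Hypothetical predicate: accept any printable Unicode character,
    plus the whitespace characters that string.printable already allows."""
    return ch.isprintable() or ch in _EXTRA_WHITESPACE


# Everything in string.printable is still accepted, and CJK passes too.
covers_ascii_pool = all(is_allowed(ch) for ch in string.printable)
print(covers_ascii_pool, is_allowed("测"), is_allowed("\x00"))
```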
I like the idea that @koalakangaroo shared with me: before parsing, we could replace these "invalid" characters or words with particular placeholder words formed from characters in the pool of valid characters.
This is quite a good approach for fixing this issue quickly. I think it's feasible, just like the IP obfuscation that we do during collection.
@bfahr, @ryan-blakley, Thoughts?
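The replacement idea could look roughly like the sketch below. It assumes Python 3. The function names mask_values and unmask_values and the MASKED_n_ token format are hypothetical and come neither from insights-core nor from this thread. Restoring the original text afterwards is an optional extra that the thread does not discuss.

```python
import re
import string

# Matches each run of characters that are outside string.printable.
_OUTSIDE_POOL = re.compile("[^" + re.escape(string.printable) + "]+")


def mask_values(text):
    """Replace each run of out-of-pool characters with a placeholder word.

    Returns the masked text and a token -> original mapping.
    """
    mapping = {}

    def _substitute(match):
        # Trailing underscore keeps MASKED_1_ from being a prefix of MASKED_10_.
        token = "MASKED_{0}_".format(len(mapping))
        mapping[token] = match.group(0)
        return token

    return _OUTSIDE_POOL.sub(_substitute, text), mapping


def unmask_values(text, mapping):
    """Put the original characters back after parsing."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


masked, mapping = mask_values("/1-测试部/xxxxxxxxx")
print(masked)
print(unmask_values(masked, mapping))
```

A known limitation of this sketch: input that already contains text shaped like a token (for example a literal MASKED_0_) would be restored incorrectly.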
@xiangce after playing around with the symbols, I found that the unidecode Python module can convert Unicode characters to ASCII characters. But it seems that module isn't available in RHEL. You mentioned replacing the characters; do you know of an easier way to replace Unicode characters?
@ryan-blakley - Nope, I have no idea about this either.
And I'm also not sure whether unidecode is suitable for this case. In the following example, the conversion adds a blank space to the proper noun "北京" -> "Bei Jing". From the very beginning, I was only thinking about "replacing", not "translating".
>>> from unidecode import unidecode
>>> city = "北京"
>>> print(unidecode(city))
Bei Jing
@xiangce - Yeah, I noticed the space; that was another reason I figured it wouldn't work. If you're good with replacing, I think the below may work then.
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'/1-测试部/xxxxxxxxx').encode('ascii', 'replace')
'/1-???/xxxxxxxxx'
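The output above is what Python 2 prints. On Python 3 the same call returns bytes, so a decode step is needed to get a str back. The wrapper below is a minimal sketch; the name to_ascii and its errors parameter are hypothetical.

```python
import unicodedata


def to_ascii(text, errors="replace"):
    """Normalize with NFKD, then force the result into ASCII.

    With errors="replace", every character that has no ASCII form
    becomes "?"; with errors="ignore", it is dropped instead.
    """
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", errors).decode("ascii")


print(to_ascii("/1-测试部/xxxxxxxxx"))
```

NFKD splits an accented Latin letter into a base letter plus a combining mark, so to_ascii("é") keeps the "e" and turns only the mark into "?". CJK ideographs have no such decomposition, so each one becomes a single "?".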
For the example ini file:
The parser IniConfigFile runs into the following exceptions when parsing the above ini file: