Open eslerm opened 11 months ago
The JSON files should be UTF-8 encoded I assume (at least the JSON specification requires that). But 7-bit ASCII happens to be a subset of UTF-8, which is why some JSON files might appear to use ASCII encoding. So in that sense, all JSON files should have the same encoding, i.e. UTF-8.
Or did you notice any JSON files which are not using UTF-8 encoding, or contain malformed UTF-8 data?
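A quick way to see why detectors report "ascii" for UTF-8 files (a minimal Python sketch; the sample strings are made up for illustration):

```python
# Pure-ASCII bytes decode identically under the "ascii" and "utf-8" codecs,
# which is why encoding detectors report "ascii" for many UTF-8 JSON files.
ascii_json = b'{"lang": "en"}'
assert ascii_json.decode("ascii") == ascii_json.decode("utf-8")

# A record containing a non-ASCII character is no longer valid ASCII,
# but it is still perfectly valid UTF-8:
utf8_json = '{"value": "Emilio Cobos Álvarez"}'.encode("utf-8")
decoded = utf8_json.decode("utf-8")  # succeeds

try:
    utf8_json.decode("ascii")
except UnicodeDecodeError:
    print("not ASCII, but still valid UTF-8")
```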
Hi, Marcono1234
I'm trying to validate all JSON files using CVE_JSON_5.0_schema.json, and I load the JSON content with the open function. The encoding parameter I get from chardet.detect(json_file_content)['encoding'].
Everything is OK until 'CVE-2019-25136.json', where the detected encoding is 'charmap':
'charmap' codec can't decode byte 0x81 in position 2466: character maps to...
Line 61:
"credits": [ { "lang": "en", "value": "Emilio Cobos Álvarez" } ],
The lang is en, but Á is not an English character. Does it need to be modified? Thanks
@MarcWarrior, as mentioned above, JSON is expected to be UTF-8 encoded. There is no need to use chardet.detect; just hardcode 'utf-8' as the encoding. And to me it looks like CVE-2019-25136.json contains valid UTF-8 data.
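For example (a minimal sketch; the file path and record content here are made up, since reading the real repository file would depend on your checkout):

```python
import json

# Write a small sample record containing a non-ASCII character; in practice
# you would open one of the real CVE record files instead.
path = "sample-record.json"
with open(path, "w", encoding="utf-8") as f:
    f.write('{"credits": [{"lang": "en", "value": "Emilio Cobos Álvarez"}]}')

# Hardcode UTF-8 instead of guessing with chardet; the JSON specification
# (RFC 8259) requires UTF-8 for JSON exchanged between systems.
with open(path, encoding="utf-8") as f:
    record = json.load(f)

print(record["credits"][0]["value"])  # Emilio Cobos Álvarez
```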
Currently, most files have ASCII encoding, but some do not. At least 213,576 files are ASCII and at least 6,330 are non-ASCII.
Please unify all JSON files to a single encoding. A GitHub Action to verify the encoding would be helpful long term.
For a hacky check:
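One possible version of such a check, sketched in Python (the directory and file names are made up for the demo; point it at the repository's record directory in practice):

```python
from pathlib import Path

def find_non_utf8(root) -> list:
    """Return JSON files under root whose bytes are not valid UTF-8."""
    bad = []
    for path in sorted(Path(root).rglob("*.json")):
        try:
            path.read_bytes().decode("utf-8")  # strict UTF-8 validation
        except UnicodeDecodeError:
            bad.append(path)
    return bad

# Demo with two throwaway files (directory name is hypothetical):
demo = Path("encoding-check-demo")
demo.mkdir(exist_ok=True)
(demo / "good.json").write_bytes('{"value": "Álvarez"}'.encode("utf-8"))
(demo / "bad.json").write_bytes(b'{"value": "\x81"}')  # invalid UTF-8 byte

print(find_non_utf8(demo))  # lists only bad.json
```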