standardize encoding - Githubissues

eslerm commented 11 months ago

Currently, most files are ASCII-encoded, but some are not. At least 213,576 files are ASCII and at least 6,330 are non-ASCII.

Please unify all JSON files to a single encoding. A GitHub Action to verify the encoding would be helpful in the long term.

For a hacky check:


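#!/usr/bin/env python3
"""Count files that `file` reports as ASCII versus anything else."""

import glob
import subprocess

count_ascii = 0
count_non_ascii = 0

for path in glob.glob("cves/**/*.json", recursive=True):
    # `file -bi` prints a brief MIME type plus charset (e.g. "...; charset=us-ascii").
    res = subprocess.run(args=["file", "-bi", path], capture_output=True)
    if b"ascii" in res.stdout:
        count_ascii += 1
    else:
        # List every file that is not plain ASCII so it can be inspected.
        count_non_ascii += 1
        print(path)

print(f"ascii:     {count_ascii}")
print(f"non-ascii: {count_non_ascii}")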

Marcono1234 commented 10 months ago

The JSON files should be UTF-8 encoded, I assume (at least the JSON specification requires that). But 7-bit ASCII happens to be a subset of UTF-8, which is why some JSON files might appear to use ASCII encoding. So in that sense all JSON files should have the same encoding, i.e. UTF-8.

Or did you notice any JSON files which are not using UTF-8 encoding, or contain malformed UTF-8 data?
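
To illustrate the subset relationship, here is a minimal sketch that classifies raw bytes as ASCII, non-ASCII UTF-8, or invalid UTF-8, using only Python's built-in strict decoders. The function name classify_encoding is a hypothetical helper, and the cves/**/*.json glob is borrowed from the script earlier in this thread; neither is part of any project tooling.

```python
#!/usr/bin/env python3
import glob
from collections import Counter


def classify_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'invalid' for a raw byte string."""
    try:
        data.decode("ascii")
        return "ascii"  # pure 7-bit ASCII, which is also valid UTF-8
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")  # strict decoding: malformed sequences raise
        return "utf-8"
    except UnicodeDecodeError:
        return "invalid"


if __name__ == "__main__":
    counts = Counter()
    for path in glob.glob("cves/**/*.json", recursive=True):
        with open(path, "rb") as f:
            kind = classify_encoding(f.read())
        counts[kind] += 1
        if kind == "invalid":
            print(path)  # only these files would need fixing
    for kind in ("ascii", "utf-8", "invalid"):
        print(f"{kind}: {counts[kind]}")
```

With this kind of check, a file counted as "ascii" is still valid UTF-8, so only "invalid" results would point to a real encoding problem.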

MarcWarrior commented 5 months ago

Hi, Marcono1234

I'm trying to validate all the JSON files against CVE_JSON_5.0_schema.json. I load each file's content with the open function, and I take the encoding parameter from chardet.detect(json_file_content)['encoding'].

Everything is OK until 'CVE-2019-25136.json', where the detected encoding is 'charmap' and decoding fails: 'charmap' codec can't decode byte 0x81 in position 2466: character maps to...

Line 61: "credits": [ { "lang": "en", "value": "Emilio Cobos Álvarez" } ],

The lang is "en", but Á is not an English letter.

Does it need to be modified? Thanks.

Marcono1234 commented 5 months ago

@MarcWarrior, as mentioned above, JSON is expected to be UTF-8 encoded. There is no need to use chardet.detect; just hardcode 'utf-8' as the encoding.
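
A minimal sketch of that suggestion, assuming the files are read with Python's built-in open function and json module; load_json_utf8 is a hypothetical helper name:

```python
import json


def load_json_utf8(path):
    """Load a JSON file, always decoding it as UTF-8 (no detection step)."""
    # encoding="utf-8" overrides the platform default, which may differ
    # (for example a legacy code page on some Windows setups).
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

The parsed document can then be passed to whatever schema validator is in use.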

And to me it looks like CVE-2019-25136.json contains valid UTF-8 data.
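
One plausible explanation for the error reported above, not confirmed in this thread, is that the file was decoded with a single-byte code page such as cp1252. In UTF-8, "Á" is the two bytes 0xC3 0x81, and Python's cp1252 codec has no character assigned to 0x81. The sketch below shows the difference; try_decode is a hypothetical helper.

```python
def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# "\u00c1" is the letter A with acute accent, as in the credits entry.
utf8_bytes = "\u00c1".encode("utf-8")

as_utf8 = try_decode(utf8_bytes, "utf-8")      # decodes back to the same letter
as_cp1252 = try_decode(utf8_bytes, "cp1252")   # fails on the 0x81 byte

print(utf8_bytes, as_utf8 == "\u00c1", as_cp1252)
```

If that is what happened, the file itself is fine and only the decoding step needs to change.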

CVEProject / cvelistV5

standardize encoding #25