CVEProject / cvelistV5

CVE cache of the official CVE List in CVE JSON 5 format

standardize encoding #25

Open eslerm opened 11 months ago

eslerm commented 11 months ago

Currently, most files are ASCII-encoded, but some are not. At least 213,576 files are ASCII and at least 6,330 are non-ASCII.

Please unify all JSON files to a single encoding. A GitHub Action to verify the encoding would be helpful in the long term.

For a hacky check:


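#!/usr/bin/env python3
"""Count ASCII vs. non-ASCII JSON files using the `file` utility."""

import glob
import subprocess

count_ascii = 0
count_non_ascii = 0

for path in glob.glob("cves/**/*.json", recursive=True):
    # `file -bi` prints the MIME type and charset, e.g. "...; charset=us-ascii"
    res = subprocess.run(args=["file", "-bi", path], capture_output=True)
    if b"ascii" in res.stdout:
        count_ascii += 1
    else:
        # Report every file whose charset is not reported as ASCII
        count_non_ascii += 1
        print(path)

print(f"ascii:     {count_ascii}")
print(f"non-ascii: {count_non_ascii}")

Marcono1234 commented 10 months ago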

The JSON files should be UTF-8 encoded, I assume (at least the JSON specification requires that). But 7-bit ASCII happens to be a subset of UTF-8, which is why some JSON files might appear to use ASCII encoding. So in that sense all JSON files should have the same encoding, i.e. UTF-8.

Or did you notice any JSON files which are not using UTF-8 encoding, or contain malformed UTF-8 data?
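To make the distinction above concrete, a check can decode the raw bytes directly instead of relying on `file`. The following is only a minimal sketch: the function names `classify_encoding` and `scan` and the three labels are hypothetical and not part of the repository, and the glob pattern is the one from the script earlier in this thread.

```python
import glob
from collections import Counter


def classify_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'invalid' (hypothetical labels)."""
    try:
        # Strict decoding raises UnicodeDecodeError on malformed UTF-8.
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid"
    # Pure ASCII is valid UTF-8 in which every byte is below 0x80.
    return "ascii" if text.isascii() else "utf-8"


def scan(pattern: str = "cves/**/*.json") -> Counter:
    """Count files per label; the pattern is an assumption about the layout."""
    counts = Counter()
    for path in glob.glob(pattern, recursive=True):
        with open(path, "rb") as f:
            counts[classify_encoding(f.read())] += 1
    return counts
```

With a check like this, only files labelled 'invalid' would indicate a real encoding problem; 'ascii' and 'utf-8' files are both valid UTF-8.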

MarcWarrior commented 5 months ago

Hi, Marcono1234

I'm trying to validate all JSON files against CVE_JSON_5.0_schema.json. I load each file's content with the open function, and I take the encoding parameter from chardet.detect(json_file_content)['encoding'].

Everything is OK until 'CVE-2019-25136.json', where the detected encoding is 'charmap' and decoding fails: 'charmap' codec can't decode byte 0x81 in position 2466: character maps to...

Line 61: "credits": [ { "lang": "en", "value": "Emilio Cobos Álvarez" } ],

The lang is "en", but Á is not an English letter.

Does this need to be modified? Thanks

Marcono1234 commented 5 months ago

@MarcWarrior, as mentioned above, JSON is expected to be UTF-8 encoded. There is no need to use chardet.detect, just hardcode 'utf-8' as encoding.
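As a minimal sketch of that suggestion, the file can be opened with a fixed encoding and passed straight to the JSON parser. The helper name `load_cve_json` is hypothetical, and the `dict` return type assumes each record is a JSON object.

```python
import json


def load_cve_json(path: str) -> dict:
    """Load a JSON file, always decoding it as UTF-8 (no detection step)."""
    # A hardcoded encoding avoids depending on the platform default
    # or on a guess made by an encoding detector.
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

If a file really did contain malformed UTF-8, this would raise UnicodeDecodeError rather than silently misreading the data.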

And to me it looks like CVE-2019-25136.json contains valid UTF-8 data.
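The reported error is consistent with valid UTF-8 being decoded with a single-byte Windows code page. The sketch below assumes the detected codec was something like cp1252; the thread does not say which codec was actually chosen, so that part is a guess.

```python
name = "Emilio Cobos Álvarez"
encoded = name.encode("utf-8")

# "Á" (U+00C1) is stored as the two bytes 0xC3 0x81 in UTF-8.
a_acute_bytes = "Á".encode("utf-8")

# Decoding the same bytes as UTF-8 round-trips cleanly.
round_trip = encoded.decode("utf-8")

# Assumption: the detector picked a code page such as cp1252, which
# defines no character for byte 0x81, so decoding raises an error.
error = None
try:
    encoded.decode("cp1252")
except UnicodeDecodeError as exc:
    error = str(exc)

print(a_acute_bytes)
print(round_trip == name)
print(error)
```

The lang field describes the language of the text, not the character set, so a name containing Á does not make the file invalid.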