Drop support for optional data values starting with "x_"

jayjacobs commented 6 months ago

Having optional fields in structured data adds complexity for parsing at scale AND it is not being used in any sort of practical way right now.

there are two fields in use that are generated by the upconvert process that are either for legacy reasons or specifically for the CNA, not for the data or any end users.

containers.cna.x_ConverterErrors
containers.cna.x_legacyV4Record

There is one field added by what I assume are cve/json generators and we should consider adding this as a standard field into the schema:

containers.cna.x_generator - Note this has different formats in different records:
str (2389): CVE-2022-2913, CVE-2022-0836, CVE-2021-4248, CVE-2022-1788, CVE-2022-1005
dict (21564): CVE-2019-15966, CVE-2024-35996, CVE-2021-47172, CVE-2024-0559, CVE-2022-45790

And then there are three fields that may be isolated to redhat:

containers.cna.x_redHatCweChain (2 CVEs)
containers.cna.x_redhatCweChain (444 CVEs)
containers.cna.affected.x_redhatStatus (2 CVEs)

There are no other uses of any optional "x_" instances.

mprpic commented 6 months ago

containers.cna.x_redHatCweChain (2 CVEs)

These have been cleaned up.

containers.cna.x_redhatCweChain (444 CVEs)

We use this for all CVEs where we have more fine-grained CWE data, so I do find it useful. I would probably also argue there are other cases where some data can be included in x_ attributes first just to identify its viability and then move it to an official field and clean up the previous usages. I thought that was the whole point of x_ attributes.

containers.cna.affected.x_redhatStatus (2 CVEs)

These are no longer used because x_ attributes are only allowed at the root level. The two records that still had them were also cleaned up.

jayjacobs commented 6 months ago

Awesome, thanks for the response @mprpic and appreciate the cleanup.

The request to drop "x_*" fields is not based on historical precedence at all. There may have been decisions made and discussions about the value and purpose of these optional fields. I'm just pointing out in practice there is now one custom field from one CNA (and then the x_generator field), and the cost of writing an exception into parsers may not be worth the global benefit of allowing anyone to stick anything into structured data.

I am working through my experience parsing and using this data. Having optional fields that could be applied inconsistency across records opens up parsers to potential errors and even modes of failures. While I currently have been forced to create an exception in my code for this one source of JSON data, it would be nice (higher quality, cleaner and standardized data) if every JSON parser created did not have to write an exception that requires a regex check against fields for this data.

mprpic commented 6 months ago

x_generator could indeed be made into a proper object similar to how the CSAF standard does it: https://docs.oasis-open.org/csaf/csaf/v2.0/os/csaf-v2.0-os.html#321123-document-property---tracking---generator

Our CWE data could also be moved into a specific note of some sort or somewhere else, so I guess I'd be fine with dropping the x_ attributes.

Though, just out of curiosity, how are you parsing this data that there extra fields would cause issues? In my mind, whatever you get from the record that validates against the schema is correct, and you get to choose whichever fields you want to make use of. You can completely ignore any attributes that start with x_ and not handle them at all.

jayjacobs commented 6 months ago

I am trying to streamline and standardize all of my JSON intake and I'm building out multiple layers/steps for each data ingestion task. Meaning I have a single JSON parsing level that infers the data schema from the JSON data (since many JSON sources do not create schemas and/or maintain them properly). The challenge here is that JSON stores everything as a single stand-alone record, and for data analysis and knowledge extraction I need the data in a columnar format. Which means extracting structured and normalized data and then storing data in parquet files. If one CVE record adds a "x_foobar" field, the parser would add an "x_foobar" field for every CVE record and leave it blank for all but one. Moving to a columnar format is very common for data analysis.

The main challenges I have with the "x_*" records is that any one of the few hundred CNAs can "legally" dump whatever structure they want and I have to deal with it or code up an exception for this one particular source of data (which seems like I'll need to do for other data sources anyway). A great example of this is the decision to include a full copy of the v4 record in the x_legacyV4Record, the first few passes I did on a generic JSON parser took twice as long to run (embarrassingly my first parser took over 2 hours to run, excluding the x_legacyV4Record it was 1h20m on the full collection of CVE data, now it's down to under 10 minutes). That x_legacyV4Record section added several dozen (unnecessary) parquet files and generated several data consistency errors (since many v4 records fails schema validation).

After saying all of that, my local solution isn't really what this issue is about. I'll have to work in exceptions for weird data into my process anyway and this suggested change to the 5.2 schema isn't going to affect my own implementation or help solve a problem for me personally because I'm going to code around it either way.

Allowing optional data (perhaps complex data structures) with federated data collection can present a challenge to anyone trying to automatically parse the data. The fewer "surprises" we can keep out of the data the better. And combined with the fact that nobody is taking any sort of real advantage of these fields means that the CVE program is supporting complexity with little (or no) reward. Hence I suggest we remove the support for something that presents risks to consumers/parsers with very little benefit.

mprpic commented 6 months ago

+1, your reasoning makes sense!

Will you be sharing the final parquet files somewhere btw? It would be neat to define some common queries on top of those files that could be run through DuckDB à la https://www.nikolasgoebel.com/2024/05/28/duckdb-doesnt-need-data.html :-)

jayjacobs commented 6 months ago

My first pass at parsing CVE JSON into parquet files was in R and is rather slow, but those parquet files are in this repo: https://github.com/jayjacobs/visualcve/tree/main/cache/cveparse. The latest pass is in python and a lot more robust (and much faster). That repo is the backend for the github pages: https://jayjacobs.github.io/visualcve/

I had intended to refresh that repo fairly often, but so far that's maybe weekly if I'm lucky. :)

jayjacobs commented 3 weeks ago

Reference to #175 as well

CVEProject / cve-schema

Drop support for optional data values starting with "x_" #318