Concerns about size of supportingMedia

CVEProject / cve-schema

This repository is used for the development of the CVE JSON record format. Releases of the CVE JSON record format will also be published here. This repository is managed by the CVE Quality Working Group.

Creative Commons Zero v1.0 Universal


Concerns about size of supportingMedia #200

Open kurtseifried opened 2 years ago

kurtseifried commented 2 years ago

So supportingMedia is either data that can be embedded directly in JSON (e.g. a text document) or Base64-encoded data. One problem: the value is limited to 16384 characters, which means that with Base64 encoding you lose 25% of the space (https://developer.mozilla.org/en-US/docs/Glossary/Base64 "Each Base64 digit represents exactly 6 bits of data."). That caps the raw payload at 12K, plus whatever compression you can apply to it. 12K is not a lot of data. Data storage and bandwidth are cheap in 2022, so why not raise the limit significantly?

                        "value": {
                            "type": "string",
                            "description": "Supporting media content, up to 16K. If base64 is true, this field stores base64 encoded data.",
                            "minLength": 1,
                            "maxLength": 16384
                        }
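The 12K figure can be checked with a short sketch. It assumes standard padded Base64 (4 output characters per 3 input bytes) and takes 16384 from the `maxLength` shown above; the helper names `max_raw_bytes` and `fits` are made up for illustration and are not part of any CVE tooling.

```python
import base64

MAX_LENGTH = 16384  # maxLength of the supportingMedia "value" field quoted above


def max_raw_bytes(max_chars: int) -> int:
    """Largest raw payload whose padded Base64 encoding fits in max_chars."""
    # Padded Base64 emits 4 characters for every (possibly partial) 3-byte group.
    return (max_chars // 4) * 3


def fits(raw: bytes, max_chars: int = MAX_LENGTH) -> bool:
    """True if the Base64 encoding of raw is at most max_chars long."""
    return len(base64.b64encode(raw)) <= max_chars


limit = max_raw_bytes(MAX_LENGTH)
print(limit)                        # 12288
print(fits(b"\x00" * limit))        # True
print(fits(b"\x00" * (limit + 1)))  # False
```

Compressing the payload before encoding would stretch this somewhat, but the encoded string still has to stay within the same 16384 characters.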

mprpic commented 2 years ago

@kurtseifried Because the underlying database used by CVE Services is MongoDB, and Mongo has a 16MB limit on a single document:

https://www.mongodb.com/docs/manual/reference/limits/#mongodb-limit-BSON-Document-Size

All of the limits in the schema are an attempt to make all of the data (when combined into a single record/document) fit within this limit. If you have supporting media that exceeds the limit in the schema, you could always upload it wherever you'd like and link to it in references :wink:

kurtseifried commented 2 years ago

Is this 16 megabyte/record limit documented anywhere? Are there other MongoDB limitations we should be aware of? Ditto for other software, e.g. anything else in the CVE pipeline that will result in restrictions?

The descriptions field for example says:

Text in a particular language with optional alternate markup or formatted representation (e.g., Markdown) or embedded media.

Is everything JSON supports available, or are there limitations like Unicode/etc?

mprpic commented 2 years ago

It's just a known limitation of BSON document size in Mongo, I don't think it's explicitly noted anywhere in the CVE Services documentation. Note that the largest CVE record currently in existence is a whopping 89 KB :-) Unless we start using CVE records as PDF archives, I don't see us reaching the 16MB limit with text that easily.

As for what you can store in JSON, really anything that passes the schema validation. If it's valid JSON, then I assume it can be stored in Mongo. Whatever language you use to construct CVE record JSON surely has a "json" library that spits out valid JSON with whatever content you include in it (be it Unicode or anything else).
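As a small illustration of that point, here is a sketch using Python's standard `json` module. The `description` object is a made-up example, not taken from a real CVE record, and its `lang`/`value` keys are an assumption about how a description entry looks.

```python
import json

# Hypothetical description entry with non-ASCII content.
description = {"lang": "ja", "value": "脆弱性の説明 (naïve) 🔒"}

# Default: non-ASCII characters are escaped as \uXXXX sequences.
ascii_json = json.dumps(description)
# ensure_ascii=False keeps the characters literal; the result is still valid JSON.
utf8_json = json.dumps(description, ensure_ascii=False)

# Both forms decode back to the same data.
assert json.loads(ascii_json) == description
assert json.loads(utf8_json) == description

# The byte size differs between the two forms even though the data is identical.
print(len(ascii_json.encode("utf-8")), len(utf8_json.encode("utf-8")))
```

Either form round-trips, so Unicode itself is not a problem for JSON. What does vary is the serialized byte count, which matters when reasoning about a byte-based limit such as the one on a stored document.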

chandanbn commented 1 year ago

The current services architecture uses one document object to store all CNA containers, so we are being mindful that one CNA should not take up all the space. If the architecture changes, we can expand the limit (subject to the max document size). Target for 6.0.

prabhu commented 5 months ago

What could be a strategy to store large descriptions when trying to convert existing CVEs with rich data? Take CVE-2023-38507 for example.

https://osv.dev/vulnerability/GHSA-24q2-59hm-rh9r

The description currently exceeds the 16K limit. The requirement is to produce self-contained CVE 5.x records that do not refer to external URLs for things like the description. Any thoughts?

ccoffin commented 5 days ago

It's used by very few CVE Records. Should we continue to support this in the schema? If yes, what are the justifications? If yes, how can we increase the size?

darakian commented 2 hours ago

From where I sit I think larger descriptions would be very nice to have (see that GHSA linked above as an example).

What could be a strategy to store large descriptions when trying to convert existing CVEs with rich data?

Assuming the entire CVE is stored as a single document, perhaps the individual fields could be unbounded, with a size limit placed on the document itself. 16MB for text does feel sufficient for the foreseeable future.
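A document-level check along those lines could look roughly like the sketch below. It is only an approximation: it measures the record as compact UTF-8 JSON, whereas the size MongoDB enforces is that of the BSON encoding, which will differ. The constant and function names are hypothetical, and the 16 * 1024 * 1024 cap is an assumption mirroring the Mongo limit mentioned earlier in the thread.

```python
import json

# Assumed cap, mirroring MongoDB's per-document limit discussed above.
MAX_DOCUMENT_BYTES = 16 * 1024 * 1024


def serialized_size(record: dict) -> int:
    """Size in bytes of the record as compact UTF-8 JSON (not its BSON size)."""
    text = json.dumps(record, ensure_ascii=False, separators=(",", ":"))
    return len(text.encode("utf-8"))


def within_document_limit(record: dict, limit: int = MAX_DOCUMENT_BYTES) -> bool:
    """True if the whole serialized record fits within limit bytes."""
    return serialized_size(record) <= limit


# Hypothetical record whose description is far larger than a 16K per-field cap.
record = {"descriptions": [{"lang": "en", "value": "x" * 100_000}]}
print(serialized_size(record), within_document_limit(record))
```

With a single whole-record check like this, a long description would only be rejected if the record as a whole grew too large. The concern raised earlier about one CNA using up space shared with other containers would still need its own rule.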