Open kurtseifried opened 2 years ago
@kurtseifried Because the underlying database used by CVE Services is MongoDB and Mongo has a 16MB limit on a single document:
https://www.mongodb.com/docs/manual/reference/limits/#mongodb-limit-BSON-Document-Size
All of the limits in the schema are an attempt to make all of the data (when combined into a single record/document) fit within this limit. If you have supporting media that exceeds the limit in the schema, you could always upload it wherever you'd like and link to it in references :wink:
Is this 16 megabyte/record limit documented anywhere? Are there other MongoDB limitations we should be aware of? Ditto for other software, e.g. anything else in the CVE pipeline that will result in restrictions?
The descriptions field for example says:
Text in a particular language with optional alternate markup or formatted representation (e.g., Markdown) or embedded media.
Is everything JSON supports available, or are there limitations like Unicode/etc?
It's just a known limitation of BSON document size in Mongo. I don't think it's explicitly noted anywhere in the CVE Services documentation. Note that the largest CVE record currently in existence is a whopping 89 KB :-) Unless we start using CVE records as PDF archives, I don't see us reaching the 16MB limit with text that easily.
As for what you can store in JSON, really anything that passes the schema validation. If it's valid JSON, then it can be stored in Mongo, I assume. Whatever language you use to construct CVE record JSON surely has a "json" library that spits out valid JSON with whatever content you include in it (be it Unicode or anything else).
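To illustrate the point about JSON libraries, here is a minimal Python sketch. The description object is made up for illustration (its `lang`/`value` keys are an assumption, not quoted from the schema), and it only shows that the standard `json` module round-trips non-ASCII text; it says nothing about what CVE Services itself does on ingest.

```python
import json

def roundtrip(obj, ensure_ascii=True):
    """Serialize obj to JSON text and parse it back."""
    return json.loads(json.dumps(obj, ensure_ascii=ensure_ascii))

# Hypothetical description entry with accented, CJK, and emoji characters.
description = {"lang": "en", "value": "Überlauf in naïve parser: 漏洞 \U0001F41B"}

# Default output escapes non-ASCII as \uXXXX, so the text is pure ASCII.
escaped = json.dumps(description)

# ensure_ascii=False keeps the characters as-is; both forms are valid JSON.
literal = json.dumps(description, ensure_ascii=False)

print(escaped.isascii(), literal.isascii())
print(roundtrip(description) == description)
```

Either form parses back to the same object, so the choice mostly affects how many bytes the serialized record takes.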
The current services architecture uses one document object to store all CNA containers, so we are being mindful that one CNA should not take up all of the space. If the architecture changes, we can expand the limit (subject to the max document size). Target for 6.0.
What could be a strategy to store large descriptions when trying to convert existing CVEs with rich data? Take CVE-2023-38507 for example.
https://osv.dev/vulnerability/GHSA-24q2-59hm-rh9r
The description currently exceeds the 16K limit. The requirement is to produce self-contained CVE 5.x records that do not refer to external URLs for things like the description. Any thoughts?
It's used by very few CVE Records. Should we continue to support this in the schema? If yes, what are the justifications? If yes, how can we increase the size?
From where I sit I think larger descriptions would be very nice to have (see that GHSA linked above as an example).
What could be a strategy to store large descriptions when trying to convert existing CVEs with rich data?
Assuming the entire CVE is stored as a single document then perhaps the individual fields could be unbounded with a size limit placed on the document itself. 16MB for text does feel sufficient for the foreseeable future.
So supportingMedia is either data that can be jammed into JSON (e.g. a text document), or Base64 encoded. One problem: it's limited to 16384 in size, which means with base64 encoding you lose 25% of the space (https://developer.mozilla.org/en-US/docs/Glossary/Base64 "Each Base64 digit represents exactly 6 bits of data."). So that limits it to 12k plus whatever compression you can apply to it. 12k is not a lot of data. Data storage and bandwidth are cheap in 2022, why not up the limit significantly?
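The arithmetic can be checked with a short Python sketch. It assumes the 16384 figure counts Base64 characters, as described above; the helper name is hypothetical.

```python
import base64

# Limit quoted in this comment, assumed to count encoded characters.
SUPPORTING_MEDIA_MAX_CHARS = 16384

def max_raw_bytes(encoded_limit):
    """Largest raw payload whose padded Base64 form fits in encoded_limit characters."""
    # Every 4 Base64 characters carry 3 raw bytes (6 bits per character).
    return (encoded_limit // 4) * 3

raw_capacity = max_raw_bytes(SUPPORTING_MEDIA_MAX_CHARS)

# Encode a payload of exactly that size, and one byte more, to confirm.
at_limit = base64.b64encode(bytes(raw_capacity))
over_limit = base64.b64encode(bytes(raw_capacity + 1))

print(raw_capacity, len(at_limit), len(over_limit))
```

So 12288 raw bytes fill the field exactly, and a single extra byte pushes the encoded form past the limit.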