Clarify whether base64 encoding of license content is really optional

ben-sag commented 6 months ago

The specs (incl 1.4) are unclear about whether it is mandatory or optional to use base64 encoding for license text's "content".

The fact the doc for "encoding" states it "must be one of"... "base64" makes me think it's mandatory to use base64, but the fact it also says "Specifies the optional encoding" makes me think maybe it's optional (e.g. if I don't specify an encoding parameter maybe it's treated as normal unencoded text). I looked at the examples but there is none that have content without encoding.

Can you clarify, and in the next spec put something into the doc of the "content" field to make clear whether plain unencoded text is permitted? And maybe something in the examples to demonstrate unencoded text.

(btw I'm very much hoping that use of base64 is not mandatory for this field - many licenses make it legally necessary to include the full license text and the SBOM is a very natural place to put that... but I doubt a law court would accept a base64 encoding of the text as meeting that requirement)

mrutkows commented 6 months ago

If you look at the spec./schema (i.e., specifically, https://cyclonedx.org/docs/1.5/json/#metadata_component_licenses_oneOf_i0_items_license_text_content) the "content" field is the ONLY required field in the object. It is of type string with the description "The attachment data. Proactive controls such as input validation and sanitization should be employed to prevent misuse of attachment text." which means it can contain ANY "text" data including (sanitized if possible) plain text. In fact, the stated default value for "contentType" is "text/plain".

If the "content" is optionally "encoded", then the ONLY value allowed for encoding is "base64" which means the "content" would need to be decoded prior to taking any further action using the "contentType" value in order to properly interpret the decoded "text".

In short, feel free to simply fill in the "content" with plain text. However, from an automated processing perspective, base64 encoding is helpful IMO to avoid transmission (whitespace) errors as well as avoids hidden/embedded control characters or functions.

For further clarity, the "one of" language assures that schema will ONLY allow "one of" the values indicated; at this time, ONLY a single value is allowed (only if that field is used) which would be "base64". Again, limiting the encoding types, esp. to a single value, helps in automated processing so that no unexpected/unclear encoding values appear.

ben-sag commented 6 months ago

Thanks that's very helpful! I'm really glad non-encoded is an option.

I did quite carefully read all the spec (including the part you quoted) yesterday - multiple times :) - and couldn't convince myself 100% which way to interpret it (I found reasons for both possible interpretations :) ), so a brief remark in the next spec to make it explicit would be super helpful to avoid possible confusion. Perhaps add an example where the encode= is not specified. And in the "encoding" parameter you could state that the behaviour if the encoding parameter is not specified i.e. that the text is not encoded . (I realise that might sound too obvious to someone who already knows the intention of the spec, but the text stating it must be one of "base64" sent me off on a different interpretation. (I was also looking at this thread https://github.com/CycloneDX/cyclonedx-core-java/issues/100 where it looks like someone else concluded base64 was mandatory)

ben-sag commented 6 months ago

We're having a debate about this on https://github.com/CycloneDX/cyclonedx-core-java/issues/100 where it seems the main Java library for CycloneDX has taken a different interpretation of the encoding question.

jkowalleck commented 6 months ago

PS/EDIT: I learned I was wrong, the following is not true:

encoding is optional with a default value of base64

nope. no default value anymore, this was fixed. therefore, the following statements of mine are incorrect.

at the state of CycloneDX 1.5 and before:

fact: in schema specs, encoding is optional with a default value of base64
meaning: in data transport, the field encoding is optional, its value is always base64. In result, the content value MUST be base64 encoded.

In addition, the contentType describes the content before it was encoded.

example: https://cyclonedx.org/use-cases/#provenance

{
          "contentType": "text/xml",
          "content": "PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0idXRmLTgiID8+CjxTb2Z0d2FyZUlkZW50aXR5IHhtbDpsYW5nPSJFTiIgbmFtZT0iQ3J5cHRvIExpYnJhcnkiIHZlcnNpb249IjMuMC4wIiAKIHZlcnNpb25TY2hlbWU9Im11bHRpcGFydG51bWVyaWMiIAogdGFnSWQ9InN3aWRnZW4tNWRjYjc5YWYtYTFkMi02MWIzLTM0ZmQtNTM2YzUzYjA4ODEwXzMuMC4wIiAKIHhtbG5zPSJodHRwOi8vc3RhbmRhcmRzLmlzby5vcmcvaXNvLzE5NzcwLy0yLzIwMTUvc2NoZW1hLnhzZCI+IAogeG1sbnM6eHNpPSJodHRwOi8vd3d3LnczLm9yZy8yMDAxL1hNTFNjaGVtYS1pbnN0YW5jZSIgCiB4c2k6c2NoZW1hTG9jYXRpb249Imh0dHA6Ly9zdGFuZGFyZHMuaXNvLm9yZy9pc28vMTk3NzAvLTIvMjAxNS1jdXJyZW50L3NjaGVtYS54c2Qgc2NoZW1hLnhzZCIgPgogIDxNZXRhIGdlbmVyYXRvcj0iU1dJRCBUYWcgT25saW5lIEdlbmVyYXRvciB2MC4xIiAvPiAKICA8RW50aXR5IG5hbWU9IkV4YW1wbGUsIEluYy4iIHJlZ2lkPSJleGFtcGxlLmNvbSIgcm9sZT0idGFnQ3JlYXRvciIgLz4gCjwvU29mdHdhcmVJZGVudGl0eT4="
        }

In short, feel free to simply fill in the "content" with plain text.

this is wrong, as stated before.

ben-sag commented 6 months ago

Aha, well that's a shame. But thanks for clarifying, much appreciated!

Given several of us have come to different conclusions about what the current spec means it's certainly valuable to make the wording in the next spec more explicit.

But I hope you would also consider supporting human-readable/non-base64 encoding in future - the spec allows for other legal texts such as copyrights to be human-readable, and the lack of supporting this for custom license texts makes the format unusable as a machine-readable way to comply with legal requirements.

Even though I much prefer CycloneDX for all other aspects, it looks like I will have so switch over to SPDX SBOMs unfortunately :(

jkowalleck commented 6 months ago

re https://github.com/CycloneDX/specification/issues/382#issuecomment-1954163503

But I hope you would also consider supporting human-readable/non-base64 encoding in future - the spec allows for other legal texts such as copyrights to be human-readable, and the lack of supporting this for custom license texts makes the format unusable as a machine-readable way to comply with legal requirements.

@ben-sag , what do you mean? you can base64 any data, like a PDF contentType=application/pdf, or a plain text contentType=text/plain. This means, you clearly can transport an arbitrary copyright test via CycloneDX, like so:

{content": "KGMpIDIwMjQgSmFuZSBEb2Uu", "encoding": "base64"}

ben-sag commented 6 months ago

Yes you can include any text, but it's not human readable.... so although you're 99% of the way to having a really useful file format that can both list the components and fulfil the legal obligations (e.g. to publish license texts), the fact it's mandated to be in a non-human readable format (base64) stops the second use case. It's fine to allow base64 but forcing it as the only option is a surprising and imho unnecessary limitation. (Especially since you don't do that for other extracted text like copyrights but only license text).

jkowalleck commented 6 months ago

Yes you can include any text, but it's not human readable....

let me ask it this way: would you give a JSON file to your lawyer? and would they open it as a pure text file and read the raw text? literally give a lawyer this example https://raw.githubusercontent.com/CycloneDX/bom-examples/master/SBOM/juice-shop/via_npm/bare/bom.1.4.json
(Who with their right mind would read a 60 thousand lines long structured data document? You would use proper tooling for this, like a program that displays the relevant data in a human-appealing way, right?)

the fact it's mandated to be in a non-human readable format (base64) stops the second use case.

This is during transport. Nothing prevents you from base64-decoding the text, before handing it to some person.

[...] don't do that for other extracted text like copyrights but only license text).

The field you are referring to is of type "attachment", not "text". An "attachment" may be not only a license text; this could also be cryptographic assets, licensing information in a proprietary data format, some binary blob as an evidence for automation-based concluded data, ...

Please understand, that JSON and XML might be human-readable, but are not intended to be consumed by humans. They are data interchange formats for the digital age. These data are ingested by computer applications, like the web browser that you are currently using, which, in fact, received some data from github web servers, which was probably in no human-readable form during transport, and still, your web browser is able to display some text here ;-)

jkowalleck commented 6 months ago

@ben-sag I checked again and I was wrong with my original statement. As of CycloneDX 1.5, the encoding is indeed optional now, no default value anymore. Meaning this would be a valid attachment:

{"content": "(c) 2024 Jane Doe"}

sschuberth commented 6 months ago

As of CycloneDX 1.5, the encoding is indeed optional now, no default value anymore.

Hmm. Does that mean that depending on the specification / schema level chosen, a user of the Java library needs to manually encode e.g. license texts?

stevespringett commented 6 months ago

IMO, adding a method to the Java library that will encode/decode the license text would be a good feature to add.

sschuberth commented 6 months ago

adding a method to the Java library that will encode/decode the license text

IMO, just adding such a function is not of much value, the Java code for that already is a one-liner. Instead, the library should automatically Base64-encode whatever is required to be Base64-encoded.

ben-sag commented 6 months ago

As of CycloneDX 1.5, the encoding is indeed optional now, no default value anymore.

Hmm. Does that mean that depending on the specification / schema level chosen, a user of the Java library needs to manually encode e.g. license texts?

I couldn't see any change to the spec - from v1.2 right through to v1.5 the encoding field is marked as "optional".

So the Java library should not be doing any automatic transformations but just passing through the text unchanged.

I do think given the confusion on this thread there's a very strong case now for a small change making it clear/explicit in the spec whether non-base64 content is permitted (sounds like it is?).

mrutkows commented 6 months ago

nope. no default value anymore, this was fixed. therefore, the following statements of mine are incorrect.

I believe (and asking as well) what Jan is saying, being the gatekeeper for v1.6, is that v1.6 it is no longer "optional"???

However, the current JSON schema still indicates "text" is of type "attachment" which is still described as follows (links to current/committed v1.6 json schema):

   "attachment": {
      "type": "object",
      "title": "Attachment",
      "description": "Specifies the metadata and content for an attachment.",
      "required": [
        "content"
      ],
      ...
      "properties": {
        "contentType": {
          "type": "string",
          "title": "Content-Type",
          "description": "Specifies the content type of the text. Defaults to text/plain if not specified.",
          "default": "text/plain"
        },
        "encoding": {
          "type": "string",
          "title": "Encoding",
          "description": "Specifies the optional encoding the text is represented in.",
          "enum": [
            "base64"
          ],
          "meta:enum": {
            "base64": "Base64 is a binary-to-text encoding scheme that represents binary data in an ASCII string."
          }
        },
        "content": {
          "type": "string",
          "title": "Attachment Text",
          "description": "The attachment data. Proactive controls such as input validation and sanitization should be employed to prevent misuse of attachment text."
        }
      }
    },

which still has the only "required" field being "content" (i.e., "encoding" is not required) ... and the description of "encoding" still has "optional".

Perhaps there is an outstanding (not merged) pull request (PR) that has this projected change?

jkowalleck commented 6 months ago

TLDR; Event with CDX 1.6, attachment encoding is not required, it is optional -- https://github.com/CycloneDX/specification/blob/2873c49989725d08faa6b2d14c332bc9da174348/schema/bom-1.6.schema.json#L1128-L1130 Therefore, omitting the "encoding" field means no encoding is applied, as intended.

I believe (and asking as well) what Jan is saying, being the gatekeeper for v1.6, is that v1.6 it is no longer "optional"???

I was referring to a state when we had a schema issue during CycloneDX 1.3 JSON version only, when there was a default value for the "encoding" -- which was an unintended thing and was fixed in the CycloneDX JSON schema a while ago. This affected only JSON documents that were validated by the schema. This memory caused my confusion. Maybe by the fact that it drove me mad for some days, back then ... 😁

The actual CycloneDX specification never had a default value for the attachment's "encoding" field.

CycloneDX / specification

Clarify whether base64 encoding of license content is really optional #382