aboutcode-org / license-expression

Utility library to parse, normalize and compare License expressions for Python using a boolean logic engine. For expressions using SPDX or any other license id scheme.
http://aboutcode.org
Other

Error thrown when Invalid license key character provided #76

Open rnjudge opened 1 year ago

rnjudge commented 1 year ago

Tern uses license-expression to validate SPDX licenses. When a license key contains invalid characters (e.g. / or ,), validate() raises an unhandled AttributeError instead of reporting the key as invalid.

>>> import license_expression
>>> from license_expression import get_spdx_licensing
>>> licensing = get_spdx_licensing()
>>> license_data = "MIT/X11"
>>> licensing.validate(license_data).errors == []

Traceback (most recent call last):
  File "/home/rose/ternenv/lib/python3.10/site-packages/license_expression/__init__.py", line 777, in validate
    parsed_expression = self.parse(expression, strict=strict)
  File "/home/rose/ternenv/lib/python3.10/site-packages/license_expression/__init__.py", line 539, in parse
    tokens = list(self.tokenize(
  File "/home/rose/ternenv/lib/python3.10/site-packages/license_expression/__init__.py", line 603, in tokenize
    for token in tokens:
  File "/home/rose/ternenv/lib/python3.10/site-packages/license_expression/__init__.py", line 996, in replace_with_subexpression_by_license_symbol
    for token_group in token_groups:
  File "/home/rose/ternenv/lib/python3.10/site-packages/license_expression/__init__.py", line 935, in build_token_groups_for_with_subexpression
    tokens = list(tokens)
  File "/home/rose/ternenv/lib/python3.10/site-packages/license_expression/__init__.py", line 597, in <genexpr>
    tokens = (t for t in tokens if t.string and t.string.strip())
  File "/home/rose/ternenv/lib/python3.10/site-packages/license_expression/__init__.py", line 921, in build_symbols_from_unknown_tokens
    for symtok in build_token_with_symbol():
  File "/home/rose/ternenv/lib/python3.10/site-packages/license_expression/__init__.py", line 901, in build_token_with_symbol
    toksym = LicenseSymbol(string)
  File "/home/rose/ternenv/lib/python3.10/site-packages/license_expression/__init__.py", line 1213, in __init__
    raise ExpressionError(
license_expression.ExpressionError: Invalid license key: the valid characters are: letters and numbers, underscore, dot, colon or hyphen signs and spaces: 'MIT/X11'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rose/ternenv/lib/python3.10/site-packages/license_expression/__init__.py", line 780, in validate
    expression_info.invalid_symbols.append(e.token_string)
AttributeError: 'ExpressionError' object has no attribute 'token_string'
>>> license_data = "MIT,X11"
>>> licensing.validate(license_data).errors == []
Traceback (most recent call last):
  File "/home/rose/ternenv/lib/python3.10/site-packages/license_expression/__init__.py", line 777, in validate
    parsed_expression = self.parse(expression, strict=strict)
  File "/home/rose/ternenv/lib/python3.10/site-packages/license_expression/__init__.py", line 539, in parse
    tokens = list(self.tokenize(
  File "/home/rose/ternenv/lib/python3.10/site-packages/license_expression/__init__.py", line 603, in tokenize
    for token in tokens:
  File "/home/rose/ternenv/lib/python3.10/site-packages/license_expression/__init__.py", line 996, in replace_with_subexpression_by_license_symbol
    for token_group in token_groups:
  File "/home/rose/ternenv/lib/python3.10/site-packages/license_expression/__init__.py", line 935, in build_token_groups_for_with_subexpression
    tokens = list(tokens)
  File "/home/rose/ternenv/lib/python3.10/site-packages/license_expression/__init__.py", line 597, in <genexpr>
    tokens = (t for t in tokens if t.string and t.string.strip())
  File "/home/rose/ternenv/lib/python3.10/site-packages/license_expression/__init__.py", line 921, in build_symbols_from_unknown_tokens
    for symtok in build_token_with_symbol():
  File "/home/rose/ternenv/lib/python3.10/site-packages/license_expression/__init__.py", line 901, in build_token_with_symbol
    toksym = LicenseSymbol(string)
  File "/home/rose/ternenv/lib/python3.10/site-packages/license_expression/__init__.py", line 1213, in __init__
    raise ExpressionError(
license_expression.ExpressionError: Invalid license key: the valid characters are: letters and numbers, underscore, dot, colon or hyphen signs and spaces: 'MIT,X11'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rose/ternenv/lib/python3.10/site-packages/license_expression/__init__.py", line 780, in validate
    expression_info.invalid_symbols.append(e.token_string)
AttributeError: 'ExpressionError' object has no attribute 'token_string'

When the key contains only valid characters, validate() returns normally, and the result's errors list is non-empty (hence False below):

>>> license_data = "MIT-X11"
>>> licensing.validate(license_data).errors == []
False

I would expect the library to handle unexpected characters and report such expressions as invalid licenses instead of raising.

rnjudge commented 1 year ago

@pombredanne any thoughts on this?

pombredanne commented 1 year ago

@rnjudge

"MIT/X11" is not a valid license key: not an SPDX one and it further contains characters typically not supported in the SPDX spec.

There are multiple tokenizers to handle an expression: a simple one and one based on an automaton. The latter accepts arbitrary strings. A simple way to handle this is to create multiple aliases for a given license symbol:

>>> from license_expression import LicenseSymbol, Licensing
>>> symbol = LicenseSymbol(key="MIT", aliases=["MIT/X11", "MIT,X11"])
>>> l = Licensing(symbols=[symbol])
>>> l.parse("MIT/X11", simple=False)
LicenseSymbol('MIT', aliases=('MIT/X11', 'MIT,X11'), is_exception=False)

Here simple=False forces the advanced automaton-based tokenizer, which can recognize almost any alias string, even one that contains spaces or is not syntactically valid.

You would need to know ahead of time all the supported aliases and build your own licensing for this.
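If the aliases are known ahead of time, a plainly different option is to map them to canonical keys before the string reaches the library. The sketch below is a pre-processing step, not the LicenseSymbol alias mechanism shown above. The table `KEY_ALIASES` and the function `normalize_license_key` are hypothetical names, and the two entries are taken from the example above.

```python
# Hypothetical alias table; the two entries come from the example above.
KEY_ALIASES = {
    "MIT/X11": "MIT",
    "MIT,X11": "MIT",
}


def normalize_license_key(raw, aliases=KEY_ALIASES):
    """Map a known alias to its canonical key; return other input unchanged.

    Only whole-string aliases are handled: an alias embedded in a larger
    expression such as "MIT/X11 OR Apache-2.0" is left as-is.
    """
    cleaned = raw.strip()
    return aliases.get(cleaned, cleaned)


print(normalize_license_key(" MIT/X11 "))
print(normalize_license_key("Apache-2.0"))
```

The normalized value can then be passed to `validate()` as usual. Unlike the automaton-based tokenizer, this does nothing for aliases that appear inside a compound expression.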

Alternatively, if you have a list of these, we could also add these aliases as a standard "key_aliases" in each license record in https://github.com/nexB/scancode-toolkit/blob/cc14890e1bb6264b01ddb96975cac54466bd6a64/src/licensedcode/models.py#L272 and then update the code here to also treat "key_aliases" as LicenseSymbol aliases in https://github.com/nexB/license-expression/blob/15481270d1080d18e94ad5c5e9618f07e07eb933/src/license_expression/__init__.py#L868

Note also that using scancode-toolkit will always be better for this:

>>> from licensedcode.cache import get_index
>>> idx = get_index()
>>> idx.match(query_string="MIT/X11", as_expression=True)
[LicenseMatch: 'mit', lines=(1, 1), matcher='1-hash', rid=mit_366.RULE, sc=99.0, cov=100.0, len=2, hilen=1, rlen=2, qreg=(0, 1), ireg=(0, 1)]

In practice, though, each package type or ecosystem has its own specialized way of providing license information, so this approach will not always work. The packagedcode module already handles this for each package manifest and format: https://github.com/nexB/scancode-toolkit/search?q=populate_license_fields&type=code

Even the standard code that works mostly across package types does much more than just using the license_expression library: https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/licensing.py

pombredanne commented 1 year ago

See also https://github.com/nexB/license-expression/issues/70 by @ivanayov