Open rnjudge opened 1 year ago
@pombredanne any thoughts on this?
@rnjudge
"MIT/X11" is not a valid license key: it is not an SPDX identifier, and it contains characters that are typically not supported by the SPDX spec.
There are multiple tokenizers to handle an expression: a simple one and one based on an automaton. The latter accepts arbitrary strings. A simple way to handle this is to create multiple aliases for a given license symbol:
>>> from license_expression import LicenseSymbol, Licensing
>>> symbol = LicenseSymbol(key="MIT", aliases=["MIT/X11", "MIT,X11"])
>>> l = Licensing(symbols=[symbol])
>>> l.parse("MIT/X11", simple=False)
LicenseSymbol('MIT', aliases=('MIT/X11', 'MIT,X11'), is_exception=False)
Here simple=False forces the use of the advanced automaton-based tokenizer, which can recognize almost any alias string, even one that contains spaces or is not syntactically correct.
You would need to know all the supported aliases ahead of time and build your own Licensing object for this.
Alternatively, if you have a list of these, we could also add these aliases as a standard "key_aliases" in each license record in https://github.com/nexB/scancode-toolkit/blob/cc14890e1bb6264b01ddb96975cac54466bd6a64/src/licensedcode/models.py#L272 and then update the code here to also treat "key_aliases" as LicenseSymbol aliases in https://github.com/nexB/license-expression/blob/15481270d1080d18e94ad5c5e9618f07e07eb933/src/license_expression/__init__.py#L868
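As a rough illustration of the "key_aliases" idea, here is a minimal standalone sketch that does not use scancode-toolkit or license-expression at all. The "key_aliases" field is the proposed one and does not exist yet; the records, build_alias_table and resolve are hypothetical names made up for this example.

```python
# Hypothetical license records. "key_aliases" is the proposed field;
# the alias values are the ones discussed in this issue.
RECORDS = [
    {"key": "mit", "spdx_license_key": "MIT", "key_aliases": ["MIT/X11", "MIT,X11"]},
    {"key": "apache-2.0", "spdx_license_key": "Apache-2.0", "key_aliases": []},
]


def build_alias_table(records):
    """Map every known name of a license (case-insensitively) to its key."""
    table = {}
    for record in records:
        names = [record["key"], record["spdx_license_key"]]
        names.extend(record.get("key_aliases", []))
        for name in names:
            normalized = name.casefold()
            existing = table.setdefault(normalized, record["key"])
            if existing != record["key"]:
                # The same alias must not point at two different licenses.
                raise ValueError(f"ambiguous alias {name!r}: {existing} vs {record['key']}")
    return table


def resolve(name, table):
    """Return the license key for a name or alias, or None if unknown."""
    return table.get(name.strip().casefold())


ALIAS_TABLE = build_alias_table(RECORDS)
print(resolve("MIT/X11", ALIAS_TABLE))
print(resolve("Foo/Bar", ALIAS_TABLE))
```

In the real change, the same per-record alias list would presumably be passed as the aliases argument of each LicenseSymbol rather than kept in a separate dict.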
Note also that using scancode-toolkit will always be better for this:
>>> from licensedcode.cache import get_index
>>> idx = get_index()
>>> idx.match(query_string="MIT/X11", as_expression=True)
[LicenseMatch: 'mit', lines=(1, 1), matcher='1-hash', rid=mit_366.RULE, sc=99.0, cov=100.0, len=2, hilen=1, rlen=2, qreg=(0, 1), ireg=(0, 1)]
But in practice, each package type/ecosystem has its own specialized ways to provide license information, so this approach will not always work. The packagedcode module already handles this for each package manifest and format: https://github.com/nexB/scancode-toolkit/search?q=populate_license_fields&type=code
Even the standard code that works mostly across package types does much more than just using the license_expression library: https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/licensing.py
See also https://github.com/nexB/license-expression/issues/70 by @ivanayov
Tern uses license-expression to validate SPDX licenses. When an invalid license key is provided (i.e. one that contains invalid characters like / or ,), license-expression throws an error when it should handle it.
When a valid license key is provided (i.e. no unexpected characters), the library returns as expected:
I would expect the library to handle unexpected characters and mark expressions with unexpected characters as an invalid license.
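To make the expected behavior concrete, here is a minimal standalone sketch of a pre-check that reports unexpected characters instead of raising. It is not part of license-expression; check_license_key is a hypothetical helper, and it assumes the SPDX rule that a license identifier consists of letters, digits, "-" and ".", optionally followed by a single "+".

```python
import re

# Assumed SPDX identifier shape: letters, digits, "-" and ".",
# with an optional trailing "+" ("or later").
_VALID_KEY = re.compile(r"[A-Za-z0-9.\-]+\+?")
_UNEXPECTED = re.compile(r"[^A-Za-z0-9.\-+]")


def check_license_key(key):
    """Return (is_valid, unexpected_characters) without raising."""
    if _VALID_KEY.fullmatch(key):
        return True, []
    # Collect each offending character once, in sorted order.
    # The list can be empty for an invalid key, e.g. an empty
    # string or a "+" that is not at the end.
    return False, sorted(set(_UNEXPECTED.findall(key)))


for candidate in ("MIT", "GPL-2.0+", "MIT/X11", "MIT,X11"):
    print(candidate, check_license_key(candidate))
```

A caller such as Tern could run a check like this first and record the key as an invalid license rather than catching an exception from the parser.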