IBM / license-scanner


Fix SPDX 3.18 matching and update default library #24

Closed markstur closed 1 year ago

markstur commented 2 years ago
Update full SPDX 3.18 template support

1. SPDX default library updated with all the 3.18 templates.
2. More flexible matching around inconsistent punctuation, bullets, whitespace, tags, etc.
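
As a rough illustration of the flexible matching in item 2, here is a minimal Python sketch of whitespace and punctuation normalization. It is not the project's code: `normalize`, `_WS`, and `_PUNCT` are hypothetical names, and the set of extra Unicode characters (zero-width space, byte order mark) is an assumption, since the PR does not list which characters were added.

```python
import re

# Hypothetical sketch, not the license-scanner implementation.
# In Python str patterns, \s already covers Unicode whitespace such as U+00A0;
# U+200B (zero-width space) and U+FEFF (BOM) are not whitespace, so they are
# added explicitly here as an assumed example of "extra" characters.
_WS = re.compile(r"[\s\u200b\ufeff]+")

# Eat any spaces on either side of a comma or colon.
_PUNCT = re.compile(r"\s*([,:])\s*")


def normalize(text: str) -> str:
    """Collapse whitespace runs, then drop spaces around commas and colons."""
    text = _WS.sub(" ", text)
    text = _PUNCT.sub(r"\1", text)
    return text.strip().lower()


# Two spellings of the same phrase normalize to the same string.
a = normalize("Licensed under the License ,  Version 2.0 :\u00a0see\u200bbelow")
b = normalize("licensed under the license, version 2.0: see below")
```

Applying the same function to both the template and the input text means differences in spacing around commas and colons no longer cause a mismatch.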

* bullet/numbering handling improved for more test cases
  - regex matching tweaks
  - removing bullets from licenses, but not removing numbering
  - adding wildcard match to templates
* eat more spaces around template <<regex>> matchers
* eat spaces around commas and colons
* added some unicode chars to our whitespace matcher
* using a placeholder diamond symbol for removed tags instead of "" where needed to preserve word boundaries
* Removing "**" mid-line, not just at the beginning/end of a line, so that
  block comment removal won't cause mismatches when line breaks differ.
* improved removal of HTML tags to avoid corrupting templates
* Varietal word replacement moved to after whitespace reduction
* For now, using replacement_words.json (replaceVarietalWordSpellings()) as the implementation of whitespace removal around commas, colons, and **
* More importer/validate tests for SPDX templates that needed a fix or a test
* Add diff for importer/validate test errors
* Handle deprecated_ SPDX file name mismatch with a retry
* update tests, as needed (i.e. normalized text changes)
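
The placeholder idea for removed tags can be sketched as follows. This is a hedged Python illustration, not the project's code: the exact diamond character (`U+25C6` here), the tag regex, and the function names are assumptions. The lookarounds are one possible way to leave SPDX `<<...>>` template matchers untouched.

```python
import re

# Assumed placeholder; the PR only says "diamond", not which code point.
PLACEHOLDER = "\u25c6"

# Match a single-bracket tag like <br/> or <p>, but skip <<...>> template
# matchers: the lookbehind rejects a "<" preceded by "<", and [^<>]+ cannot
# start on a second "<".
_TAG = re.compile(r"(?<!<)<[^<>]+>(?!>)")


def strip_tags_naive(text: str) -> str:
    """Replace tags with "" (can fuse neighboring words together)."""
    return _TAG.sub("", text)


def strip_tags_with_placeholder(text: str) -> str:
    """Replace tags with a placeholder so word boundaries survive."""
    return _TAG.sub(PLACEHOLDER, text)


sample = "terms<br/>and conditions"
fused = strip_tags_naive(sample)
kept = strip_tags_with_placeholder(sample)
template = strip_tags_with_placeholder("<<beginOptional>>Copyright<<endOptional>>")
```

With the empty-string replacement, `terms` and `and` fuse into a single word. With the placeholder, a later normalization step can still see the boundary and decide how to treat it.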

Fixes: #7
Fixes: #9
Fixes: #10