Failure of tests/test_paths.py::TestPortablePath::test_safe_path_posix_style_chinese_char

eclipseo commented 1 year ago

Environment:

Python 3.12.0~rc1
Fedora Rawhide
commoncode 31.0.2

The following test fails:

___________ TestPortablePath.test_safe_path_posix_style_chinese_char ___________

self = <test_paths.TestPortablePath testMethod=test_safe_path_posix_style_chinese_char>

    def test_safe_path_posix_style_chinese_char(self):
        test = paths.safe_path(b'/includes/webform.compon\xd2\xaants.inc/')
        expected = 'includes/webform.componNSnts.inc'
>       assert test == expected
E       AssertionError: assert 'includes/web...mponS_nts.inc' == 'includes/web...mponNSnts.inc'
E         - includes/webform.componNSnts.inc
E         ?                        -
E         + includes/webform.componS_nts.inc
E         ?                         +

tests/test_paths.py:74: AssertionError

tests/test_paths.py::TestPortablePath::test_safe_path_posix_style_chinese_char
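
As background to the failure above: the two bytes `\xd2\xaa` in the test input decode differently depending on the encoding assumed. A minimal sketch follows, using only built-in `bytes.decode`. It does not call commoncode, and whether this difference explains the `NS` versus `S_` output is an assumption that the thread does not confirm. `decode_variants` is a hypothetical helper name.

```python
# The byte string passed to paths.safe_path() in the failing test.
RAW = b'/includes/webform.compon\xd2\xaants.inc/'


def decode_variants(raw):
    """Decode the same bytes as UTF-8 and as Latin-1 for comparison."""
    return {
        "utf-8": raw.decode("utf-8"),
        "latin-1": raw.decode("latin-1"),
    }


variants = decode_variants(RAW)
for encoding, text in variants.items():
    # ascii() escapes non-ASCII characters so the output is safe on any terminal.
    print(encoding, ascii(text))
```

As UTF-8 the two bytes form one code point (U+04AA); as Latin-1 they are two separate characters (U+00D2 and U+00AA), so any later transliteration step sees different input.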

pombredanne commented 1 year ago

@eclipseo Thanks! We designed these tests for a reason, so they could break as needed, and it looks like that need is now!

Can you tell me your processor architecture? And what are your locale and filesystem encoding?
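
One possible way to collect these details in one go is the standard-library sketch below. `describe_environment` is a hypothetical helper name, not part of commoncode, and the set of fields reported is an assumption about what is useful here.

```python
import locale
import platform
import sys


def describe_environment():
    """Collect the architecture, locale encoding and filesystem encoding."""
    return {
        "architecture": platform.machine(),
        "python": platform.python_version(),
        # Encoding used for text data according to the current locale settings.
        "preferred_encoding": locale.getpreferredencoding(False),
        # Encoding used to convert file names between str and bytes.
        "filesystem_encoding": sys.getfilesystemencoding(),
        # True when Python runs in UTF-8 mode (PEP 540).
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```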

Is there a way to get a Fedora Rawhide container image of sorts with Python 3.12.0~rc1 to reproduce the failure?

Side note: It seems from the trail of questions and issues you leave behind that you are porting ScanCode to Fedora! ... this is awesome!

eclipseo commented 1 year ago

My arch is x86_64 but will be testing on s390x, ppc64le and aarch64.

I think you can find images on https://registry.fedoraproject.org/; look for Fedora 40.

For now I have most of the dependencies prepared, but I still have issues with testing extractcode and one other similar package, so I think I'll file a bug to ask for help.

For now, we use Debian's licencecheck in our review tool. One of the legal people from Red Hat suggested replacing it with askalono, but as the initial packager for askalono, which I am using for license detection in Golang packaging, I can say it is no better. So I'm looking for an alternative, preferably in Python, to plug into my tools, and potentially into the official review tool if it gives good results.

So far it's been good; askalono has trouble when a license file contains multiple licenses, and also with linking exceptions.

Fedora has been switching to SPDX and we have more than 25000 packages to go through. We can't automatically convert from the previous notation to SPDX because, unlike SPDX, we called things MIT/BSD/CC-BY without specifying the version. And we have new rules for "effective analysis", so basically we need to reanalyze the code base.

pombredanne commented 1 year ago

@eclipseo this is awesome!

You wrote:

I think you can find images on https://registry.fedoraproject.org/; look for Fedora 40.

Ideally we should add a basic smoke test for a Fedora container in https://github.com/nexB/skeleton/blob/main/azure-pipelines.yml and use it across all the repos!

Some other comments:

The Debian folks dealing with licensing (including licencecheck) are good buddies. We hang out in the #license channel on Debian's IRC.
I know most of the tools in the space, whether licencecheck or askalono or the Google licensecheck or licenseclassifier or Fossology just to name a few. Most only deal with the basic SPDX licenses. We should be as good or better in all cases. We are not the fastest... but we shoot for correctness first and a few other tools and orgs integrate and use ScanCode.
We have the largest databases of licenses out there with over 2K licenses and over 30K license notices
Our detection approach boils down to essentially an improved pair-wise diff which provides better accuracy, and it is able to detect many/all mixed licenses in a file.
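
To give a rough idea of what a pair-wise comparison of texts looks like, here is a toy sketch using `difflib.SequenceMatcher` from the standard library. This is not ScanCode's actual detection code; `best_match` and the short sample texts are hypothetical and only illustrate scoring a text against known references.

```python
from difflib import SequenceMatcher


def best_match(text, known_texts):
    """Return (name, ratio) for the reference text most similar to `text`.

    Texts are compared as sequences of lowercased, whitespace-separated tokens.
    """
    tokens = text.lower().split()
    scored = [
        (name, SequenceMatcher(None, tokens, reference.lower().split()).ratio())
        for name, reference in known_texts.items()
    ]
    return max(scored, key=lambda item: item[1])


# Hypothetical, heavily shortened reference snippets.
KNOWN = {
    "mit-like": "permission is hereby granted free of charge",
    "bsd-like": "redistribution and use in source and binary forms",
}

name, ratio = best_match("Permission is hereby granted, free of charge", KNOWN)
print(name, round(ratio, 2))
```

A real detector would need proper tokenization, an index to avoid comparing against every reference, and handling of several licenses in one file, none of which this sketch attempts.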

We have a few things that should be of interest to you:

ScanCode.io hosts easy-to-script pipelines to automate ScanCode runs. It could be a decent base to automate the scan of 25,000 packages. It could be a place where you set your "effective analysis" rules.
We have a license clarity score option to indicate whether the license documentation found in a scan is great, good or bad, and a new --todo option to tell when a human eye is needed to review results.
Each license detection can be grouped when relevant and we have detailed tracing of what was detected and how.
We treat every incorrect or inaccurate detection as a bug to be fixed. ... I mean it. So if you use ScanCode for Fedora, I would like us to collaborate closely so we can fix all the detection bugs.
There are likely a few Fedora-specific things we could consider. We do not parse specfiles, for instance, and there are a few Fedora/Red Hat-specific license references and extensions we could easily add.

All-in-all I would like to help!

eclipseo commented 1 year ago

@pombredanne If you can help with this: https://github.com/nexB/extractcode/issues/51 This is my remaining blocker.

eclipseo commented 1 year ago

So it works on Fedora 38 and 39 with

export LC_ALL=C.UTF-8

before the test, but not Fedora 40.
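
The same workaround can be applied from Python by setting `LC_ALL` only for a child process rather than exporting it in the shell. A minimal sketch, assuming the tests are launched as a subprocess; `run_with_locale` is a hypothetical helper and the pytest command line in the comment is only an example.

```python
import os
import subprocess
import sys


def run_with_locale(cmd, locale_name="C.UTF-8"):
    """Run `cmd` with LC_ALL set to `locale_name` in the child environment only."""
    env = dict(os.environ, LC_ALL=locale_name)
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Example (assumed invocation):
#   run_with_locale([sys.executable, "-m", "pytest", "tests/test_paths.py"])
result = run_with_locale(
    [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]
)
print(result.stdout.strip())
```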

pombredanne commented 2 months ago

@eclipseo the latest release should pass all tests up to 3.11 and I am adding 3.12 support next. Closely related, I have been hitting this bug https://github.com/jawah/charset_normalizer/issues/520

aboutcode-org / commoncode

Failure of tests/test_paths.py::TestPortablePath::test_safe_path_posix_style_chinese_char #56