aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://aboutcode.org/scancode/

Wrong URL extraction for multi-line strings inside PO files #3946

Open stefan6419846 opened 1 month ago

stefan6419846 commented 1 month ago

Description

Extracting URLs from PO header values that are spread across multiple lines does not work: only the individual lines are considered, so a URL that continues on the next line is truncated.

How To Reproduce

Run URL analysis on https://github.com/django/django/blob/97c05a64ca87253e9789ebaab4b6d20a1b2370cf/django/contrib/admin/locale/en_GB/LC_MESSAGES/django.po

This will report http://www.transifex.com/django/ as the URL instead of the full one, which would be http://www.transifex.com/django/django/language/en_GB/.
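To illustrate the difference, here is a minimal Python sketch, not ScanCode's actual code. It assumes the header entry is wrapped roughly as in `PO_SAMPLE` (an illustrative reconstruction based on the two URLs above, not a verbatim copy of the file), and `URL_RE` and `join_po_continuations` are hypothetical names. Matching line by line yields the truncated URL, while joining the adjacent quoted lines first yields the full one.

```python
import re

# Illustrative sample only: assumed wrapping of the PO header entry.
PO_SAMPLE = '''msgid ""
msgstr ""
"Language-Team: English (United Kingdom) (http://www.transifex.com/django/"
"django/language/en_GB/)\\n"
'''

# Deliberately simple URL pattern for this sketch (not ScanCode's pattern).
URL_RE = re.compile(r'https?://[^\s"<>()]+')


def join_po_continuations(text):
    """Concatenate the contents of lines that consist of one quoted string."""
    pieces = []
    for line in text.splitlines():
        line = line.strip()
        if len(line) >= 2 and line.startswith('"') and line.endswith('"'):
            pieces.append(line[1:-1])
    # Turn the PO escape sequence \n into a real newline after joining.
    return "".join(pieces).replace("\\n", "\n")


# Line-by-line matching: the URL is cut at the closing quote of the first line.
per_line = [url for line in PO_SAMPLE.splitlines() for url in URL_RE.findall(line)]

# Matching after joining the continuation lines: the URL is complete.
joined = URL_RE.findall(join_po_continuations(PO_SAMPLE))

print(per_line)
print(joined)
```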

System configuration

AyanSinhaMahapatra commented 3 weeks ago

Thanks @stefan6419846 for the report, this is a bug indeed and should be fixed!

We could likely do something specific for PO files, but it could be interesting to look at more examples like this in different kinds of files and stitch together URLs from multiple lines based on some patterns/heuristics (for example, if there are multiple / on the next lines).
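One possible shape for such a heuristic, as a rough Python sketch under stated assumptions: a URL is extended only when it ends its line with a `/` and the next line starts with a path-like fragment containing at least `min_slashes` slashes. The function `stitch_urls`, both patterns, and the threshold are hypothetical and are not part of ScanCode.

```python
import re

# A URL that ends its line with "/" (optionally followed by a closing quote).
URL_AT_END = re.compile(r'(https?://[^\s"\'<>()]+/)["\']?\s*$')
# A path-like fragment at the start of a line (optionally after an opening quote).
PATH_AT_START = re.compile(r'^\s*["\']?([A-Za-z0-9._~%-]+(?:/[A-Za-z0-9._~%-]*)+)')


def stitch_urls(lines, min_slashes=2):
    """Return line-ending URLs, extended by a path-like start of the next line."""
    stitched = []
    for current, following in zip(lines, lines[1:] + [""]):
        head = URL_AT_END.search(current)
        if not head:
            continue
        url = head.group(1)
        tail = PATH_AT_START.match(following)
        # Only stitch when the continuation looks like a path, not like prose.
        if tail and tail.group(1).count("/") >= min_slashes:
            url += tail.group(1)
        stitched.append(url)
    return stitched


wrapped = [
    '"Language-Team: English (United Kingdom) (http://www.transifex.com/django/"',
    '"django/language/en_GB/)\\n"',
]
unrelated = ["see http://example.com/", "and more text"]

print(stitch_urls(wrapped))
print(stitch_urls(unrelated))
```

The slash threshold is a guess at keeping false positives low; it would need tuning against real examples from other file types before being relied on.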