aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://aboutcode.org/scancode/

Integration of sanexml fails. #3483

Open 35C4n0r opened 1 year ago

35C4n0r commented 1 year ago

The integration of sanexml with Pymaven and SCTK is failing. Here is the log. I think many of these failures occur because the XML files are not well formed and I didn't implement any logic corresponding to recover=True for the XMLParser, which makes it parse XML leniently.

I see two possibilities to solve this issue:

If we go with the first approach, we can either use the pipeline html.parser -> get correct xml -> DefaultXMLParser as the default, or we can do:

    try: DefaultXMLParser
    except:
        try: html.parser -> get correct xml -> DefaultXMLParser  # Slower #Lenient
        except:
            try: html5lib -> get correct xml -> DefaultXMLParser  # Slowest #MoreLenient
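For illustration, the cascade above could be written as a flat loop over parsers instead of nested try/except blocks. This is only a rough sketch using the standard library: `xml.etree.ElementTree` stands in for DefaultXMLParser, and a small `html.parser.HTMLParser` subclass stands in for the BeautifulSoup4 repair step. The names `_Repairer`, `parse_strict`, `parse_lenient` and `parse_with_fallbacks` are hypothetical and are not part of sanexml, pymaven or ScanCode. Note that `html.parser` lowercases tag and attribute names, which would matter for POM elements such as `artifactId`.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr


class _Repairer(HTMLParser):
    """Hypothetical helper: rebuild a well-formed XML string from loose markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of the repaired document
        self.stack = []  # names of the elements that are currently open

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(f" {k}={quoteattr(v or '')}" for k, v in attrs)
        self.out.append(f"<{tag}{attr_text}>")
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close every element left open inside this one, then the element itself.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def repaired(self):
        # Close whatever is still open at the end of the input.
        while self.stack:
            self.out.append(f"</{self.stack.pop()}>")
        return "".join(self.out)


def parse_strict(text):
    """Stand-in for the default (strict) XML parser."""
    return ET.fromstring(text)


def parse_lenient(text):
    """Repair the markup with html.parser, then parse the result strictly."""
    repairer = _Repairer()
    repairer.feed(text)
    repairer.close()
    return ET.fromstring(repairer.repaired())


# Ordered from fastest/strictest to slowest/most lenient.
# An html5lib-based parser could be appended here as a third step.
PARSERS = (("strict", parse_strict), ("html.parser repair", parse_lenient))


def parse_with_fallbacks(text, parsers=PARSERS):
    """Return (parser_name, root) from the first parser that succeeds."""
    errors = []
    for name, parser in parsers:
        try:
            return name, parser(text)
        except ET.ParseError as exc:
            errors.append(f"{name}: {exc}")
    raise ValueError("all parsers failed: " + "; ".join(errors))
```

Returning the parser name alongside the tree makes it easy to log how often the slower fallbacks are actually needed.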

@JonoYang @pombredanne @AyanSinhaMahapatra please share your opinions on this.

AyanSinhaMahapatra commented 1 year ago

Another possibility could be to vendor the relevant part of the code rather than depending on the whole of Beautiful Soup for this option: "Use BeautifulSoup4 and use its html.parser for lenient parsing to get a correct XML string."

Lastly, you said we can either use html.parser -> get correct xml -> DefaultXMLParser or the cascaded try-except. Are there cases where html.parser would also fail?

Regarding slower/slowest, can you give some stats on how much slower these are? Is there some comparison/benchmarking you can refer to somewhere? Correctness is much more important here than speed; only after we get everything running correctly can we think about speed IMHO.
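One way to get such numbers would be a small `timeit` harness. The sketch below is only an assumption about how a comparison might be set up: it times a strict `xml.etree.ElementTree` parse against a plain tokenizing pass through the standard library `html.parser`, on a synthetic document. The sample input and the function names are hypothetical, and real POM files and the actual candidate parsers would give more meaningful figures.

```python
import timeit
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Synthetic, well-formed input; a corpus of real POM files would be better.
SAMPLE = "<project>" + "<dependency><name>x</name></dependency>" * 200 + "</project>"


def strict_parse():
    """Parse the sample with the strict standard library XML parser."""
    ET.fromstring(SAMPLE)


def lenient_tokenize():
    """Run the sample through html.parser without building a tree."""
    parser = HTMLParser()
    parser.feed(SAMPLE)
    parser.close()


CANDIDATES = (("strict", strict_parse), ("html.parser", lenient_tokenize))


def benchmark(repeat=5, number=20):
    """Return the best observed time per call, in seconds, for each candidate."""
    results = {}
    for name, func in CANDIDATES:
        best = min(timeit.repeat(func, repeat=repeat, number=number))
        results[name] = best / number
    return results


if __name__ == "__main__":
    for name, seconds in benchmark().items():
        print(f"{name}: {seconds * 1e6:.1f} microseconds per parse")
```

Taking the minimum over several repeats reduces noise from other processes on the machine.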

pombredanne commented 1 year ago

@35C4n0r we have control over pymaven at https://github.com/nexB/pymaven/branches and it is released on PyPI (https://pypi.org/project/pymaven-patch/). We can and should evolve it as needed, because https://github.com/sassoftware/pymaven/ is now dormant, so we are the head fork for this unless @wfscheper @lbigelow1 would be willing to unarchive and transfer the pymaven project over?

35C4n0r commented 1 year ago

Update: the above issue has been resolved; only a few tests fail now (new_logs).

wfscheper commented 1 year ago

I actually started working on modernizing pymaven recently, and was thinking of asking at work to have it unarchived (it was bulk archived along with a bunch of other repositories). If there's external interest in the project, that is good to know.

pombredanne commented 1 year ago

@wfscheper re:

I actually started working on modernizing pymaven recently, and was thinking of asking at work to have it unarchived (it was bulk archived along with a bunch of other repositories). If there's external interest in the project, that is good to know.

We use a mild fork as a dependency in ScanCode. I would be fine to rejoin the main upstream if you revive it ... note that:

  1. Our fork is not based on your latest commits but on a slightly older version, so there would be some work to align it back with your upstream.

  2. We maintain extensions in https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/maven.py and in particular we have enhanced properties resolution.

We have some related code in https://github.com/nexB/purldb/ to:

Beyond this, our plan is to also include support for Gradle and add a dependency resolver (similar to what we have with python-inspector), and likely bundle all this in a maven-inspector package.

wfscheper commented 1 year ago

All of that sounds very exciting! I'm happy to work with you to get any enhancements you want to share back merged into pymaven. My current plans are mostly to modernize the packaging, write some actual docs, and maybe add type hints.