Open 35C4n0r opened 1 year ago
Another possibility could be to vendor this part of the code rather than pulling in the whole of Beautiful Soup for this part: "Use BeautifulSoup4 and its html.parser for lenient parsing to get a correct XML string."
Lastly, you said we can either use html.parser -> get correct xml -> DefaultXMLParser or the cascaded try-except. Are there cases where html.parser would also fail?
Regarding slower/slowest, can you give some stats on how much slower these are? Is there a comparison or benchmark you can point to? Correctness is much more important here than speed; only after we get everything running correctly can we think about speed, IMHO.
@35C4n0r we have control over pymaven at https://github.com/nexB/pymaven/branches and it is released at https://pypi.org/project/pymaven-patch/. We can and should evolve this as needed, because https://github.com/sassoftware/pymaven/ is now dormant, so we are the head fork for this, unless @wfscheper @lbigelow1 would be willing to unarchive and transfer the pymaven project over?
Update: the above issue has been resolved; only a few tests fail now (new_logs).
I actually started working on modernizing pymaven recently, and was thinking of asking at work to have it unarchived (it was bulk archived along with a bunch of other repositories). If there's external interest in the project, that is good to know.
@wfscheper re:
I actually started working on modernizing pymaven recently, and was thinking of asking at work to have it unarchived (it was bulk archived along with a bunch of other repositories). If there's external interest in the project, that is good to know.
We use a mild fork as a dependency in ScanCode. I would be fine to rejoin the main upstream if you revive it ... note that:
Our fork is not based on your latest commits but a slightly older version so there would be some work to align it back with your upstream
We maintain extensions in https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/maven.py and in particular we have enhanced properties resolution.
We have some related code in https://github.com/nexB/purldb/ to:
Beyond this, our plan is to also include support for Gradle and add a dependency resolver (similar to what we have with python-inspector), and likely bundle all this in a maven-inspector package.
All of that sounds very exciting! I'm happy to work with you to get any enhancements you want to share back merged into pymaven. My current plans are mostly to modernize the packaging, write some actual docs, and maybe add type hints.
The integration of sanexml with pymaven and SCTK is failing. Here is the log. I think a lot of these tests are failing because the XML files are not in a proper format and I didn't implement any logic corresponding to recover=True for the XMLParser, which parses XML leniently. I see two possibilities to solve this issue:
Use BeautifulSoup4 and its html.parser for lenient parsing to get a correct XML string, and then parse that string normally using Python's default XML parsers. (BS4 Docs)
If we go with the first approach, we can either use this process
    html.parser -> get correct xml -> DefaultXMLParser
as the default, or we can do:
    try:
        DefaultXMLParser
    except:
        try:
            html.parser -> get correct xml -> DefaultXMLParser  # Slower, lenient
        except:
            try:
                html5lib -> get correct xml -> DefaultXMLParser  # Slowest, more lenient
@JonoYang @pombredanne @AyanSinhaMahapatra please share your opinions on this.
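For illustration, the cascaded try-except idea could be sketched roughly as follows. This is only a sketch under assumptions, not the sanexml or pymaven implementation: `parse_leniently`, `repair_with_bs4`, and `DEFAULT_REPAIRERS` are hypothetical names, and `xml.etree.ElementTree` stands in for "DefaultXMLParser". One caveat worth checking before relying on it: per the Beautiful Soup documentation, the HTML parsers convert tag names to lowercase (so `artifactId` would come back as `artifactid`), and html5lib wraps the content in a full HTML document, so the "correct xml" produced this way may not match the original structure.

```python
import xml.etree.ElementTree as ET


def repair_with_bs4(text, features):
    """Round-trip markup through BeautifulSoup to get a cleaned-up string."""
    # Lazy import: the strict path below works even if bs4 is not installed.
    from bs4 import BeautifulSoup

    return str(BeautifulSoup(text, features))


# Hypothetical default cascade, ordered from faster to slower and more lenient.
DEFAULT_REPAIRERS = (
    lambda text: repair_with_bs4(text, "html.parser"),
    lambda text: repair_with_bs4(text, "html5lib"),
)


def parse_leniently(text, repairers=DEFAULT_REPAIRERS):
    """Parse strictly first; on failure, try each repairer in order."""
    try:
        return ET.fromstring(text)
    except ET.ParseError as error:
        last_error = error

    for repair in repairers:
        try:
            return ET.fromstring(repair(text))
        except (ET.ParseError, ImportError, ValueError) as error:
            # ImportError: bs4 is missing.
            # ValueError: bs4 could not find the requested tree builder.
            last_error = error

    # Every attempt failed: surface the most recent error to the caller.
    raise last_error
```

Well-formed input never touches Beautiful Soup, so the slower fallbacks only cost time on files that would otherwise fail outright. Passing a custom `repairers` sequence makes it easy to test the cascade, or to swap in a vendored repair step instead of the full bs4 dependency.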