aboutcode-org / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://github.com/aboutcode-org/scancode-toolkit/releases/
2.1k stars 543 forks

Using BeautifulSoup4 and html.parser to parse XML files. #3486

Open 35C4n0r opened 1 year ago

35C4n0r commented 1 year ago

To parse XML documents we will use BeautifulSoup4 with html.parser. We are using this option instead of Python's built-in XML parser because the XML to be parsed is sometimes malformed: html.parser is lenient when parsing, whereas the standard library XML parser only handles well-formed XML.
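
A minimal sketch of the difference, using only the standard library. It drives html.parser.HTMLParser directly (the parser that BeautifulSoup's "html.parser" option builds on) instead of BeautifulSoup itself, and the malformed snippet and helper names are made up for illustration:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical malformed input: <artifactId> is never closed.
MALFORMED = "<parent><groupId>org.jboss.seam</groupId><artifactId>root</parent>"


class TextCollector(HTMLParser):
    """Collect the text chunks reported by the lenient parser."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strict_parse_fails(text):
    """Return True if the strict standard library XML parser rejects text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return True
    return False


def lenient_text(text):
    """Return the text content html.parser still recovers from text."""
    collector = TextCollector()
    collector.feed(text)
    collector.close()
    return collector.chunks
```

Here strict_parse_fails(MALFORMED) is True because of the mismatched closing tag, while lenient_text(MALFORMED) still yields the two values.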

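There is an issue with this approach: html.parser lowercases every tag name, while XML tag names are case-sensitive, so tags that differ only in case (such as url and Url) can no longer be told apart after parsing.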
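
The case folding done by html.parser can be observed with a small standard library sketch. The class and function names are hypothetical, and the sample reuses two tags from the example in this issue:

```python
from html.parser import HTMLParser


class TagNameCollector(HTMLParser):
    """Record start tag names exactly as html.parser reports them."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


def reported_tag_names(text):
    """Return the start tag names html.parser reports for text."""
    collector = TagNameCollector()
    collector.feed(text)
    collector.close()
    return collector.names


SAMPLE = (
    "<parent>"
    "<url>https://github.com/</url>"
    "<Url>https://github.com/35C4n0r</Url>"
    "</parent>"
)

# Both <url> and <Url> come back as "url", so they can no longer be told apart.
print(reported_tag_names(SAMPLE))
```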

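In order to deal with this: we replace every tag name with a unique placeholder (TAG0, TAG1, ...) before parsing, keep a mapping from the original names to the placeholders, and use that mapping to convert the tags back after parsing.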

Example:

<parent>    
    <groupId>org.jboss.seam</groupId>   
    <artifactId>root</artifactId>   
    <url>https://github.com/</url>  
    <Url>https://github.com/35C4n0r</Url>   
</parent>

The Mapping:

{'groupId': 'TAG0', 'parent': 'TAG1', 'url': 'TAG2', 'artifactId': 'TAG3', 'Url': 'TAG4'}

The new XML:

<TAG1>  
    <TAG0>org.jboss.seam</TAG0> 
    <TAG3>root</TAG3>   
    <TAG2>https://github.com/</TAG2>    
    <TAG4>https://github.com/35C4n0r</TAG4> 
</TAG1>

After parsing it with BeautifulSoup:

<tag1>
    <tag0>org.jboss.seam</tag0>
    <tag3>root</tag3>
    <tag2>https://github.com/</tag2>
    <tag4>https://github.com/35C4n0r</tag4>
</tag1>

After using the map to convert the tags back:

<parent>
    <groupId>org.jboss.seam</groupId>
    <artifactId>root</artifactId>
    <url>https://github.com/</url>
    <Url>https://github.com/35C4n0r</Url>
</parent>
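
A rough sketch of the round trip in the example above, not the actual ScanCode code. It assumes a simple regex is good enough to find tag names (it ignores comments, CDATA and attributes), numbers placeholders in order of first appearance (so the numbering differs from the mapping shown above), and uses the standard library html.parser.HTMLParser as a stand-in for BeautifulSoup. All function and class names are hypothetical:

```python
import re
from html.parser import HTMLParser

# Matches the name of an opening or closing tag; a simplification that
# does not try to skip comments or CDATA sections.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w.:\-]*)")


def encode_tags(xml_text):
    """Replace each distinct tag name with TAG0, TAG1, ... and return
    the rewritten text together with the original-to-placeholder mapping."""
    mapping = {}

    def substitute(match):
        slash, name = match.groups()
        placeholder = mapping.setdefault(name, f"TAG{len(mapping)}")
        return f"<{slash}{placeholder}"

    return TAG_RE.sub(substitute, xml_text), mapping


class RestoringCollector(HTMLParser):
    """Map the lowercased placeholders reported by html.parser back to
    the original, case-sensitive tag names."""

    def __init__(self, mapping):
        super().__init__()
        self.original = {ph.lower(): name for name, ph in mapping.items()}
        self.restored = []

    def handle_starttag(self, tag, attrs):
        self.restored.append(self.original.get(tag, tag))


def restored_tag_names(xml_text):
    """Encode, parse leniently, and return the restored start tag names."""
    encoded, mapping = encode_tags(xml_text)
    collector = RestoringCollector(mapping)
    collector.feed(encoded)
    collector.close()
    return collector.restored


POM = """<parent>
    <groupId>org.jboss.seam</groupId>
    <artifactId>root</artifactId>
    <url>https://github.com/</url>
    <Url>https://github.com/35C4n0r</Url>
</parent>"""
```

Because the placeholders differ in their digits and not in their case, url and Url stay distinct after the parser lowercases everything.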

pombredanne commented 1 year ago

@35C4n0r Can you elaborate on what issue you are facing with details? Which exact files are a problem to parse?

35C4n0r commented 1 year ago

> @35C4n0r Can you elaborate on what issue you are facing with details? Which exact files are a problem to parse?

@pombredanne I was just creating this on the Project Board and accidentally converted it to an issue. Anyway, I've added a proper description.

pombredanne commented 1 year ago

@35C4n0r thanks! I was just curious! I sometimes do this too. Now I wonder if we ever care about the XML tag case? Because if not, you do not even need to build a mapping at all.

35C4n0r commented 1 year ago

@pombredanne we use the POM, which has a properties section, and we depend on the tag names to extract the different properties, example.