Eagerod / html-cruncher

HTML parser
MIT License

Can't handle super broken attribute lists. #19

Closed Eagerod closed 8 years ago

Eagerod commented 8 years ago

Attribute lists with malformed quotes blow up because of the way tag parsing works. The specific case is tags, such as meta tags, that contain unescaped quotes inside an attribute value; the stray quotes end up destroying the rest of the attribute list.

Known failure case:

<meta name="og:title" content=""You're a "Dog">
<meta name="twitter:title" content=""You're a "Dawg">

Expected output:

[
    {
        "dataType": "tag",
        "content": "meta",
        "attributes": {
            "name": {
                "dataType": "attribute",
                "content": "og:title"
            },
            "content": {
                "dataType": "attribute",
                "content": ""
            },
            "You're\"": {
                "dataType": "attribute"
            },
            "a": {
                "dataType": "attribute"
            },
            "Dog\"": {
                "dataType": "attribute"
            }
        }
    },
    {
        "dataType": "tag",
        "content": "meta",
        "attributes": {
            "name": {
                "dataType": "attribute",
                "content": "twitter:title"
            },
            "content": {
                "dataType": "attribute",
                "content": ""
            },
            "You're\"": {
                "dataType": "attribute"
            },
            "a": {
                "dataType": "attribute"
            },
            "Dawg\"": {
                "dataType": "attribute"
            }
        }
    }
]
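
For illustration only, here is a rough sketch of a lenient attribute tokenizer that degrades the way the expected output above does: once a quoted value closes, any leftover tokens are kept as valueless attributes instead of aborting the parse. This is not html-cruncher's actual code; `parse_attributes` is a hypothetical name, and the exact key spelling it gives to tokens with stray quotes is an assumption that may differ from the output listed above.

```python
def parse_attributes(attr_text):
    """Leniently split a tag's attribute text into a dict of attribute entries.

    Hypothetical helper, not part of html-cruncher.
    """
    attributes = {}
    i, n = 0, len(attr_text)
    while i < n:
        # Skip whitespace between attributes.
        if attr_text[i].isspace():
            i += 1
            continue
        # Read a name up to the next whitespace or '='.
        start = i
        while i < n and not attr_text[i].isspace() and attr_text[i] != "=":
            i += 1
        name = attr_text[start:i]
        entry = {"dataType": "attribute"}
        if i < n and attr_text[i] == "=":
            i += 1
            if i < n and attr_text[i] in "\"'":
                # Quoted value: runs to the next matching quote.
                quote = attr_text[i]
                end = attr_text.find(quote, i + 1)
                if end == -1:
                    end = n  # Unterminated quote: take the rest of the text.
                entry["content"] = attr_text[i + 1:end]
                i = end + 1
            else:
                # Unquoted value: runs to the next whitespace.
                start = i
                while i < n and not attr_text[i].isspace():
                    i += 1
                entry["content"] = attr_text[start:i]
        if name:
            attributes[name] = entry
    return attributes


parsed = parse_attributes('name="og:title" content=""You\'re a "Dog"')
```

With this sketch, `name` and `content` parse normally (`content` comes out empty), and everything after the prematurely closed quote is recorded as bare attributes rather than raising.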