RyanMarcus / dirty-json

A parser for invalid JSON
GNU Affero General Public License v3.0

Spacing preserving with unescaped quotes #15

Closed by MartinoMensio 4 years ago

MartinoMensio commented 5 years ago

I have a small problem with a specific input. I tried it in the web demo, and it concatenates the words inside the unescaped quotes (see "claimReviewed").

Input:

{
    "@context": "http://schema.org",
    "@type": [
        "Review",
        "ClaimReview"
    ],
    "datePublished": "2016-03-31",
    "url": "http://www.politifact.com/north-carolina/statements/2016/mar/30/pat-mccrory/pat-mccrory-wrong-when-he-says-north-carolinas-new/",
    "author": {
        "@type": "Organization",
        "url": "https://www.politifact.com" "twitter": "@politifact"
    },
    "claimReviewed": ""We have not taken away any rights that have currently existed in any city in North Carolina" with the passage of HB2.",
    "claimReviewSiteLogo": "http://static.politifact.com/mediapage/jpgs/politifact-logo-big.jpg",
    "reviewRating": {
        "@type": "Rating",
        "ratingValue": "4",
        "bestRating": "6",
        "text": "False",
        "image": "https://s3.amazonaws.com/share-the-facts/rating_images/politifact/tom-false.jpg"
    },
    "itemReviewed": {
        "@type": "CreativeWork",
        "author": {
            "@type": "Person",
            "name": "Pat McCrory",
            "title": "Governor of North Carolina",
            "image": "http://static.politifact.com.s3.amazonaws.com/politifact%2Fmugs%2FMcCrory_mug.jpg",
            "sameAs": []
        },
        "datePublished": "2016-03-28",
        "sourceName": "A speech in Clayton, NC"
    }
}

Output:

{
    "@context": "http://schema.org",
    "@type": [
        "Review",
        "ClaimReview"
    ],
    "datePublished": "2016-03-31",
    "url": "http://www.politifact.com/north-carolina/statements/2016/mar/30/pat-mccrory/pat-mccrory-wrong-when-he-says-north-carolinas-new/",
    "author": {
        "@type": "Organization",
        "url": "https://www.politifact.com",
        "twitter": "@politifact"
    },
    "claimReviewed": "\"WehavenottakenawayanyrightsthathavecurrentlyexistedinanycityinNorthCarolina\" with the passage of HB2.",
    "claimReviewSiteLogo": "http://static.politifact.com/mediapage/jpgs/politifact-logo-big.jpg",
    "reviewRating": {
        "@type": "Rating",
        "ratingValue": "4",
        "bestRating": "6",
        "text": "False",
        "image": "https://s3.amazonaws.com/share-the-facts/rating_images/politifact/tom-false.jpg"
    },
    "itemReviewed": {
        "@type": "CreativeWork",
        "author": {
            "@type": "Person",
            "name": "Pat McCrory",
            "title": "Governor of North Carolina",
            "image": "http://static.politifact.com.s3.amazonaws.com/politifact%2Fmugs%2FMcCrory_mug.jpg",
            "sameAs": []
        },
        "datePublished": "2016-03-28",
        "sourceName": "A speech in Clayton, NC"
    }
}

First of all: this package saved me tons of hours! 🥇

RyanMarcus commented 5 years ago

That's an interesting case. The problem is that the parser chomps whitespace when it thinks it is outside a quoted string, which is what happens here because of the ""...

It should be fixable. I have a deadline coming up March 1st, and I'll take a look at this afterwards.

Here's a hack you could try in the meantime: use the regex [a-zA-Z]\s[a-zA-Z] to replace each space between two letters with a special symbol, either a rare Unicode character or a long string like THISISREALLYASPACE. Then run the text through the parser to get clean JSON. Finally, replace that symbol with a space. This won't catch spaces next to non-letter characters, but it should fix a lot of cases without breaking the JSON.
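A minimal sketch of that workaround, for illustration only. The helper names `protectSpaces` and `restoreSpaces` are hypothetical and not part of dirty-json. One deliberate difference from the regex as written above: a plain `[a-zA-Z]\s[a-zA-Z]` consumes both letters, so in a run like `a b c` the second space would be skipped. The sketch uses a lookahead for the second letter instead. It also restores object keys, since keys containing spaces get the marker too.

```javascript
// Hypothetical marker; any string that cannot occur in the real data works.
const SPACE_MARKER = "THISISREALLYASPACE";

// Replace each whitespace character that sits between two ASCII letters.
// The lookahead leaves the second letter unconsumed, so "a b c" is fully covered.
// A function replacement is used so the marker is always inserted literally.
function protectSpaces(text, marker = SPACE_MARKER) {
  return text.replace(/([a-zA-Z])\s(?=[a-zA-Z])/g, (_, letter) => letter + marker);
}

// Walk the parsed value and turn the marker back into a space in every
// string, including object keys. Other value types are returned unchanged.
function restoreSpaces(value, marker = SPACE_MARKER) {
  if (typeof value === "string") {
    return value.split(marker).join(" ");
  }
  if (Array.isArray(value)) {
    return value.map((item) => restoreSpaces(item, marker));
  }
  if (value !== null && typeof value === "object") {
    const restored = {};
    for (const [key, item] of Object.entries(value)) {
      restored[restoreSpaces(key, marker)] = restoreSpaces(item, marker);
    }
    return restored;
  }
  return value;
}
```

With dirty-json, the round trip would look roughly like `restoreSpaces(dJSON.parse(protectSpaces(input)))`. Note that `\s` also matches tabs and newlines, so any of those sitting between two letters would come back as a plain space.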

MartinoMensio commented 5 years ago

Thanks, I will try that temporary workaround!