drmikecrowe / mbfcext

Media Bias Fact Check extension
https://drmikecrowe.github.io/mbfcext/
MIT License
36 stars, 7 forks

Sanitize JSON output from the crawler #2

Closed: lewisje closed this issue 4 years ago

lewisje commented 7 years ago

I used the JSON file you generated by crawling mediabiasfactcheck.com to see which sources were rated Very High or High for factual reporting, and I found several problems with the data in that file. I created a subreddit for MBFC and posted what I had to do to clean it up, but some of these fixes could be done in the crawler itself, such as trimming strings and further standardizing some codes (for example, removing parentheticals from the Factual Reporting rating).
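A minimal sketch of the cleanup described above, written in Python only for illustration (it is not the crawler's actual code). The names `clean_text`, `clean_factual_rating`, and `KNOWN_RATINGS` are hypothetical, and the list of accepted labels is an assumption about what the standardized codes might look like.

```python
import re

# Assumed canonical labels; the real crawler's codes may differ.
KNOWN_RATINGS = ("VERY HIGH", "HIGH", "MIXED", "LOW")


def clean_text(value):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    # For str patterns, \s also matches Unicode whitespace such as U+00A0.
    return re.sub(r"\s+", " ", value).strip()


def clean_factual_rating(raw):
    """Remove parentheticals, trim, and upper-case a Factual Reporting value.

    Returns None when the result is not one of the assumed labels, so that
    unexpected values surface instead of being stored silently.
    """
    without_parens = re.sub(r"\([^)]*\)", " ", raw)
    rating = clean_text(without_parens).upper()
    return rating if rating in KNOWN_RATINGS else None
```

For example, `clean_factual_rating("  High (some note) ")` yields `"HIGH"`, while a string that matches none of the assumed labels yields `None`.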

If I feel motivated enough, I might just figure out how to make these changes in the crawler, and then send a PR.

drmikecrowe commented 7 years ago

@lewisje -- can you review the latest sources-all.json and see if this addresses the problem?

lewisje commented 7 years ago

I didn't notice any extra whitespace issues. I found only three non-satire sources for which the crawler found no Factual Reporting rating even though the page text has one (all were rated HIGH):

Also, I found it odd that the left-center bias rating did not use a hyphen, while pro-science and right-center do, but at least you're consistent.

I'm not sure why the crawler failed to pick out the ratings, because they are in the raw HTML. I first suspected the HTML structure, but the ratings all appear in a <strong> element inside a <span> element whose style attribute sets the CSS color property to a specific hex code, so that's not the issue.
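To illustrate the structure described above, here is a rough Python sketch that pulls the text of every <strong> element nested in a <span> whose style attribute sets a color. It uses only the standard library's `html.parser`. The class name `RatingParser`, the function `extract_ratings`, and the sample markup are hypothetical; the real pages and the real crawler may differ.

```python
import re
from html.parser import HTMLParser


class RatingParser(HTMLParser):
    """Collect the text of each <strong> nested in a <span> that sets a color."""

    def __init__(self):
        super().__init__()
        self.span_stack = []    # one bool per open <span>: does it set color?
        self.in_strong = False  # inside a <strong> that qualifies
        self.found = []         # collected text, one entry per <strong>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = dict(attrs).get("style") or ""
            # Match "color:" but not "background-color:".
            sets_color = re.search(r"(?<![-\w])color\s*:", style) is not None
            self.span_stack.append(sets_color)
        elif tag == "strong" and any(self.span_stack):
            self.in_strong = True
            self.found.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self.span_stack:
            self.span_stack.pop()
        elif tag == "strong":
            self.in_strong = False

    def handle_data(self, data):
        if self.in_strong:
            self.found[-1] += data


def extract_ratings(html):
    """Return the whitespace-normalized text of every qualifying <strong>."""
    parser = RatingParser()
    parser.feed(html)
    parser.close()
    return [" ".join(text.split()) for text in parser.found]


# Hypothetical markup shaped like the description above.
SAMPLE = (
    '<p>Factual Reporting: '
    '<span style="color: #008000;"><strong> HIGH </strong></span></p>'
    '<p><strong>Not inside a colored span</strong></p>'
)
ratings = extract_ratings(SAMPLE)
```

Running a check like this against the saved HTML of the three affected pages would show whether the element structure really matches, or whether something else (such as extra wrapping elements or different casing) trips up the crawler.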

I did not notice issues like Factual Reporting ratings having extraneous content pulled in.

With those thoughts in mind, this is what I found for the new JSON file:

rtg\bias; left leftctr center rightctr right consp fake satire prosci
very high 6 48 2 66
high 118 270 146 125 30 22
mixed 68 15 4 13 119 75
low 108
fake 276
satire 99
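
A tally like the one above can be reproduced with a short script. The sketch below is in Python and assumes, purely for illustration, that the JSON file is an object mapping each domain to a record with `bias` and `factual` keys; those key names, the function `tally`, and the sample records are hypothetical and may not match the actual layout of sources-all.json.

```python
from collections import Counter


def tally(sources):
    """Count sources per (factual rating, bias) pair.

    `sources` is assumed to map a domain name to a dict with the
    hypothetical keys "bias" and "factual". Missing or empty values
    are grouped under "(none)".
    """
    counts = Counter()
    for record in sources.values():
        bias = (record.get("bias") or "").strip().lower() or "(none)"
        rating = (record.get("factual") or "").strip().lower() or "(none)"
        counts[(rating, bias)] += 1
    return counts


# Made-up records, only to show the shape of the input and output.
sample = {
    "example-a.com": {"bias": "leftctr", "factual": "HIGH"},
    "example-b.com": {"bias": "leftctr", "factual": " high "},
    "example-c.com": {"bias": "satire", "factual": ""},
}
counts = tally(sample)
```

With real data, the input would come from something like `json.load(open(path))`, and the resulting counter can be printed row by row to compare against the figures above.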