Sanitize JSON output from the crawler

lewisje commented 7 years ago

I used the JSON file you generated by crawling mediabiasfactcheck.com to see which sources were rated Very High or High for factual reporting, and I found several problems with the data in that file. I created a subreddit for MBFC and posted what I had to do to clean it up, but some of these fixes could be done in the crawler itself, such as trimming strings and further standardizing some codes (for example, removing parentheticals from the Factual Reporting rating).

If I feel motivated enough, I might just figure out how to make these changes in the crawler, and then send a PR.
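
As a rough illustration of the cleanup described above (trimming strings and removing parentheticals from the Factual Reporting rating), here is a minimal Python sketch. It is not the crawler's code: the key name `factual`, the helper names, and the choice to upper-case the rating are all assumptions made for the example.

```python
import re

# Hypothetical key name; the real JSON may store the rating under another key.
FACTUAL_KEY = "factual"

# Matches a parenthetical together with any whitespace just before it.
_PARENTHETICAL = re.compile(r"\s*\([^)]*\)")


def clean_rating(raw):
    """Drop parentheticals, collapse whitespace, and upper-case a rating.

    Returns None when the input is not a string or nothing is left.
    """
    if not isinstance(raw, str):
        return None
    text = _PARENTHETICAL.sub("", raw)
    text = " ".join(text.split())  # trims and collapses inner whitespace
    return text.upper() or None


def sanitize_source(source):
    """Return a copy of one source record with every string value trimmed."""
    cleaned = {
        key: value.strip() if isinstance(value, str) else value
        for key, value in source.items()
    }
    if FACTUAL_KEY in cleaned:
        cleaned[FACTUAL_KEY] = clean_rating(cleaned[FACTUAL_KEY])
    return cleaned
```

With these assumptions, `sanitize_source({"name": " Reveal News ", "factual": "High (some note)"})` gives a record whose name is trimmed and whose rating is just `HIGH`.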

drmikecrowe commented 7 years ago

@lewisje -- can you review the latest sources-all.json and see if this addresses the problem?

lewisje commented 7 years ago

I didn't notice any extra whitespace issues, and I found only three non-satire sources for which the crawler found no Factual Reporting rating even though the page text does have one (all were rated HIGH):

Left-Center: Reveal News (revealnews.org)
Pro-Science: Live Science (livescience.com)
Right-Center: The Times of London (thetimes.co.uk)

Also, I found it odd that the left-center bias rating did not use a hyphen, while pro-science and right-center do, but at least you're consistent.

I'm not sure why the crawler failed to pick out the ratings, because they are in the raw HTML; I first suspected the HTML structure, but the ratings all appear to be in a <strong> element inside a <span> element that has a style attribute that sets the CSS color property to a specific hex-code, so that's not the issue.
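
For anyone who wants to check that structure against a saved copy of one of those pages, here is a small Python sketch using only the standard library. It collects text that sits inside both a <strong> element and a <span> whose style attribute sets color to a hex code. The class and function names are made up for the example, and it says nothing about how the crawler itself parses pages.

```python
import re
from html.parser import HTMLParser

# Matches a `color: #hex` declaration but not `background-color: #hex`.
_COLOR_DECL = re.compile(r"(?:^|;)\s*color\s*:\s*#[0-9a-fA-F]{3,8}")


class RatingParser(HTMLParser):
    """Collect text seen inside both a <strong> and a hex-colored <span>."""

    def __init__(self):
        super().__init__()
        self._spans = []        # one bool per open <span>: sets a hex color?
        self._strong_depth = 0  # number of currently open <strong> elements
        self.ratings = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = dict(attrs).get("style") or ""
            self._spans.append(bool(_COLOR_DECL.search(style)))
        elif tag == "strong":
            self._strong_depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self._spans:
            self._spans.pop()
        elif tag == "strong" and self._strong_depth:
            self._strong_depth -= 1

    def handle_data(self, data):
        if self._strong_depth and any(self._spans) and data.strip():
            self.ratings.append(data.strip())


def extract_ratings(html):
    """Return every candidate rating string found in the given HTML text."""
    parser = RatingParser()
    parser.feed(html)
    parser.close()
    return parser.ratings
```

If `extract_ratings` returns the rating for a page where the crawler found none, the markup is probably not the cause and the difference lies elsewhere in the crawler.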

I did not notice issues like Factual Reporting ratings having extraneous content pulled in.

With those thoughts in mind, this is what I found for the new JSON file:

rtg\bias;	left	leftctr	center	rightctr	right	consp
very high	6	48	2	66
high	118	270	146	125	30	22
mixed	68	15	4	13	119	75
low	108
fake	276
satire	99
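
A tally like the one above can be produced with a short script. The sketch below is one possible way to do it in Python, not necessarily how these numbers were computed. It assumes the JSON has already been loaded into a list of dicts, and the key names `bias` and `factual` are guesses rather than the file's confirmed schema.

```python
from collections import Counter


def tabulate(sources, bias_key="bias", factual_key="factual"):
    """Count sources per (factual rating, bias) pair.

    Missing or empty values are counted under the label "none".
    """
    counts = Counter()
    for source in sources:
        rating = (source.get(factual_key) or "none").strip().lower()
        bias = (source.get(bias_key) or "none").strip().lower()
        counts[(rating, bias)] += 1
    return counts


def format_table(counts):
    """Render the counts as tab-separated text, leaving zero cells blank."""
    ratings = sorted({rating for rating, _ in counts})
    biases = sorted({bias for _, bias in counts})
    lines = ["\t".join(["rtg\\bias"] + biases)]
    for rating in ratings:
        cells = [
            str(counts[(rating, bias)]) if counts[(rating, bias)] else ""
            for bias in biases
        ]
        lines.append("\t".join([rating] + cells))
    return "\n".join(lines)
```

Rows and columns come out in alphabetical order here; a fixed ordering (very high, high, mixed, low) would need an explicit list instead of `sorted`.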

drmikecrowe / mbfcext

Sanitize JSON output from the crawler #2