Closed by lewisje 4 years ago
@lewisje -- can you review the latest sources-all.json and see if this addresses the problem?
I didn't notice any extra whitespace issues, and I found only three non-satire sources whose Factual Reporting rating the crawler missed even though the rating appears in the page text (all three are rated HIGH):
- revealnews.org
- livescience.com
- thetimes.co.uk

Also, I found it odd that the left-center bias rating did not use a hyphen, while pro-science and right-center do, but at least you're consistent.
I'm not sure why the crawler failed to pick out the ratings, because they are in the raw HTML; I first suspected the HTML structure, but the ratings all appear to be in a `<strong>` element inside a `<span>` element that has a `style` attribute that sets the CSS `color` property to a specific hex-code, so that's not the issue.
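To make the structure described above concrete, here is a minimal Python sketch that pulls the text out of a `<strong>` nested inside a `<span>` whose `style` sets a color. It is not the crawler's actual code; the class and function names are hypothetical, and the check for `color` in the style string is deliberately loose.

```python
from html.parser import HTMLParser


class RatingParser(HTMLParser):
    """Collect the text of <strong> elements nested inside a colored <span>."""

    def __init__(self):
        super().__init__()
        self.span_stack = []    # one bool per open <span>: does its style mention a color?
        self.in_strong = False  # True while inside a qualifying <strong>
        self.ratings = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = dict(attrs).get("style") or ""
            # Loose check (assumption): any style containing "color" qualifies.
            self.span_stack.append("color" in style.lower())
        elif tag == "strong" and any(self.span_stack):
            self.in_strong = True
            self.ratings.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self.span_stack:
            self.span_stack.pop()
        elif tag == "strong":
            self.in_strong = False

    def handle_data(self, data):
        if self.in_strong:
            self.ratings[-1] += data


def extract_ratings(html):
    """Return the stripped, non-empty texts found by RatingParser."""
    parser = RatingParser()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.ratings if text.strip()]
```

Running something like this against the raw HTML of the three pages above would be one way to confirm that the markup itself is parseable and that the miss happens elsewhere in the crawler.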
I did not notice issues like Factual Reporting ratings having extraneous content pulled in.
With those thoughts in mind, this is what I found for the new JSON file:
rtg\bias; | left | leftctr | center | rightctr | right | consp | fake | satire | prosci |
---|---|---|---|---|---|---|---|---|---|
very high | 6 | 48 | 2 | 66 | |||||
high | 118 | 270 | 146 | 125 | 30 | 22 | |||
mixed | 68 | 15 | 4 | 13 | 119 | 75 | |||
low | 108 | ||||||||
fake | 276 | ||||||||
satire | 99 |
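For anyone who wants to reproduce counts like these, a tally along the following lines should work. This is only a sketch: the field names `factual` and `bias` and the shape of `sources-all.json` (a list of objects, or an object keyed by domain) are assumptions, not something confirmed by the file.

```python
import json
from collections import Counter


def load_sources(path):
    """Load entries from the JSON file, accepting either a list or a dict of entries."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    return list(data.values()) if isinstance(data, dict) else list(data)


def tabulate(sources, rating_key="factual", bias_key="bias"):
    """Count sources per (rating, bias) pair; missing values are counted as ''.

    rating_key and bias_key are hypothetical field names.
    """
    counts = Counter()
    for entry in sources:
        rating = (entry.get(rating_key) or "").strip().lower()
        bias = (entry.get(bias_key) or "").strip().lower()
        counts[(rating, bias)] += 1
    return counts
```

Calling `tabulate(load_sources("sources-all.json"))` returns a `Counter` keyed by `(rating, bias)`, which can then be printed as a grid.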
I wanted to use the JSON file you generated by crawling mediabiasfactcheck.com to see which sources were rated Very High or High for factual reporting, and I found several problems with the data in that file. I created a subreddit for MBFC and posted what I had to do to clean it up, but some of these things could very well be done in the crawler itself, such as trimming strings and further standardizing some codes (for example, removing parentheticals from the Factual Reporting rating).
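As a rough illustration of the cleanup steps mentioned above (trimming strings and dropping parentheticals), something like this Python sketch would do. The function name is hypothetical, and the crawler itself may well need the equivalent in its own language.

```python
import re

# Matches a parenthetical note and any whitespace right before it, e.g. " (see note)".
_PARENTHETICAL = re.compile(r"\s*\([^)]*\)")


def clean_rating(raw):
    """Remove parenthetical notes, trim the ends, and collapse inner whitespace."""
    without_notes = _PARENTHETICAL.sub("", raw)
    return " ".join(without_notes.split())
```

For example, `clean_rating("  HIGH (see note) ")` yields `"HIGH"`. Case is left untouched here, since how to standardize it is a separate decision.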
If I feel motivated enough, I might just figure out how to make these changes in the crawler, and then send a PR.