flairNLP / fundus

A very simple news crawler with a funny name
MIT License

[Feature Request]: Quality check for extracted articles (javascript code in articles) #275

Closed lukasgarbas closed 5 months ago

lukasgarbas commented 1 year ago

Problem statement

Here are a few examples from the data provided in #54 that are worth a closer look.

Examples whose body contains JavaScript code:

 },
  "https://www.thegatewaypundit.com/2023/06/fcking-grifters-meghan-markle-faked-podcast-interviews-her/": {
    "publishing_date": "2023-06-19 23:45:04+00:00",
    "title": "\"F*cking Grifters!\" - Meghan Markle Faked Podcast Interviews - Her Voice was Dubbed In Later - The Princess-Victim Was Paid $20 Million for the Gig | The Gateway Pundit | by Jim Hoft | 2",
    "body": "Not a bad gig if you can get it.\nLast week Prince Harry and Meghan Markle were fired from their podcast gig with Spotify. The royal fraudsters had their staff sit for the interview and Meghan’s voice was dubbed in later.\n\nThey were paid a cool $20 million for the gig. The released 12 podcasts.\n\nSpotify fired them last week.\n\nSpotify exec Bill Simmons called the royal pair “f-cking grifters.”\n\nThe New York Post reported:\n\nFOX News reported on the scandal.\n\n!function(r,u,m,b,l,e){r._Rumble=b,r[b]||(r[b]=function(){(r[b]._=r[b]._||[]).push(arguments);if(r[b]._.length==1){l=u.createElement(m),e=u.getElementsByTagName(m)[0],l.async=1,l.src=\"https://rumble.com/embedJS/u5o49d\"+(arguments[1].video?'.'+arguments[1].video:'')+\"/?url=\"+encodeURIComponent(location.href)+\"&args=\"+encodeURIComponent(JSON.stringify([].slice.apply(arguments))),e.parentNode.insertBefore(l,e)}})}(window, document, \"script\", \"Rumble\");\n\nRumble(\"play\", {\"video\":\"v2smpy4\",\"div\":\"rumble_v2smpy4\"});",
    "authors": [
      "Jim Hoft"
    ],
    "topics": []
 },

"https://www.thenation.com/article/archive/nation-readers-summer-books/": {
    "publishing_date": "2010-07-26 19:33:38+00:00",
    "title": "Nation Readers' Summer Books",
    "body": "Two weeks ago we did an informal poll of Nation writers and editors asking them what they were reading this summer. The survey produced some interesting results but not nearly as interesting as the reading suggestions tendered by our readers. \n\n\n\t\t\t\t\t\tfreestar.config.enabled_slots.push({ \n\t\t\t\t\t\t\tplacementName: \"thenation_right_rail\", \n\t\t\t\t\t\t\tslotId: \"thenation_right_rail_80975\",\n\t\t\t\t\t\t\ttargeting:{\n\t\t\t\t\t\t\t\ttn_author: ['the-n'],\n\t\t\t\t\t\t\t\ttn_articleid: [80975],\n\t\t\t\t\t\t\t\ttn_ptype: 'article',\n\t\t\t\t\t\t\t\ttn_keyword: [false],\n\t\t\t\t\t\t\t\ttn_subject: ['books-and-'],\n\t\t\t\t\t\t\t\ttn_pos: 'rectangle_1',\n\t\t\t\t\t\t\t\ttn_loc:'atf'\n\t\t\t\t\t\t\t}\t\n\t\t\t\t\t\t});\n\t\t\t\t\t  \n\nReading lists from coast to coast poured in to our website, our Facebook page, our Twitter feed and our own Nation social network....
},

"https://www.thenation.com/article/archive/vanderbilt-memorial/": {
    "publishing_date": "1869-11-18 23:55:09+00:00",
    "title": "The Vanderbilt Memorial",
    "body": "“The Commodore’s acts have touched the public,\nmore or less nearly, in a spot which is tender.”\n\n“The Commodore’s acts have touched the public, more or less nearly, in a spot which is tender.” \n\n\n\t\t\t\t\t\tfreestar.config.enabled_slots.push({ \n\t\t\t\t\t\t\tplacementName: \"thenation_right_rail\", \n\t\t\t\t\t\t\tslotId: \"thenation_right_rail_77069\",\n\t\t\t\t\t\t\ttargeting:{\n\t\t\t\t\t\t\t\ttn_author: ['the-e'],\n\t\t\t\t\t\t\t\ttn_articleid: [77069],\n\t\t\t\t\t\t\t\ttn_ptype: 'article',\n\t\t\t\t\t\t\t\ttn_keyword: [false],\n\t\t\t\t\t\t\t\ttn_subject: ['history'],\n\t\t\t\t\t\t\t\ttn_pos: 'rectangle_1',\n\t\t\t\t\t\t\t\ttn_loc:'atf'\n\t\t\t\t\t\t\t}\t\n\t\t\t\t\t\t});\n\t\t\t\t\t  \n\nAll of Mr. Vanderbilt’s acts have been in a sense private acts; which consideration has not, however, exempted all of them from public criticism–some of it, certainly, not gracious...
},

"https://www.thenation.com/article/archive/probable-effect-impeachment-trial/": {
    "publishing_date": "1868-05-01 01:48:32+00:00",
    "title": "The Probable Effect of the Impeachment Trial",
    "body": "Americans exercise their good sense by not caring much about the impeachment of President Andrew Johnson.\n\nCSU Archives/Everett Collection.Andrew Johnson \n\n\n\t\t\t\t\t\tfreestar.config.enabled_slots.push({ \n\t\t\t\t\t\t\tplacementName: \"thenation_right_rail\", \n\t\t\t\t\t\t\tslotId: \"thenation_right_rail_76564\",\n\t\t\t\t\t\t\ttargeting:{\n\t\t\t\t\t\t\t\ttn_author: ['the-e'],\n\t\t\t\t\t\t\t\ttn_articleid: [76564],\n\t\t\t\t\t\t\t\ttn_ptype: 'article',\n\t\t\t\t\t\t\t\ttn_keyword: [false],\n\t\t\t\t\t\t\t\ttn_subject: ['history'],\n\t\t\t\t\t\t\t\ttn_pos: 'rectangle_1',\n\t\t\t\t\t\t\t\ttn_loc:'atf'\n\t\t\t\t\t\t\t}\t\n\t\t\t\t\t\t});\n\t\t\t\t\t  \n\nAmericans exercise their good sense by not caring much about the impeachment of President Andrew Johnson.\n\n….Owing to a combination of circumstances, the trial has produced absolutely no excitement in the public mind. All attempts to make it “sensational” have failed utterly...
},

Links to other low-quality articles:
"https://www.thenation.com/article/world/qa-india-israel-azad-essa/",
"https://www.thenation.com/article/archive/nation-readers-summer-books/",
"https://www.thenation.com/article/archive/probable-effect-impeachment-trial/",

I found this pattern mostly in The Nation, so it would be good to look there first, although other publishers might also be affected (the first example is from The Gateway Pundit).

Solution

Maybe better parsing rules can be applied to these examples, or some URLs can be removed from the sitemap.
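As a rough illustration of what a stricter parsing rule could look like, here is a minimal sketch that collects text from HTML while skipping everything inside script and style tags. It uses only the Python standard library; `VisibleTextExtractor` and `extract_visible_text` are hypothetical names and are not part of fundus, whose parsers work differently.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes while skipping the contents of <script> and <style>.

    Hypothetical helper for illustration only; not part of fundus.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def extract_visible_text(html: str) -> str:
    """Return the text of `html`, one paragraph per chunk, without script code."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.chunks)


if __name__ == "__main__":
    sample = (
        "<div><p>First paragraph.</p>"
        "<script>freestar.config.enabled_slots.push({});</script>"
        "<p>Second paragraph.</p></div>"
    )
    print(extract_visible_text(sample))
```

The same idea carries over to selector-based parsers: exclude script and style nodes before joining the text of the selected elements.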

Additional Context

No response

MaxDall commented 1 year ago

Thanks for posting this!

This is an interesting case. I tried to reproduce the extraction with the following lines,

from fundus import PublisherCollection
from fundus.scraping.html import HTMLSource
from fundus.scraping.pipeline import Pipeline
from fundus.scraping.scraper import Scraper

# One of the affected URLs from the report above
url = "https://www.thenation.com/article/archive/nation-readers-summer-books/"

# Fetch only this URL and parse it with The Nation's own parser
publisher = PublisherCollection.us.TheNation
source = HTMLSource([url], publisher=publisher.publisher_name)
scraper = Scraper(source, parser=publisher.parser)
pipeline = Pipeline(scraper)

but the article came back fully extracted, without the script. Maybe this has something to do with spam protection and only occurs at a high enough request frequency.

Nonetheless, having some kind of quality control / sanity check to catch text like this is a great idea. Any suggestions on how to approach it?
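One possible starting point, sketched here under loose assumptions: a heuristic that flags paragraphs of an extracted body that look like JavaScript. The patterns are guessed from the examples in this issue, the threshold of two matches is arbitrary, and `looks_like_javascript` and `suspicious_paragraphs` are hypothetical names, not part of fundus.

```python
import re
from typing import List

# Patterns guessed from the examples in this issue; far from exhaustive.
JS_PATTERNS = [
    re.compile(r"\bfunction\s*\("),            # function(...) definitions
    re.compile(r"\b(?:window|document)\.\w+"),  # DOM globals with member access
    re.compile(r"\.push\(\s*\{"),               # .push({ ... }) config calls
    re.compile(r"\bencodeURIComponent\("),      # URL-building helpers
    re.compile(r"\}\s*\)\s*;"),                 # closing "});" of a call
]


def looks_like_javascript(paragraph: str, min_hits: int = 2) -> bool:
    """Return True if at least `min_hits` distinct patterns match the paragraph."""
    hits = sum(1 for pattern in JS_PATTERNS if pattern.search(paragraph))
    return hits >= min_hits


def suspicious_paragraphs(body: str) -> List[str]:
    """Split the body on blank lines and return the paragraphs that look like code."""
    paragraphs = re.split(r"\n\s*\n", body)
    return [p.strip() for p in paragraphs if looks_like_javascript(p)]


if __name__ == "__main__":
    body = (
        "The survey produced some interesting results.\n\n"
        'freestar.config.enabled_slots.push({ placementName: "x" });\n\n'
        "Reading lists from coast to coast poured in."
    )
    for paragraph in suspicious_paragraphs(body):
        print("suspicious:", paragraph)
```

Requiring two matching patterns keeps ordinary prose with a stray bracket from being flagged, at the cost of missing short snippets such as a lone call with a single argument object.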

MaxDall commented 5 months ago

With #382, script tags are no longer accidentally extracted.