Closed lukasgarbas closed 5 months ago
Thanks for posting this!
This is an interesting case. I tried to reproduce the extraction with the following lines:
from fundus import PublisherCollection
from fundus.scraping.html import HTMLSource
from fundus.scraping.scraper import Scraper
from fundus.scraping.pipeline import Pipeline
url = "https://www.thenation.com/article/archive/nation-readers-summer-books/"
publisher = PublisherCollection.us.TheNation
source = HTMLSource([url], publisher=publisher.publisher_name)
scraper = Scraper(source, parser=publisher.parser)
pipeline = Pipeline(scraper)
However, the article came back fully extracted, without the script content. This may be related to spam protection, in which case it would only occur at a high enough request frequency.
Nonetheless, having some kind of quality control / sanity check to prevent text like this is a great idea. Any ideas here?
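One possible shape for such a sanity check, as a rough sketch only: flag article text that contains several JavaScript-looking fragments. The pattern list, the `looks_like_javascript` name, and the `min_hits` threshold are all assumptions made for illustration; none of this is part of fundus, and the patterns would need tuning against real examples.

```python
import re

# Hypothetical heuristic, not part of fundus: fragments that are common in
# JavaScript source but rare in ordinary article prose.
JS_PATTERNS = [
    r"\bfunction\s*\(",   # anonymous function definitions
    r"\bvar\s+\w+\s*=",   # variable declarations
    r"\bdocument\.\w+",   # DOM access
    r"\bwindow\.\w+",     # browser globals
    r"=>\s*\{",           # arrow functions
    r"\}\s*\)\s*;",       # closing of a callback, e.g. "});"
]
JS_REGEX = re.compile("|".join(JS_PATTERNS))


def looks_like_javascript(text: str, min_hits: int = 2) -> bool:
    """Return True if at least `min_hits` JS-like fragments occur in `text`."""
    return len(JS_REGEX.findall(text)) >= min_hits
```

A check like this could run on the extracted body after parsing, so that suspicious articles are logged or skipped rather than silently ending up in the dataset. The threshold trades false positives (articles that quote code) against missed cases.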
With #382, script tags are no longer accidentally extracted.
Problem statement
A few examples from the data provided in #54 that are worth taking a look at.
Examples that include JavaScript code:
I found this pattern mostly in The Nation, so it would be good to look there first, although other publishers might also have it (the first example is from Gateway Pundit).
Solution
Maybe better parsing rules can be applied to these examples, or some URLs can be removed from the sitemap.
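To illustrate what a stricter parsing rule could look like, here is a minimal sketch that collects only visible text and ignores everything inside script and style elements. It uses Python's standard `html.parser` purely for illustration; the `VisibleTextExtractor` class and the `extract_visible_text` helper are hypothetical names, and this is not how fundus parsers are implemented.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes while ignoring content inside <script>/<style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def extract_visible_text(html: str) -> str:
    """Return the text of `html` with script and style content dropped."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For example, `extract_visible_text("<p>Summer books.</p><script>var x = 1;</script>")` keeps the paragraph text and drops the script body.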
Additional Context
No response