NewsDiffs / newsdiffs

Automatic scraper that tracks changes in news articles over time.

scraper creates "https/" articles directory #1

Open carlgieringer opened 6 years ago

carlgieringer commented 6 years ago

When running the scraper from scratch, a directory articles/https/ appears. Some articles are stored under this directory, and I don't think they are matched up in the browse view with the articles stored outside it. E.g. articles under articles/https//www.nytimes.com/ don't appear alongside those under articles/www.nytimes.com.

carlgieringer commented 6 years ago

This is due to a legacy artifact in models.Article#filename:

        elif ans.startswith('https://'):
            # Terrible hack for backwards compatibility from when https was stored incorrectly,
            # perpetuating the problem
            return 'https:/' + ans[len('https://'):]
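
To make the effect of that branch concrete, here is a minimal standalone sketch. It is not the project's actual code: `legacy_filename` reproduces only the quoted `https://` branch, and the `http://` branch and the `rstrip('/')` call are assumptions about the surrounding method. `normalized_filename` is a hypothetical alternative that drops the scheme for both protocols, shown only to illustrate how the two trees could be made to coincide.

```python
def legacy_filename(url):
    """Mimic the quoted branch: https URLs keep an 'https:/' prefix."""
    ans = url.rstrip('/')  # assumption: trailing slashes are trimmed first
    if ans.startswith('http://'):
        # Assumed behavior for plain http: drop the scheme entirely.
        return ans[len('http://'):]
    elif ans.startswith('https://'):
        # Same expression as the quoted snippet.
        return 'https:/' + ans[len('https://'):]
    raise ValueError("Unsupported URL scheme: %r" % url)


def normalized_filename(url):
    """Hypothetical alternative: drop the scheme for http and https alike."""
    ans = url.rstrip('/')
    for scheme in ('http://', 'https://'):
        if ans.startswith(scheme):
            return ans[len(scheme):]
    raise ValueError("Unsupported URL scheme: %r" % url)


http_path = legacy_filename('http://www.nytimes.com/2020/01/01/story.html')
https_path = legacy_filename('https://www.nytimes.com/2020/01/01/story.html')

# The first path component differs, so the two versions of the same
# article end up in separate directory trees.
print(http_path.split('/')[0])
print(https_path.split('/')[0])
```

Under these assumptions, the same article fetched over http and over https gets two different top-level directories, which would explain why the two sets are not shown together. Switching to something like `normalized_filename` would also require migrating the files already stored under the old prefix.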