EST-Team-Adam / TheReadingMachine

A Mean, Lean, Reading Machine

Incorporate and fix scraper issues #81

Open · mkao006 opened this issue 6 years ago

Incorporate the implementation by Marco and Luca and make it functional.

There are a few issues:

  1. The filter_links_already_seen method doesn't actually work: the scraper takes the same amount of time after multiple iterations. (screenshot: task_duration)

  2. It seems the implementation doesn't write back to the database. The table isn't even created, which caused all downstream tasks to fail (screenshot: downstream_fail). One possible fix is sketched below.
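
For reference, a minimal sketch of a write-back that creates its own table, assuming a SQLite backend and pandas; the database path, table name, and columns are hypothetical, not the project's actual schema:

    import sqlite3
    import pandas as pd

    # Hypothetical path, table, and columns; the real ones live in the
    # project's pipeline configuration.
    conn = sqlite3.connect('the_reading_machine.db')
    articles = pd.DataFrame([{'link': 'example.com/article', 'article': '...'}])

    # to_sql creates the table if it does not exist, so downstream tasks
    # no longer fail on a missing table.
    articles.to_sql('RawArticle', conn, if_exists='append', index=False)
    conn.close()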

mkao006 commented 6 years ago

The database issue has been resolved. However, the implementation still takes the same amount of time.

I suspect the main reason is that the links have been modified (UTF-8 encoding) and thus do not match the raw links.

This also resulted in duplicated articles.

mkao006 commented 6 years ago

The link comparison doesn't work because the extracted links are modified. Take the parse_item method in WorldGrainSpider, for example:

    item['link'] = response.url.replace('http://', '').replace('https://', '')

The https:// and http:// prefixes are both stripped, so the stored link never matches the raw URL.
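
A quick illustration of the mismatch (the URLs are made up):

    # The stored link has its scheme stripped, while the freshly extracted
    # link keeps it, so the membership test never fires and the article
    # is scraped (and stored) again.
    seen_links = {'www.world-grain.com/articles/example'}
    raw_url = 'https://www.world-grain.com/articles/example'

    print(raw_url in seen_links)  # False: the link is treated as new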

mrpozzi commented 6 years ago

Good job on finding this, but it ain't rocket science to fix it...

def filter_links_already_seen(self, links):
    '''Ignore previously seen links before scraping them.

    Taken from https://stackoverflow.com/questions/27649731/crawlspider-ignore-url-before-request
    '''
    for link in links:
        # Normalise the URL the same way parse_item does, so it can be
        # compared against the stripped links stored in self.seen_links.
        normalised = link.url.replace('http://', '').replace('https://', '')
        if self.only_new and normalised in self.seen_links:
            continue
        yield link
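
For completeness, here is roughly how such a filter plugs into a CrawlSpider rule via the process_links hook; the spider below is a sketch with made-up names and URLs, not the project's actual WorldGrainSpider:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class NewsSpider(CrawlSpider):
        name = 'news'                                # hypothetical
        start_urls = ['https://www.example.com/']    # hypothetical
        only_new = True
        seen_links = set()  # presumably loaded from the database on startup

        rules = (
            # process_links runs the filter over every batch of extracted
            # links before any requests are scheduled.
            Rule(LinkExtractor(), callback='parse_item',
                 process_links='filter_links_already_seen', follow=True),
        )

        def filter_links_already_seen(self, links):
            for link in links:
                normalised = link.url.replace('http://', '').replace('https://', '')
                if self.only_new and normalised in self.seen_links:
                    continue
                yield link

        def parse_item(self, response):
            # Store the link in the same normalised form used above.
            return {'link': response.url.replace('http://', '').replace('https://', '')}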

almost easier to fix than to open (yet another) issue ;)