EST-Team-Adam / TheReadingMachine

A Mean, Lean, Reading Machine

Incorporate and fix scraper issues #81

Open · mkao006 opened this issue 6 years ago

Incorporate the implementation by Marco and Luca and make it functional.

There are a few issues:

  1. The filter_links_already_seen method doesn't actually work: the scraper takes the same amount of time after multiple iterations. (screenshot: task_duration)

  2. It seems the implementation doesn't write back to the database. The table isn't even created, which caused all downstream tasks to fail (screenshot: downstream_fail). One possible fix is sketched below.
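
For reference, a minimal sketch of a write-back that creates its own table, assuming a SQLite backend and pandas; the database path, table name, and columns are hypothetical, not the project's actual schema:

    import sqlite3
    import pandas as pd

    # Hypothetical path, table, and columns; the real ones live in the
    # project's pipeline configuration.
    conn = sqlite3.connect('the_reading_machine.db')
    articles = pd.DataFrame([{'link': 'example.com/article', 'article': '...'}])

    # to_sql creates the table if it does not exist, so downstream tasks
    # no longer fail on a missing table.
    articles.to_sql('RawArticle', conn, if_exists='append', index=False)
    conn.close()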

mkao006 commented 6 years ago

The database issue has been resolved. However, the implementation still takes the same amount of time.

I suspect the main reason is that the links have been modified (UTF-8 encoding) and thus do not match the raw links.

This also resulted in duplicated articles.

mkao006 commented 6 years ago

The link comparison doesn't work because the extracted links are modified. Take the parse_item method in WorldGrainSpider, for example:

    item['link'] = response.url.replace('http://', '').replace('https://', '')

The https:// and http:// prefixes are both stripped, so the stored link never matches the raw URL.
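
A quick illustration of the mismatch (the URLs are made up):

    # The stored link has its scheme stripped, while the freshly extracted
    # link keeps it, so the membership test never fires and the article
    # is scraped (and stored) again.
    seen_links = {'www.world-grain.com/articles/example'}
    raw_url = 'https://www.world-grain.com/articles/example'

    print(raw_url in seen_links)  # False: the link is treated as new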

mrpozzi commented 6 years ago

Good job on finding this, but it ain't rocket science to fix it...

def filter_links_already_seen(self, links):
    '''Ignore previously seen links before scraping them.

    Taken from https://stackoverflow.com/questions/27649731/crawlspider-ignore-url-before-request
    '''
    for link in links:
        # Normalise the URL the same way parse_item does, so it can be
        # compared against the stripped links stored in self.seen_links.
        normalised = link.url.replace('http://', '').replace('https://', '')
        if self.only_new and normalised in self.seen_links:
            continue
        yield link
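
For completeness, here is roughly how such a filter plugs into a CrawlSpider rule via the process_links hook; the spider below is a sketch with made-up names and URLs, not the project's actual WorldGrainSpider:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class NewsSpider(CrawlSpider):
        name = 'news'                                # hypothetical
        start_urls = ['https://www.example.com/']    # hypothetical
        only_new = True
        seen_links = set()  # presumably loaded from the database on startup

        rules = (
            # process_links runs the filter over every batch of extracted
            # links before any requests are scheduled.
            Rule(LinkExtractor(), callback='parse_item',
                 process_links='filter_links_already_seen', follow=True),
        )

        def filter_links_already_seen(self, links):
            for link in links:
                normalised = link.url.replace('http://', '').replace('https://', '')
                if self.only_new and normalised in self.seen_links:
                    continue
                yield link

        def parse_item(self, response):
            # Store the link in the same normalised form used above.
            return {'link': response.url.replace('http://', '').replace('https://', '')}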

almost easier to fix than to open (yet another) issue ;)