elliotgao2 / gain

Web crawling framework based on asyncio.
GNU General Public License v3.0

Add support to handle the value of each field of an item. #9

Closed howie6879 closed 7 years ago

howie6879 commented 7 years ago

For example, given an item defined like this:

from gain import Css, Item, Parser, Spider

class Post(Item):
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        with open('scrapinghub.txt', 'a+') as f:
            f.write(self.results['title'] + '\n')

# Proposed: add a clean_<field> method that post-processes the field's value
class Post(Item):
    title = Css('.entry-title')
    content = Css('.entry-content')

    def clean_title(self,title):
        return title.strip()

    async def save(self):
        with open('scrapinghub.txt', 'a+') as f:
            f.write(self.results['title'] + '\n')

Then, in https://github.com/gaojiuli/gain/blob/master/gain/item.py, Item.__init__ could look up a matching clean_<field> method and apply it to the parsed value:

class Item(metaclass=ItemType):
    def __init__(self, html):
        self.results = {}
        for name, selector in self.selectors.items():
            value = selector.parse_detail(html)
            if value is None:
                logger.error('Selector "{}" for {} was wrong, please check again'.format(selector.rule, name))
                continue
            # Apply the optional clean_<name> hook only to values that were actually extracted
            clean_field = getattr(self, 'clean_%s' % name, None)
            if clean_field:
                value = clean_field(value)
            self.results[name] = value
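
The lookup described above can be tried without gain installed. The following is only a minimal sketch: SketchItem is a hypothetical stand-in for gain's Item, and it receives already-extracted values in a dict instead of parsing HTML with selectors.

```python
import logging

logger = logging.getLogger(__name__)


class SketchItem:
    """Hypothetical stand-in for gain's Item; not part of the library."""

    def __init__(self, raw_values):
        self.results = {}
        for name, value in raw_values.items():
            if value is None:
                # Mirrors the error branch: nothing was extracted for this field
                logger.error('No value extracted for field "%s"', name)
                continue
            # Look for an optional clean_<name> method on the subclass
            cleaner = getattr(self, 'clean_%s' % name, None)
            if callable(cleaner):
                value = cleaner(value)
            self.results[name] = value


class Post(SketchItem):
    def clean_title(self, title):
        return title.strip()


post = Post({'title': '  Hello gain \n', 'content': '<p>body</p>', 'author': None})
print(post.results)  # {'title': 'Hello gain', 'content': '<p>body</p>'}
```

Fields without a clean_<name> method pass through unchanged, and a field whose value is None is logged and skipped before any hook runs.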

elliotgao2 commented 7 years ago

Do you mean attaching a callback function to each Selector? For example:

Css('title', lambda s: s+s)

or


def clean_str(s):
    return s+s

Css('title', clean_str)
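
A per-selector callback of this kind might look roughly like the sketch below. SketchSelector, its process parameter, and the apply method are hypothetical names used for illustration; nothing in this thread confirms that gain's Css accepts a second argument.

```python
class SketchSelector:
    """Hypothetical selector that carries an optional post-processing callback."""

    def __init__(self, rule, process=None):
        self.rule = rule
        self.process = process

    def apply(self, raw_value):
        # Skip the callback when nothing was extracted or no callback was given
        if raw_value is None or self.process is None:
            return raw_value
        return self.process(raw_value)


def clean_str(s):
    return s.strip()


title = SketchSelector('.entry-title', clean_str)
shout = SketchSelector('.entry-title', lambda s: s.strip().upper())
plain = SketchSelector('.entry-content')

print(title.apply('  Hello gain \n'))  # Hello gain
print(shout.apply('  Hello gain \n'))  # HELLO GAIN
```

This keeps the cleaning logic next to the rule it belongs to, at the cost of widening the selector's constructor.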

Why not do that in the save() method? For example:

async def save(self):
    self.title = clean_str(self.title)
    self.content = tomd.convert(self.content)
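
As a rough, self-contained illustration of cleaning inside save(), the sketch below uses a hypothetical SketchPost class that reads from a results dict (as in the Post example at the top of this issue) and collects output in a list instead of writing a file.

```python
import asyncio


def clean_str(s):
    return s.strip()


class SketchPost:
    """Hypothetical item; gain's real Item builds results from parsed HTML."""

    def __init__(self, results):
        self.results = results
        self.saved = []

    async def save(self):
        # Clean the value at save time; no change to the framework is needed
        title = clean_str(self.results['title'])
        self.saved.append(title)


post = SketchPost({'title': '  Hello gain \n'})
asyncio.run(post.save())
print(post.saved)  # ['Hello gain']
```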

howie6879 commented 7 years ago

Thanks.