elliotgao2 / gain

Web crawling framework based on asyncio.

Add some built-in save() methods. #4

Closed: elliotgao2 closed this 7 years ago

elliotgao2 commented 7 years ago

For example:

class Post(Item):
    id = Css('title')

    async def save(self):
        # save to the database backend
        await super().save(self.results, type='database')

class Post(Item):
    id = Css('title')

    async def save(self):
        # save to the file backend
        await super().save(self.results, type='file')
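
Both examples assume a built-in save() on the Item base class, roughly along these lines (a sketch only; the type= keyword and both backends are hypothetical, nothing like this exists in gain yet):

import json

class Item:

    async def save(self, results, type='file'):
        # Hypothetical built-in: dispatch on the requested backend.
        if type == 'file':
            # append each result as one JSON line to a default file
            with open('results.jsonl', 'a') as f:
                f.write(json.dumps(results) + '\n')
        elif type == 'database':
            raise NotImplementedError('database backend not written yet')
        else:
            raise ValueError('unknown save type: %s' % type)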

Do you have any suggestions?

c1ay commented 7 years ago
class ResultClass:

    async def save(self, *args, **kwargs):
        raise NotImplementedError

class DataBaseResultClass(ResultClass):

    async def save(self, *args, **kwargs):
        pass  # write the result to a database

class FileResultClass(ResultClass):

    async def save(self, *args, **kwargs):
        pass  # write the result to a file

class Post(Item):
    id = Css('title')
    result_cls = FileResultClass

    def __init__(self, *args, **kwargs):
        self.save_cls = self.result_cls(*args, **kwargs)  # init result cls

    async def on_result(self):
        """
        Called for every result
        """
        await self.save_cls.save()

I think this would be better for extensibility.
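
For example, supporting a new storage target only takes one more subclass (RedisResultClass here is purely illustrative):

class RedisResultClass(ResultClass):

    async def save(self, *args, **kwargs):
        pass  # push the result to Redis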

elliotgao2 commented 7 years ago

This solution is close to what I want.

Both DataBaseResultClass and FileResultClass should have their own config: DataBaseResultClass needs database_url, username, password, database, etc., while FileResultClass needs file_path, file_ext, etc.
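
One direction I can imagine (just a sketch; the parameter names are illustrative) is for each result class to take its config through its constructor:

class DataBaseResultClass(ResultClass):

    def __init__(self, database_url, username, password, database):
        self.database_url = database_url
        self.username = username
        self.password = password
        self.database = database

class FileResultClass(ResultClass):

    def __init__(self, file_path, file_ext='.json'):
        self.file_path = file_path
        self.file_ext = file_ext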

Any suggestions?

c1ay commented 7 years ago

No good suggestion yet.

Adding a new config file would make it more complex. Adding the config on the spider, like below, would be a little better.

class MySpider(Spider):
    start_url = 'https://blog.scrapinghub.com/'
    concurrency = 5
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]

    result_config = {
        "file_path": "/data/tmp.data"
    }
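
The spider could then hand result_config to the item's result class, e.g. (make_result_cls is a hypothetical helper, not part of gain):

def make_result_cls(spider_cls, item_cls):
    # build the item's result class from the spider-level config
    return item_cls.result_cls(**spider_cls.result_config)

save_cls = make_result_cls(MySpider, Post)  # FileResultClass(file_path='/data/tmp.data')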