elliotgao2 / gain

Web crawling framework based on asyncio.
GNU General Public License v3.0

Some Suggestions #20

Closed · wisecsj closed this 7 years ago

wisecsj commented 7 years ago
  1. Add a `cookies` field to the Spider class, because some websites require login.

  2. In the parser.py file there is `await item.save()`, a function used to store results, usually in a local file (the user can override it). As far as I'm concerned, code like

    
    async def save(self):
        # Synchronous open()/write() block the event loop while they run.
        with open('scrapinghub.txt', 'a+') as f:
            f.write(str(self.results) + '\n')

is blocking, because local filesystem access is blocking. Therefore the event loop (thread) is blocked as well.
Especially when we want to store results that are several MB in size in a local file, it would slow down the whole application.

So, is it possible to use **aiofiles** (file support for asyncio, https://github.com/Tinche/aiofiles), or to use `loop.run_in_executor` so that the save function runs in another thread when the file is large?
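
For illustration, a minimal sketch of both options (the `Item` wrapper, the filename, and the `results` attribute mirror the snippet above rather than gain's actual API):

    import asyncio

    import aiofiles


    class Item:
        results = 'parsed data'  # placeholder; filled in by the parser in practice

        # Option 1: aiofiles performs the file I/O in a worker thread under the
        # hood, so awaiting it never stalls the event loop.
        async def save(self):
            async with aiofiles.open('scrapinghub.txt', 'a+') as f:
                await f.write(str(self.results) + '\n')

        # Option 2: hand a plain blocking write to the default ThreadPoolExecutor.
        def _save_sync(self):
            with open('scrapinghub.txt', 'a+') as f:
                f.write(str(self.results) + '\n')

        async def save_in_executor(self):
            loop = asyncio.get_event_loop()
            await loop.run_in_executor(None, self._save_sync)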
elliotgao2 commented 7 years ago
  1. Putting the cookies into the headers is better (see the sketch below).
  2. I agree with you; `loop.run_in_executor` is better.
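
For example (a hypothetical sketch, assuming your gain version's Spider class exposes a `headers` attribute that is sent with each request; the URLs, selector, and cookie value are placeholders):

    from gain import Css, Item, Parser, Spider


    class Post(Item):
        title = Css('h1')

        async def save(self):
            print(self.title)


    class AuthedSpider(Spider):
        start_url = 'https://example.com/'
        # Cookie header copied from an authenticated browser session.
        headers = {'Cookie': 'sessionid=abc123'}
        parsers = [Parser(r'/post/\d+', Post)]


    AuthedSpider.run()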
wisecsj commented 7 years ago

Got it... No need to add a `cookies` field, then.

georgedorn commented 7 years ago

An example of getting cookies from a login and setting them in the header would be helpful. Should I just use the requests library to do the login, then extract the appropriate cookie and set it accordingly?
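
Something like this is what I had in mind (a sketch only; the login URL and form fields are placeholders for whatever the target site expects):

    import requests

    # Log in once with a regular requests session.
    session = requests.Session()
    session.post('https://example.com/login',
                 data={'username': 'me', 'password': 'secret'})

    # Collapse the resulting cookie jar into a single Cookie header value.
    cookie_header = '; '.join(
        '{}={}'.format(name, value) for name, value in session.cookies.items()
    )
    headers = {'Cookie': cookie_header}  # then pass this to the spider's headers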

elliotgao2 commented 7 years ago

@georgedorn Copying cookies from the browser is the right way.