-
Set up a middleware:
DOWNLOADER_MIDDLEWARES = {
'Article.middlewares.RandomUserAgentMiddleware': 543,
'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
```
class RandomUserA…
-
I don't know how to get the redirect URLs with scrapy-splash. Can you help me?
e.g.
http://xxx.xxx.xxx/1.php will redirect to http://xxx.xxx.xxx/index.php. How can I get http://xxx.xxx.xxx/index.php wi…
-
Disclaimer: I'm not sure if this applies to Scrapy as a whole. I just use a few library calls in another spider.
I've been running into some infinite redirect loops (mostly from googledocs and livej…
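
For illustration only, here is a minimal sketch of one way to guard against redirect loops: remember every URL already visited in the chain and cap the number of hops. The function `follow_redirects` and the `redirect_map` dictionary (a stand-in for real HTTP responses) are hypothetical names introduced for this example and are not part of Scrapy.

```python
def follow_redirects(start_url, redirect_map, max_redirects=20):
    """Follow a chain of redirects, stopping on a loop or after too many hops.

    redirect_map is a hypothetical stand-in for the network: it maps a URL
    to the URL it redirects to. URLs missing from the map do not redirect.
    """
    seen = {start_url}
    url = start_url
    hops = 0
    while url in redirect_map:
        url = redirect_map[url]
        hops += 1
        if url in seen:
            # We have been here before, so the chain can never terminate.
            raise RuntimeError(f"redirect loop detected at {url}")
        if hops > max_redirects:
            raise RuntimeError("too many redirects")
        seen.add(url)
    return url


# A chain that terminates: a -> b -> c
chain = {"http://example.com/a": "http://example.com/b",
         "http://example.com/b": "http://example.com/c"}
final_url = follow_redirects("http://example.com/a", chain)

# A chain that loops: a -> b -> a
loop = {"http://example.com/a": "http://example.com/b",
        "http://example.com/b": "http://example.com/a"}
try:
    follow_redirects("http://example.com/a", loop)
    loop_detected = False
except RuntimeError:
    loop_detected = True
```

Within Scrapy itself, the built-in redirect middleware is bounded by the `REDIRECT_MAX_TIMES` setting, so a guard like this mainly matters when making requests outside that middleware.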
-
I've spent quite a while going through the documentation, and while I like the concept of pipelines, nowhere can I find documentation that shows how to fully implement them end-to-end.
The Pipelin…
-
There are lots of linkextractors with different flavors, but we don't need linkextractors; we just need the filters (or processors) and a good way to handle them.
What is the difference between using e…
-
**Please describe the problem this request is trying to solve**
Hello,
I'd like to suggest improving the git sync functionality so that it supports scenarios with dozens (or even hundreds) of spiders. Currently the function…
-
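Hi,
I took a look at this project and would like to ask how its distributed design actually works.
"To try distributed mode, you can run this project in another directory." I don't quite understand this sentence.
My guess is that multiple instances run at the same time and crawl in parallel. In that case, could the same pages be crawled more than once (and would doing the duplicate check in the database be inefficient)?
My idea is: 1 master and n slaves, with Redis as the medium.
master: responsible for ur…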
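
As a rough illustration of the duplicate-crawl concern raised here, the sketch below shows how several crawler instances could share one "seen" set so that each URL is accepted only once. It assumes a store whose `sadd` behaves like the Redis `SADD` command (returns 1 when the member is new, 0 when it already exists). `InMemorySet`, `fingerprint`, and `should_crawl` are hypothetical names for this example, and the in-memory class is only a stand-in for a real Redis connection, not this project's implementation.

```python
import hashlib


class InMemorySet:
    """Stand-in for a shared Redis set; mimics only the SADD return value."""

    def __init__(self):
        self._members = set()

    def sadd(self, key, member):
        # Like Redis SADD with a single member: 1 if newly added, 0 if present.
        # `key` is ignored because this stand-in holds just one set.
        if member in self._members:
            return 0
        self._members.add(member)
        return 1


def fingerprint(url):
    # Hypothetical fingerprint: SHA-1 of the URL string.
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def should_crawl(store, url, key="seen_urls"):
    # Adding and checking in one call means two instances cannot both
    # conclude that the same URL is new.
    return store.sadd(key, fingerprint(url)) == 1


shared = InMemorySet()
urls = ["http://example.com/a", "http://example.com/b", "http://example.com/a"]
accepted = [u for u in urls if should_crawl(shared, u)]
print(accepted)
```

Because the membership test and the insert happen in a single set operation, the check stays cheap compared with querying a relational database for every URL.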
-
Would it make sense to have [`DEFAULT_LOGGING`](https://github.com/scrapy/scrapy/blob/ebef6d7c6dd8922210db8a4a44f48fe27ee0cd16/scrapy/utils/log.py#L45) be read from settings before going through [`dic…
-
From https://github.com/rolando/scrapy-redis/issues/37#issuecomment-193811100
-
I think some logic from [\_\_main\_\_.py](https://github.com/apify/actor-templates/blob/dc5e68805dcf630f35d112a7e113e4f388bbf30a/templates/python-scrapy/src/__main__.py) could be moved to the SDK. I t…