-
Hi,
So below is a minimal example of the code I use in my spider (spider.py, settings.py).
**The problem is, that for the first call and the subsequent (until a few seconds pass by) in parse() f…
rubmz updated
1 month ago
-
The bookworm benchmark from https://github.com/scrapy/scrapy-bench/ (see also https://medium.com/@vermaparth/parth-gsoc-f5556ffa4025) shows about a 15% slowdown, while the more synthetic ``scrapy bench`` shows …
-
Hi Pascal,
I am working on a website that includes different domains, such as...
```
// Below are the domains in the start url section
www.rthk.hk
app3.rthk.hk
app4.rthk.hk
programme.rthk.hk
…
-
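## Understanding Scrapy
Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of useful applications such as data mining, information processing, or historical archiving.
#### Creating a project
Run `scrapy startproject tutorial` in cmd to create a new project.
This creates a tutorial directory:
tutorial/
scrapy.cfg deployment conf…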
-
Hi Pascal.
I'm doing a POC for a client where every page is in a subdirectory, and there is no filename per se.
They also have a sitemap, but that's in a cutesy format from their SEO provider, and…
-
During the global build at 2021-09-15-14-42-44, spider **spar_no** failed with **0 features** and **0 errors**.
Here's [the log](https://data.alltheplaces.xyz/runs/2021-09-15-14-42-44/logs/spar_no.tx…
-
In at least steps 3 to 5 of the [Application Containerization and Microservice Orchestration](https://training.play-with-docker.com/microservice-orchestration/) tutorial, running provided co…
-
I don't know if this goes here e_e but I've had problems when trying to parse a tar.gz as HTML (now I check the extension) and I want to propose adding this type of file to the ignored ones in sc…
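
For illustration, the extension check mentioned above could look roughly like the sketch below. It uses only the standard library. The names `SKIPPED_EXTENSIONS` and `has_skipped_extension` are hypothetical, and the extension set is an assumption, not Scrapy's actual ignore list.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical set of extensions to skip; this is NOT Scrapy's built-in list.
SKIPPED_EXTENSIONS = {"tar.gz", "tgz", "gz", "zip"}


def has_skipped_extension(url, skipped=SKIPPED_EXTENSIONS):
    """Return True if the URL path ends with one of the skipped extensions."""
    # Look only at the path, so query strings and fragments are ignored.
    path = urlparse(url).path.lower()
    name = posixpath.basename(path)
    # Match on the full suffix so multi-part extensions like "tar.gz" work.
    return any(name.endswith("." + ext) for ext in skipped)
```

A spider callback could call `has_skipped_extension(response.url)` and return early before trying to parse the body as HTML.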
-
### Description
The `OffsiteMiddleware` logs a single message for each domain filtered. Great!
But then the `core.engine` logs a message for every single URL filtered by the OffsiteMiddleware.
(L…
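
As a rough illustration of "one message per domain", here is a minimal sketch of a standard-library `logging.Filter` that lets through only the first record seen for each domain. It does not reflect how Scrapy's engine or `OffsiteMiddleware` log internally. The class name `OncePerDomainFilter` is hypothetical, and it assumes the URL is passed as the first logging argument.

```python
import logging
from urllib.parse import urlparse


class OncePerDomainFilter(logging.Filter):
    """Pass only the first log record seen for each domain.

    Assumption: the URL is the first positional logging argument,
    e.g. logger.debug("Filtered request to %s", url).
    """

    def __init__(self):
        super().__init__()
        self.seen_domains = set()

    def filter(self, record):
        args = record.args
        # Records without positional arguments are passed through untouched.
        if not isinstance(args, tuple) or not args:
            return True
        domain = urlparse(str(args[0])).hostname
        if domain is None:
            return True  # first argument is not a URL; do not filter
        if domain in self.seen_domains:
            return False  # this domain was already reported once
        self.seen_domains.add(domain)
        return True
```

Attaching it with `logging.getLogger("some.logger").addFilter(OncePerDomainFilter())` would apply it to records created on that logger; which logger name to use is left open here.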
-
```
Stacktrace (most recent call last):
File "scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
…