-
Assume the crawler has set `allowed_domains` to the list below:
`self.allowed_domains = ['albert.zgora.pl']`
Scrapy shouldn't go beyond the 'albert.zgora.pl' domain.
But it goes to:
https://www.tumblr.com/wi…
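For context, this is roughly what an `allowed_domains` check means: a request's host must equal an allowed domain or be a subdomain of it. The sketch below is a simplified stdlib re-implementation of that rule, not Scrapy's actual offsite middleware; one hedged caveat is that in some Scrapy versions redirect targets are followed by the downloader before this filter sees them, which may explain the off-domain URLs above.

```python
from urllib.parse import urlparse

def is_allowed(url, allowed_domains):
    # Simplified offsite check (a sketch, not Scrapy's actual code):
    # a URL passes if its host equals an allowed domain or is a subdomain.
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

allowed = ["albert.zgora.pl"]
print(is_allowed("http://albert.zgora.pl/page", allowed))    # True
print(is_allowed("https://www.tumblr.com/widget", allowed))  # False
```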
-
http://app.zgsyb.com.cn/paper/layout/202208/26/l01.html
-
https://hzdaily.hangzhou.com.cn/dskb/2022/08/21/page_detail_2_20220821A05.html
It seems these redirect.
-
## Motivation
I am currently running a broad crawl on ~3 million starting URLs using the suggested settings from this [page](https://docs.scrapy.org/en/latest/topics/broad-crawls.html). Since pause…
-
Hi,
So below is a minimal example of the code I use in my spider (spider.py, settings.py).
**The problem is that for the first call and the subsequent ones (until a few seconds pass by), in parse() f…
-
In the HTML we are using, the base tag is set. This HTML also contains a huge amount of comments and whitespace, so the base tag does not appear within the first 4096 characters.
In the code here - htt…
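The behavior described above can be reproduced with a small stdlib sketch. The function and the 4096-character limit below mirror the reported behavior, not the actual library code: when a large comment pushes `<base href=...>` past the scanned window, the tag is missed and relative links resolve against the wrong base.

```python
import re
from urllib.parse import urljoin

# Simplified re-implementation of the reported behavior: only the first
# `limit` characters of the document are searched for a <base> tag.
BASE_RE = re.compile(r"<base[^>]*href\s*=\s*['\"]([^'\"]+)['\"]", re.I)

def find_base_url(html, fallback, limit=4096):
    match = BASE_RE.search(html[:limit])
    return match.group(1) if match else fallback

# A large comment block pushes <base> past the 4096-char window.
padding = "<!-- " + "x" * 5000 + " -->"
html = f"<html><head>{padding}<base href='https://example.com/app/'></head></html>"

base = find_base_url(html, fallback="https://example.com/page.html")
print(base)                   # the fallback: the <base> tag was missed
print(urljoin(base, "next"))  # relative links resolve against the wrong base
```

Raising the scan limit (last assertion-style check) finds the tag, which is essentially what the issue asks for.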
-
I am not able to create a spider.
**To Reproduce**
Steps to reproduce the behavior:
1. Create a new project
2. Add the starting URL and domain
3. Click on run
4. See the error
**Traceback**
Tr…
-
I am having the damnedest time trying to unquote the URL in some requests. Any plans to add that as an option? It seems like I have to monkeypatch to fix it, as middleware won't work, but monkeypatching …
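For clarity, this is what "unquoting" refers to here, shown with the standard library on a hypothetical URL (an illustration of the requested transformation, not a fix for the issue):

```python
from urllib.parse import unquote

# Percent-encoded URL (hypothetical example); unquoting decodes
# escapes such as %20 (space) and %2F (slash).
encoded = "https://example.com/a%20b/c%2Fd"
print(unquote(encoded))  # https://example.com/a b/c/d
```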
-
Hi Pascal.
I'm doing a POC for a client where every page is in a subdirectory, and there is no filename per se.
They also have a sitemap, but that's in a cutesy format from their SEO provider, and…
-
The bookworm benchmark from https://github.com/scrapy/scrapy-bench/ (see also https://medium.com/@vermaparth/parth-gsoc-f5556ffa4025) shows about a 15% slowdown, while the more synthetic ``scrapy bench`` shows …