-
-
The upcoming parsel 1.7.0 exposes, and flips, the lxml flag that controls the protection described [here](https://lxml.de/FAQ.html#is-lxml-vulnerable-to-xml-bombs), so it's now possible to scrape cert…
-
**Problem statement**
A typical scenario when using the Scrapy middleware to auto-extract e.g. product page URLs is that said URLs may respond with `404` status.
However, the library does not pr…
-
An exception is raised when scraping kommersant:
```
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
curren…
-
Now that we are refactoring the spiders and adding item loaders for data collection, we need to test the spiders programmatically.
Currently …
-
I saw that the request is replaced with `dont_filter=True`; if I remove that, the spider just stops when it gets to the same URL.
I need to use the offsite middleware though, so any thoughts?
I will do …
-
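## Overview
Crawl a site based on its sitemap.xml
## Candidate specifications
- [x] Use sitemap.xml as the seed
- [ ] Allow multiple sitemap.xml files to be set as seeds
- [x] Sitemap indexes are also supported
- Follow sitemaps that match a pattern
- [x] Fetch the pages behind URLs that match the article pattern
- To an approach where exclusion patterns are registered…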
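
The pattern-based behaviour in the list above could be sketched roughly as follows. This is a minimal standard-library illustration, not the project's implementation and not Scrapy's `SitemapSpider`. The function name `parse_sitemap` and its `follow`, `include`, and `exclude` parameters are hypothetical names chosen for this example. It handles one sitemap document at a time and leaves fetching to the caller.

```python
import re
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text, follow=r".", include=r".", exclude=None):
    """Return (child_sitemaps, page_urls) for a single sitemap document.

    follow  -- regex a child sitemap URL must match to be followed (assumed name)
    include -- regex a page URL must match to be fetched (assumed name)
    exclude -- optional regex; matching page URLs are dropped (assumed name)
    """
    root = ET.fromstring(xml_text)
    # Collect every <loc> value in the sitemap namespace.
    locs = [el.text.strip() for el in root.iterfind(".//sm:loc", NS) if el.text]

    # A sitemap index lists other sitemaps rather than pages.
    if root.tag.endswith("sitemapindex"):
        return [u for u in locs if re.search(follow, u)], []

    # A regular urlset lists pages: apply include, then exclude.
    pages = [
        u for u in locs
        if re.search(include, u) and not (exclude and re.search(exclude, u))
    ]
    return [], pages
```

A caller would start from one or more seed sitemap URLs, download each one, pass the body to `parse_sitemap`, queue the returned child sitemaps for the same treatment, and request the returned page URLs.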
-
Changing the value of that setting [has been seen to work around some bans](https://github.com/scrapy/scrapy/issues/4951#issuecomment-758185916), so it may be worth mentioning in https://docs.scrapy.o…
-
![unccas](https://user-images.githubusercontent.com/36261426/48948876-52ba1780-ef36-11e8-808b-634153d1e665.jpg)
Projects scraped from the Unccas site: the address is not scraped; is it possible to do so? And the …
-
I have a working spider scraping image URLs and placing them in image_urls field of a scrapy.Item. I have a custom pipeline that inherits from ImagesPipeline. When a specific URL returns a non-200 htt…