-
## Description
We are using the Algolia Crawler UI to parse our mixed static HTML & SPA website (which uses a hash router). All URLs are provided in the `sitemaps` Crawler config.
```js
new Crawler({
st…
-
One neat feature inside Scrapy is its [LinkExtractors](https://github.com/scrapy/scrapy/blob/64905e3397a5b837312169a0b418857ef1cf40c7/scrapy/linkextractors/lxmlhtml.py) functionality. We usually try …
-
## Summary
Add an option to the LinkExtractor class to consider all tags and attributes (e.g. if you pass `None`, all tags/attributes are considered), along with `deny_tags` and `deny_attrs` arguments …
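A sketch of the proposed filtering semantics in plain Python (hypothetical helper, not Scrapy's actual implementation): `None` as the allow list means "accept everything", and a deny list is applied afterwards.

```python
def keep(name, allowed=None, denied=()):
    """Accept a tag/attribute name: an allow list of None means allow all;
    anything on the deny list is dropped afterwards."""
    if allowed is not None and name not in allowed:
        return False
    return name not in denied

# With tags=None and deny_tags=("script",), only <script> is filtered out
kept = [t for t in ("a", "area", "script", "img")
        if keep(t, allowed=None, denied=("script",))]
```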
-
There are lots of link extractors with different flavors, but we don't need the link extractors themselves; we just need the filters (or processors) and a good way to handle them.
What is the difference between using e…
-
Lab https://training.play-with-docker.com/microservice-orchestration/
`python linkextractor.py` does not work because the container ships only `python3`
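A workaround inside the lab container (assuming the script itself is Python 3 compatible) is to call the interpreter explicitly, or to pick whichever binary exists:

```shell
# Fall back to python3 when the plain `python` binary is absent
PY=$(command -v python || command -v python3)
"$PY" linkextractor.py
```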
-
LiDO has an API that can be used to systematically find the references its LinkExtractor has found in a specific document; see https://linkeddata.overheid.nl/front/portal/services. This can be used…
-
Regression?
I have an HTML file that contains a link like:
`Words`
I'm extracting with code that looks like this:
```python
link_extractor = LinkExtractor(restrict_xpaths=xpath)
tmp_links =…
-
Is this the intended behaviour of `LinkExtractor`? I seem unable to extract relative URLs when using it. Alternatively, if I use a selector for `a` elements, I can capture everything.
For r…
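For reference, resolving relative hrefs against the page URL can be reproduced with the standard library; this is a simplified standalone sketch of what a link extractor has to do, not Scrapy's actual code:

```python
from urllib.parse import urljoin

def resolve_links(base_url, hrefs):
    """Turn raw (possibly relative) hrefs into absolute URLs."""
    return [urljoin(base_url, h) for h in hrefs]

links = resolve_links(
    "https://example.com/docs/index.html",
    ["page2.html", "/about", "https://other.org/x"],
)
```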
-
Hello,
I think it would be useful to add a priority to Rule, so developers can use CrawlSpider with a priority property that is automatically passed on to the Spider object.
The expected Rule would be som…
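A minimal sketch of the idea in plain Python (hypothetical `Rule`/`Request` stand-ins, not Scrapy's real classes): the rule carries a priority that the spider copies onto every request it generates from that rule.

```python
from dataclasses import dataclass

@dataclass
class Request:
    url: str
    priority: int = 0

@dataclass
class Rule:
    priority: int = 0  # hypothetical new argument

    def build_request(self, url):
        # The CrawlSpider would forward the rule's priority to each request
        return Request(url, priority=self.priority)

req = Rule(priority=10).build_request("https://example.com/")
```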
-
### Description
I needed to automatically generate URLs from `href="javascript:xxx"` links, and tried using `LinkExtractor` and `process_value()` as mentioned in the [scrapy docs](https://docs.scra…
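The docs' approach boils down to a `process_value` callable that pulls the real URL out of the `javascript:` href and returns `None` for links that should be dropped. A standalone sketch, assuming links of the form `javascript:goToPage('...')`:

```python
import re

def process_value(value):
    """Extract the URL embedded in a javascript: link; None drops the link."""
    m = re.search(r"javascript:goToPage\('(.*?)'\)", value)
    return m.group(1) if m else None

url = process_value("javascript:goToPage('../other/page.html'); return false")
```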