-
## Description
We are using the Algolia Crawler UI to parse our mixed static-HTML & SPA website (which uses a hash router). All URLs are provided in the `sitemaps` Crawler config.
```js
new Crawler({
st…
```
-
One neat feature inside Scrapy is its [LinkExtractors](https://github.com/scrapy/scrapy/blob/64905e3397a5b837312169a0b418857ef1cf40c7/scrapy/linkextractors/lxmlhtml.py) functionality. We usually try …
-
## Summary
Add the option to the LinkExtractor class to consider all tags and attributes (e.g. if you pass `None` then consider all tags/attributes), and `deny_tags` and `deny_attrs` arguments …
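A minimal pure-Python sketch of the proposed semantics, not Scrapy's implementation (`deny_tags`/`deny_attrs` are the names proposed in this issue and do not exist in Scrapy's `LinkExtractor`), assuming `tags=None`/`attrs=None` mean "consider all" and the deny sets subtract from that:

```python
from html.parser import HTMLParser

class AnyTagLinkExtractor(HTMLParser):
    """Sketch of the proposal: tags=None / attrs=None mean "consider all",
    and deny_tags / deny_attrs subtract from that set."""
    def __init__(self, tags=None, attrs=None, deny_tags=(), deny_attrs=()):
        super().__init__()
        self.tags = tags                  # None => accept every tag
        self.attrs = attrs                # None => accept every attribute
        self.deny_tags = set(deny_tags)
        self.deny_attrs = set(deny_attrs)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in self.deny_tags:
            return
        if self.tags is not None and tag not in self.tags:
            return
        for name, value in attrs:
            if name in self.deny_attrs:
                continue
            if self.attrs is not None and name not in self.attrs:
                continue
            if value:
                self.links.append(value)

html = '<a href="/a">x</a> <img src="/b.png"> <script src="/c.js"></script>'
ex = AnyTagLinkExtractor(deny_tags={"script"})
ex.feed(html)
print(ex.links)  # ['/a', '/b.png'] — every attribute of every non-denied tag
```

Note that with `attrs=None` this collects *every* attribute value, so in practice one would still pass an allow-list of attributes (`href`, `src`, …) or a `deny_attrs` set.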
-
Lab https://training.play-with-docker.com/microservice-orchestration/
`python linkextractor.py` does not work because the container has Python 3 only.
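A quick workaround inside the lab container (assuming a typical Linux image where `python3` is on the PATH):

```shell
# The lab container ships Python 3 only, so the bare `python` command fails.
# Call the interpreter explicitly:
if [ -f linkextractor.py ]; then
    python3 linkextractor.py
fi
# Or make `python` resolve to python3 for the rest of the session:
alias python=python3
```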
-
The most recent run of the petsathome_gb spider from 2023-05-15 has returned 50 fewer stores than the previous run from 2023-04-15. I've checked a few of the missing stores, and they all appear to sti…
rjw62 updated 3 weeks ago
-
There are lots of link extractors with different flavors, but we don't need link extractors; we just need the filters (or processors) and a good way to handle them.
What is the difference between using e…
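One way to read that first sentence: keep extraction trivial and put all the policy into a pipeline of small filter functions. A hedged sketch of that design (the names and the filter-chain shape are illustrative, not an existing API):

```python
from urllib.parse import urlparse

# Each filter takes a URL and returns it (possibly rewritten) or None to drop it.
def deny_extensions(exts):
    def f(url):
        return None if urlparse(url).path.endswith(tuple(exts)) else url
    return f

def same_domain(domain):
    def f(url):
        return url if urlparse(url).netloc in ("", domain) else None
    return f

def strip_fragment(url):
    return url.split("#", 1)[0]

def run_filters(urls, filters):
    """Pass each URL through the chain; a None from any filter drops it."""
    for url in urls:
        for f in filters:
            url = f(url)
            if url is None:
                break
        else:
            yield url

urls = ["https://example.com/a#top",
        "https://other.com/b",
        "https://example.com/x.pdf"]
filters = [strip_fragment, same_domain("example.com"), deny_extensions([".pdf"])]
print(list(run_filters(urls, filters)))  # ['https://example.com/a']
```

The appeal of this shape is that any extractor flavor can feed the same chain, and adding a new rule means adding one small function rather than a new extractor class.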
-
Regression?
I have a HTML file that contains a link like:
`Words`
I'm extracting with code that looks like this:
```
link_extractor = LinkExtractor(
    restrict_xpaths=xpath)
tmp_links =…
```
-
LiDO has an API which can be used to systematically find the references their LinkeXtractor has found in a specific document, see https://linkeddata.overheid.nl/front/portal/services. This can be used…
-
Is this the intended behaviour of `LinkExtractor`? I don't seem to be able to extract relative URLs when using it. If I use a selector for `a` elements instead, I can capture everything.
For r…
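For reference, the usual handling is to resolve each candidate against the page URL before filtering; the stdlib building block for that step is `urllib.parse.urljoin`. A minimal sketch of the selector-style approach plus resolution (the `HrefCollector` class is illustrative, not part of any library):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class HrefCollector(HTMLParser):
    """Collect raw href values from <a> tags, like selecting 'a::attr(href)'."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)

page_url = "https://example.com/docs/index.html"
html = '<a href="page2.html">next</a> <a href="/about">about</a>'
c = HrefCollector()
c.feed(html)
absolute = [urljoin(page_url, h) for h in c.hrefs]
print(absolute)
# ['https://example.com/docs/page2.html', 'https://example.com/about']
```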
-
Hello,
I think it would be useful to add a priority to Rule, so developers can use CrawlSpider with a priority property, and the property would automatically be passed to the Spider object.
The expected Rule would be som…
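A sketch of what the proposal could look like; `priority` does not exist on Scrapy's `Rule` today, so the classes below are minimal stand-ins for `scrapy.spiders.Rule` and `scrapy.Request`, not the real API:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    # Stand-in for scrapy.spiders.Rule with the proposed extra field.
    link_pattern: str
    priority: int = 0          # proposed: forwarded to Request(priority=...)

@dataclass
class Request:
    url: str
    priority: int = 0

def requests_from_rules(urls, rules):
    """Build a request per URL, tagging it with the priority of the first
    rule it matches (substring match here, purely for illustration)."""
    out = []
    for url in urls:
        for rule in rules:
            if rule.link_pattern in url:
                out.append(Request(url, priority=rule.priority))
                break
    # Higher priority scheduled first, mirroring Scrapy's scheduler behaviour.
    out.sort(key=lambda r: -r.priority)
    return out

rules = [Rule("/product/", priority=10), Rule("/category/", priority=0)]
urls = ["https://shop.test/category/1", "https://shop.test/product/2"]
reqs = requests_from_rules(urls, rules)
print([(r.url, r.priority) for r in reqs])
# [('https://shop.test/product/2', 10), ('https://shop.test/category/1', 0)]
```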