-
Last night the Kiwix Wiki monitor was constantly throwing errors (Connection Timeout, 502).
The service did not restart, but it apparently had difficulty handling a high number of requests.
Peaking at…
-
I think DuckAssistBot is a good test case for where we want to draw the line between AI crawlers and other crawlers.
The README currently says
> This is an open list of web crawlers associated wit…
-
I was surprised to find HTTP clients like `python-requests`, `Go-http-client`, `wget`, `curl`, etc. included in the crawler list. While I understand that these tools can be abused, in our case a large …
-
Add the cookie of the session from which the request was made to the context, both for HTTP crawlers and Playwright.
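For illustration only, a rough sketch of how this might look from inside a request handler, assuming a `PlaywrightCrawler`; the `context.session.cookies` access is the hypothetical piece being requested here, not an existing API, and import paths may differ between crawlee versions:
```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler()

@crawler.router.default_handler
async def handler(context: PlaywrightCrawlingContext) -> None:
    # Hypothetical: the cookies of the session this request was made from,
    # exposed on the crawling context (the feature requested above).
    if context.session is not None:
        context.log.info(f'Session cookies: {context.session.cookies}')

if __name__ == '__main__':
    asyncio.run(crawler.run(['https://example.com']))
```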
-
https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot
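For reference, the verification flow described there (reverse DNS on the requester's IP, a domain check, then a forward lookup that must map back to the same IP) can be sketched with Python's standard `socket` module; the function name and the exact checks are my own simplification of that page:
```python
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Verify a claimed Googlebot IP via reverse + forward DNS."""
    try:
        hostname, _aliases, _ips = socket.gethostbyaddr(ip)  # reverse DNS
    except OSError:
        return False
    # Genuine Googlebot hosts resolve to googlebot.com or google.com domains.
    if not hostname.endswith(('.googlebot.com', '.google.com')):
        return False
    try:
        # Forward lookup must map back to the original IP.
        _, _, forward_ips = socket.gethostbyname_ex(hostname)
    except OSError:
        return False
    return ip in forward_ips
```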
-
When implementing #386, we disallowed crawlers for the test system:
```
# robots.txt for test.nwbib.de
User-agent: *
Disallow: /
```
However, currently nothing is disallowed on test, see htt…
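A quick way to check what the live file actually allows is a sketch with Python's standard `urllib.robotparser` (host taken from the snippet above):
```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the robots.txt currently served by the test system.
parser = RobotFileParser()
parser.set_url('https://test.nwbib.de/robots.txt')
parser.read()

# With the intended "Disallow: /" in place, this should print False;
# if it prints True, nothing is disallowed for generic crawlers.
print(parser.can_fetch('*', 'https://test.nwbib.de/'))
```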
-
While running:
> /ail-framework/bin/crawlers/Crawler.py
Is this error related to lacus?
```
Traceback (most recent call last):
File "/root/ail-framework/AILENV/lib/python3.8/site…
-
## Ability to edit the robots.txt file
The `robots.txt` file is a simple text file placed in the root directory of a website. It serves as a set of instructions for web crawlers (like those used b…
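For illustration, a minimal example of the kind of file such an editing feature would manage (the rules shown are placeholders, not a recommendation):
```
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
```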
-
### Pitch
The default Mastodon robots.txt file already blocks GPTBot. I'd like to suggest that it should also block some of the other crawlers that scrape sites to collect data for AI training:
```
Us…
-
- Technical debt introduced in https://github.com/apify/crawlee-python/pull/167
- Should probably be ported over to JS afterwards (v4?)
The reasoning behind the change was that some errors (such a…