-
Last night the Kiwix Wiki monitor was constantly throwing errors (Connection Timeout, 502).
The service did not restart, but it apparently had difficulty handling a high number of requests.
Peaking at…
-
I was surprised to find HTTP clients like `python-requests`, `Go-http-client`, `wget`, `curl`, etc. included in the crawler list. While I understand that these tools can be abused, in our case a large …
-
- We could create a new documentation guide for scaling the crawlers (mainly the features from the `_autoscaling` subpackage); a minimal sketch follows the list below.
- The guide should include the following:
  - `ConcurrencySettings` - how u…
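For illustration, here is a minimal sketch of what the `ConcurrencySettings` part of the guide could show. It assumes the import paths of a recent crawlee for Python release, and the numbers are purely illustrative, not recommendations:

```python
import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        concurrency_settings=ConcurrencySettings(
            min_concurrency=2,         # never drop below 2 parallel tasks
            max_concurrency=16,        # hard upper bound on parallel tasks
            desired_concurrency=8,     # starting point the autoscaler adjusts from
            max_tasks_per_minute=120,  # overall rate limit across all tasks
        ),
    )

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Crawling {context.request.url}')

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```

The autoscaled pool then adjusts the actual concurrency between the configured minimum and maximum based on observed system load.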
-
## Ability to edit the robots.txt file
The `robots.txt` file is a simple text file placed in the root directory of a website. It serves as a set of instructions for web crawlers (like those used b…
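For illustration, a minimal `robots.txt` could look like the following; the paths and sitemap URL are placeholders:

```
# Applies to every crawler that honors the file
User-agent: *
# Keep crawlers out of this path
Disallow: /admin/
# Everything else may be crawled
Allow: /
Sitemap: https://example.com/sitemap.xml
```

Note that `robots.txt` is advisory: well-behaved crawlers follow it, but nothing enforces it.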
-
# Website URLs
- Main site: https://www.quotemedia.com/
- Target site to crawl: https://research.quotemedia.com/home/news?symbol=NEWS
# Problem
1. This site serves dynamic content: to get its news links, you first have to click a link (see Figure 1), which then brings up a page containing the links (see Figure 2).
2. How should this link-indexing problem be solved? …
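One common workaround for pages like this is to drive a real browser so the click actually happens before links are collected. Below is a minimal sketch using Playwright's sync API (one option among several); the `a.news-trigger` selector is a hypothetical placeholder for the element shown in Figure 1:

```python
# Requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://research.quotemedia.com/home/news?symbol=NEWS")

    # Step 1: click the element that triggers the dynamic content (Figure 1).
    page.click("a.news-trigger")  # hypothetical selector -- inspect the real page

    # Step 2: wait for the page with the actual links to finish rendering (Figure 2).
    page.wait_for_load_state("networkidle")

    # Collect every href from the rendered DOM.
    links = page.eval_on_selector_all("a", "els => els.map(e => e.href)")
    print(links)

    browser.close()
```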
-
Add a robots.txt file and `noindex`/`nofollow` headers to prevent crawlers from indexing our services.
Research the current best practice here.
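As a starting point for that research: alongside `robots.txt`, major search crawlers also honor the `X-Robots-Tag` response header. Here is a minimal sketch of setting it, using Flask purely as a hypothetical stand-in for whatever framework the services actually run on:

```python
from flask import Flask

app = Flask(__name__)


@app.after_request
def add_noindex_header(response):
    # "noindex, nofollow" asks crawlers not to index the page
    # and not to follow any links on it.
    response.headers["X-Robots-Tag"] = "noindex, nofollow"
    return response
```

Unlike a `<meta name="robots">` tag, the header also covers non-HTML responses such as PDFs and API payloads.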
-
Google News API? Web crawlers? Twitter bots?
-
### Pitch
The default Mastodon robots.txt file already blocks GPTBot. I'd like to suggest that it should also block some of the other crawlers that scrape sites for AI-training data:
```
Us…
-
Add a robots.txt file to block web crawlers used for AI training
https://www.cyberciti.biz/web-developer/block-openai-bard-bing-ai-crawler-bots-using-robots-txt-file/
- [ ] create robots.txt
- [ ] add…
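For reference, a non-exhaustive sketch of what such a `robots.txt` could contain; the user-agent tokens below should be re-checked against each vendor's current documentation, since this list goes stale quickly:

```
# Block common AI-training crawlers (non-exhaustive)
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: CCBot
User-agent: Google-Extended
User-agent: ClaudeBot
User-agent: PerplexityBot
Disallow: /
```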