-
This ticket is a placeholder for general API rate and access limiting logic to better control the load placed on the service and provide options in case of system instability.
Rate limiting was men…
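As a starting point, a token bucket is one common way to implement this kind of limiting; here is a minimal sketch (the class name and parameters are illustrative, not part of the service):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: allows `rate` requests per second
    on average, with bursts up to `capacity`. Illustrative sketch."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A limiter like this could be keyed per client or per endpoint, and requests that fail `allow()` would get a 429 response.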
-
Hi all,
I was wondering if there were specific restrictions on web crawling certain sites?
For example if one tried to web crawl Medscape:
```python
from trafilatura.spider import focused_crawler
```
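Independent of any library, one baseline restriction to check is the site's robots.txt; a small sketch using the standard library (the rules are inlined here for illustration — in practice you would point `set_url()` at the site's real robots.txt and call `read()`):

```python
from urllib.robotparser import RobotFileParser

# Parse robots.txt rules; here they are inlined as a stand-in for
# fetching e.g. https://www.medscape.com/robots.txt over the network.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/private/page"))  # disallowed
print(rp.can_fetch("*", "https://example.com/article"))       # allowed
```

Sites may also impose restrictions beyond robots.txt (terms of service, IP-based blocking), which a parser like this cannot detect.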
-
## Bug report
**Problem**
I am sending LED data over WiFi from Unity. It will work great for 2-3 minutes, then hang up. Through some debugging, I found that every time effectmanager.h NextE…
-
Sitemaps are currently not supported. Implementing sitemap support might help the crawler with URL discovery on some sites.
There are some risks though. Some sitemaps are _huge_. Look at neocities'…
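If sitemap support is added, streaming the XML instead of loading it whole would mitigate the size risk; a rough sketch using the standard library (the function name and the URL cap are assumptions, not an existing API):

```python
import xml.etree.ElementTree as ET
from io import BytesIO

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def iter_sitemap_urls(stream, limit=50_000):
    """Stream <loc> entries from a sitemap without holding the whole
    document in memory; stop after `limit` URLs as a safety valve."""
    count = 0
    for _, elem in ET.iterparse(stream):
        if elem.tag == NS + "loc" and elem.text:
            yield elem.text.strip()
            count += 1
            if count >= limit:
                return
        elem.clear()  # free parsed elements as we go

sample = b"""<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""
print(list(iter_sitemap_urls(BytesIO(sample))))
```

A hard cap like this (plus a download size limit) keeps a pathological sitemap from exhausting the crawler's memory.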
-
My log files for MRBS system sites have exploded to gigabytes in size in one day. They are filled with errors like this:
[Mon Sep 11 09:12:26 2006] [error] [client 66.249.65.163] FastCGI:
> ser…
-
I've noticed that there are some sites that go to a page that says some iteration of "Access Denied" or "Verify you are a human." I think this is mostly caused by the VPN (i.e. the VPN IP address is b…
-
Query parameters and malformed URLs cause SEO headaches, even when canonical tags and crawling/indexing directives are used. From cache-misses to wasted crawl resources, to fragmentation and indexing …
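One common mitigation is normalizing URLs before they are crawled or cached, so that equivalent variants collapse to a single canonical form; a sketch with the standard library (the tracking-parameter list is an illustrative assumption):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Common tracking parameters to strip (illustrative, not exhaustive).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def normalize_url(url: str) -> str:
    """Lowercase scheme/host, drop tracking parameters, sort the rest,
    and strip the fragment so equivalent URLs map to one form."""
    parts = urlsplit(url)
    query = sorted(
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS
    )
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        urlencode(query),
        "",  # drop fragment
    ))

print(normalize_url("HTTPS://Example.com/page?b=2&utm_source=x&a=1#frag"))
```

Normalizing this way improves cache hit rates and reduces duplicate crawling, though it cannot fix genuinely malformed URLs.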
-
The sitemap could use caching to help large sites cope with search-engine crawling.
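A simple time-based cache around the sitemap generator would cover the common case; a minimal sketch (class and parameter names are hypothetical, not an existing API):

```python
import time

class SitemapCache:
    """Cache a generated sitemap for `ttl` seconds so repeated crawler
    requests do not regenerate it every time. Illustrative sketch."""

    def __init__(self, generate, ttl: float = 3600.0):
        self.generate = generate  # callable producing the sitemap XML
        self.ttl = ttl
        self._value = None
        self._expires = 0.0

    def get(self):
        now = time.monotonic()
        if self._value is None or now >= self._expires:
            self._value = self.generate()
            self._expires = now + self.ttl
        return self._value
```

For very large sites the generator could additionally write the result to disk or a shared cache so multiple workers reuse it.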
-
Because of the nature of the project, bridges _will_ break for various reasons: sites change, rate limits are put in place, IPs get blocked, paywalls appear, HTML changes slightly, and so on. The issue trac…
-
https://techcrunch.com/2023/06/30/twitter-now-requires-an-account-to-view-tweets/
The Nitter crawler will need to be recreated...