-
I am developing a scraper using the `spider_py` library and am running into issues with the crawling depth functionality: the depth behavior appears inconsistent across different sites.…
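For reference, here is the depth semantics I would expect (breadth-first, with the seed page at depth 0 and links expanded only up to the configured limit). This is a plain-Python sketch, not the `spider_py` API; `requests`, `BeautifulSoup`, and `example.com` are only stand-ins.

```python
# Sketch of expected depth-limited crawling, NOT the spider_py API.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(start_url: str, max_depth: int = 2) -> dict:
    """Breadth-first crawl recording the depth at which each page was reached."""
    seen = {start_url: 0}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand links found beyond the configured depth
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            # stay on the same host so depth is measured within one site
            if urlparse(link).netloc == urlparse(start_url).netloc and link not in seen:
                seen[link] = depth + 1
                queue.append((link, depth + 1))
    return seen


if __name__ == "__main__":
    pages = crawl("https://example.com", max_depth=2)
    print(f"fetched {len(pages)} pages")
```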
-
It would be nice to be able to index work sites like Confluence, Jira, and internal documentation sites. Is there any way to configure the crawler with my login cookies? Or add the crawler as a plugin…
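As a rough sketch of what I mean, assuming cookies exported from an already logged-in browser session: the cookie names, the Confluence URL, and the `requests`-based client below are all placeholders, not this crawler's API.

```python
# Sketch only: reuse browser session cookies to fetch authenticated pages.
# Cookie names/values and the Confluence host are placeholders.
import requests

session = requests.Session()
session.cookies.update({
    "JSESSIONID": "<copied-from-browser>",        # placeholder
    "seraph.confluence": "<copied-from-browser>", # placeholder
})
session.headers["User-Agent"] = "internal-docs-indexer/0.1"

resp = session.get("https://confluence.example.internal/rest/api/content?limit=25")
resp.raise_for_status()
for page in resp.json().get("results", []):
    print(page["id"], page["title"])
```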
-
Google's AI crawler is `Google-Extended`
> Google-Extended is a standalone product token that web publishers can use to manage whether their sites help improve Gemini Apps and Vertex AI generative …
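Publishers who do not want their content used that way can target the token in `robots.txt`; a typical opt-out entry looks like:

```
User-agent: Google-Extended
Disallow: /
```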
-
## Ability to edit the robots.txt file
The `robots.txt` file is a simple text file placed in the root directory of a website. It serves as a set of instructions for web crawlers (like those used b…
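As a quick illustration of how a compliant crawler reads those instructions, Python's standard-library parser can answer per-agent allow/deny questions (`example.com` is a placeholder):

```python
# Minimal sketch: load a site's robots.txt and check whether a given
# user agent may fetch a given URL (example.com is a placeholder).
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetches and parses the file

print(rp.can_fetch("Google-Extended", "https://example.com/docs/page"))
print(rp.can_fetch("*", "https://example.com/private/"))
```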
-
# Sources
[PoE RSS](https://www.pathofexile.com/news/rss)
[Last Epoch RSS](https://forum.lastepoch.com/c/announcements/37.rss)
[Torchlight: Infinite Wiki (tlidb.com)](https://tlidb.com/#:~:text=Run…
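A hedged sketch of pulling items from the feeds above, using the third-party `feedparser` package (the entry fields shown are feedparser's usual RSS attributes):

```python
# Sketch: fetch the announcement feeds listed above and print recent items.
import feedparser

FEEDS = [
    "https://www.pathofexile.com/news/rss",
    "https://forum.lastepoch.com/c/announcements/37.rss",
]

for url in FEEDS:
    parsed = feedparser.parse(url)
    print(f"== {url} ({len(parsed.entries)} entries)")
    for entry in parsed.entries[:5]:
        print(f"  {entry.get('published', 'n/a')}  {entry.title}  {entry.link}")
```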
-
We want to be able to obtain all web and media content associated with a specific list of pre-identified domain names.
This issue tracks domain names identified in the [**BigScience Data Cataloging Ev…
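A minimal sketch of the scope filter this implies, assuming the catalogue boils down to a set of registered domains (the domain names below are placeholders, not the actual list):

```python
# Sketch of a crawl-scope filter: keep only URLs whose host matches one of
# the pre-identified domains (or a subdomain of one). Placeholder domains.
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.org", "example-archive.net"}  # placeholders

def in_scope(url: str) -> bool:
    host = urlparse(url).netloc.lower().split(":")[0]
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

assert in_scope("https://media.example.org/images/a.jpg")
assert not in_scope("https://example.org.evil.com/")
```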
-
We ran into an issue where a deploy preview from Netlify was sticking around and showing up in search results. We don't want that to happen, so we should look into adding a robots.txt or noI…
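One hedged option, assuming preview builds can ship their own static files: have deploy previews serve a blanket noindex via a Netlify `_headers` file (robots.txt alone only blocks crawling and will not necessarily drop URLs that are already indexed):

```
/*
  X-Robots-Tag: noindex
```

The file would need to be generated only for preview builds, e.g. by checking the deploy context Netlify exposes to the build at build time.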
-
Generate robots.txt for sites.
Each site will have its own robots.txt, which must be resolved dynamically by adding a `/robots.txt` route to https://github.com/BeaconCMS/beacon/blob/7790eb72769a026c…
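Beacon itself is Elixir/Phoenix, so the following is only a language-agnostic sketch of the idea in Python/Flask: resolve the serving site from the request host and render that site's own rules (host names and rules below are placeholders):

```python
# Concept sketch only (not Beacon's stack): a /robots.txt route that looks up
# the serving site by request host and returns that site's rules as text/plain.
from flask import Flask, Response, request

app = Flask(__name__)

# Placeholder per-site rules, keyed by host name.
SITE_ROBOTS = {
    "blog.example.com": "User-agent: *\nAllow: /\nSitemap: https://blog.example.com/sitemap.xml\n",
    "docs.example.com": "User-agent: *\nDisallow: /drafts/\n",
}

@app.route("/robots.txt")
def robots_txt() -> Response:
    host = request.host.split(":")[0]
    body = SITE_ROBOTS.get(host, "User-agent: *\nDisallow: /\n")
    return Response(body, mimetype="text/plain")
```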
-
## Introduction
Complex web applications often need to keep tabs on the subresources that they download, for security purposes.
In particular, upcoming industry standards and best practices (e.g.…
-
@katehausladen provided some initial accuracy analysis as shown in [our draft paper](https://drive.google.com/open?id=1lkE6BdyVFfmE2fdPvDWKkvLcr6Rk8wqV&usp=drive_fs) (section 3.5). Starting w…