-
I am developing a web scraper using the `spider_py` library and am running into issues with the crawl-depth functionality: the depth behavior appears inconsistent across different sites.…
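For comparison, here is how a depth limit behaves in Scrapy (used only as a reference point; this is not `spider_py`'s API). It is a minimal sketch in which `DepthMiddleware` drops any request more than `DEPTH_LIMIT` hops from the start URL:

```python
# Depth-limit sketch using Scrapy (for comparison only, not spider_py).
# Links more than DEPTH_LIMIT hops from the start URL are never scheduled.
import scrapy
from scrapy.crawler import CrawlerProcess

class DepthSpider(scrapy.Spider):
    name = "depth_check"
    start_urls = ["https://example.com"]  # placeholder start page
    custom_settings = {
        "DEPTH_LIMIT": 2,             # drop requests deeper than 2 hops
        "DEPTH_STATS_VERBOSE": True,  # log page counts per depth level
    }

    def parse(self, response):
        # response.meta["depth"] is maintained by Scrapy's DepthMiddleware
        yield {"url": response.url, "depth": response.meta.get("depth", 0)}
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(DepthSpider)
    process.start()
```

Logging the depth per page this way makes it easy to see whether inconsistencies come from the limit itself or from how each site links its pages.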
-
Google's AI crawling opt-out token is `Google-Extended`:
> Google-Extended is a standalone product token that web publishers can use to manage whether their sites help improve Gemini Apps and Vertex AI generative …
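In practice, publishers opt out by targeting that token in `robots.txt` (standard robots syntax; blocking the whole site here is just one choice):

```
# Opt the entire site out of Google's AI training via the Google-Extended token
User-agent: Google-Extended
Disallow: /
```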
-
### Describe the problem to be solved
To make PeerTube sites easier for search engines to index, we should provide more information about the videos in sitemap.xml.
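One possible shape for this is Google's video sitemap extension; a minimal sketch follows (the PeerTube URLs below are hypothetical placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <!-- hypothetical watch-page URL -->
    <loc>https://peertube.example/w/abc123</loc>
    <video:video>
      <video:thumbnail_loc>https://peertube.example/thumbs/abc123.jpg</video:thumbnail_loc>
      <video:title>Example video title</video:title>
      <video:description>Short description of the video.</video:description>
      <video:content_loc>https://peertube.example/static/abc123.mp4</video:content_loc>
      <video:duration>600</video:duration>
    </video:video>
  </url>
</urlset>
```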
### Describe the …
-
It would be nice to be able to index work tools like Confluence, Jira, and internal documentation sites. Is there any way to configure the crawler with my login cookies? Or add the crawler as a plugin…
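As a workaround sketch (not an existing crawler feature), a session cookie copied from the browser's dev tools can be replayed with the `requests` library; the cookie name, value, and URLs below are placeholders:

```python
# Reuse an existing browser login session by replaying its cookie.
# JSESSIONID and the domain/URL are placeholder values.
import requests

session = requests.Session()
session.cookies.set(
    "JSESSIONID", "<value from browser dev tools>",
    domain="confluence.example.com",
)

resp = session.get("https://confluence.example.com/display/DOCS/Home")
resp.raise_for_status()
print(resp.status_code, len(resp.text))
```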
-
I have included several websites for testing, but no matter what questions I ask, the answers I receive generally indicate that there is no relevant content in the context.
![image](https://github.com/user-a…
-
# Sources
[PoE RSS](https://www.pathofexile.com/news/rss)
[Last Epoch RSS](https://forum.lastepoch.com/c/announcements/37.rss)
[Torchlight: Infinite Wiki (tlidb.com)](https://tlidb.com/#:~:text=Run…
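A quick way to sanity-check these feeds is a sketch using the `feedparser` library (feed URLs copied from the list above):

```python
# Pull the newest entries from each source feed listed above.
import feedparser

FEEDS = [
    "https://www.pathofexile.com/news/rss",
    "https://forum.lastepoch.com/c/announcements/37.rss",
]

for url in FEEDS:
    feed = feedparser.parse(url)
    for entry in feed.entries[:5]:  # newest few items per feed
        print(entry.get("published", "?"), entry.title, entry.link)
```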
-
## Ability to edit the robots.txt file
The `robots.txt` file is a simple text file placed in the root directory of a website. It serves as a set of instructions for web crawlers (like those used b…
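A hypothetical example of the kind of rules an editable `robots.txt` could contain (the paths and sitemap URL are placeholders):

```
# Allow everything except the admin area, and advertise the sitemap
User-agent: *
Disallow: /admin/
Allow: /

Sitemap: https://example.com/sitemap.xml
```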
-
We want to be able to obtain all web and media content associated with a specific list of pre-identified domain names.
This issue tracks domain names identified in the [**BigScience Data Cataloging Ev…
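A rough sketch of what per-domain fetching could look like (the domain list and `requests` usage are illustrative only, not the project's actual tooling):

```python
# Fetch the homepage of each pre-identified domain and report its status.
# DOMAINS stands in for the catalogued list tracked by this issue.
import requests

DOMAINS = ["example.org", "example.net"]  # placeholder domains

for domain in DOMAINS:
    try:
        resp = requests.get(f"https://{domain}", timeout=10)
        print(domain, resp.status_code, resp.headers.get("Content-Type"))
    except requests.RequestException as exc:
        print(domain, "failed:", exc)
```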
-
We ran into an issue where a deploy preview from Netlify was sticking around and showing up in search results. We don't really want that to happen, so we should look at adding a robots.txt or noindex headers to deploy previews.
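One possible approach, sketched with Netlify's `netlify.toml` header syntax (to scope this to previews only, the file would likely need to be generated at build time based on Netlify's `CONTEXT` environment variable; verify against Netlify's docs before relying on it):

```toml
# Send a noindex header for every path on this deploy.
[[headers]]
  for = "/*"
  [headers.values]
    X-Robots-Tag = "noindex"
```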
-
Maybe add other sites in the future, or rewrite the crawling code to be more flexible.