brave / web-discovery-project

Web Discovery Project
Mozilla Public License 2.0

Donate crawl data to the Internet Archive #353

Open upintheairsheep opened 11 months ago

upintheairsheep commented 11 months ago

Hello, this is more related to Brave Search itself, but could you get in contact with the Internet Archive and donate crawl data to the Wayback Machine? Alexa Internet did that until its dissolution by Amazon in 2020. The Wayback Machine is an extremely useful resource, used all over the world by researchers, journalists, and basically anyone on YouTube investigating something online, such as the origin of an urban legend. Since you already have the Wayback Machine integrated into the browser, donating the crawl data should reduce the chance of a link being completely lost to time.

The crawl data donated by Brave would be extremely helpful. In return, you could ask the Archive staff for a list of all URLs archived on the Wayback Machine, deduplicate it, and add to the search results the links that Brave has not crawled and that are still up, to help build a third search engine that rivals Google and Bing. Other good sources of links could be https://ODCrawler.xyz and many AI image datasets.
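A rough sketch of the deduplicate-and-filter step suggested above, in Python. Everything here is an assumption for illustration: the function names `normalize` and `new_candidates` are hypothetical, the inputs are plain lists of URL strings rather than any real Internet Archive or Brave export format, and the liveness check is passed in as a callable so that no particular HTTP client is implied.

```python
from typing import Callable, Iterable, List
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Lowercase the scheme and host and drop the fragment so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def new_candidates(
    archived: Iterable[str],
    crawled: Iterable[str],
    is_live: Callable[[str], bool],
) -> List[str]:
    """Return archived URLs that are deduplicated, not already crawled, and still live.

    `is_live` is a hypothetical caller-supplied check (for example an HTTP HEAD request).
    """
    crawled_set = {normalize(u) for u in crawled}
    seen = set()
    result = []
    for url in archived:
        key = normalize(url)
        if key in seen or key in crawled_set:
            continue  # duplicate, or already in the existing index
        seen.add(key)
        if is_live(key):
            result.append(key)
    return result


# Toy data only; a real run would stream far larger lists.
archived = [
    "https://Example.com/a",
    "https://example.com/a#top",
    "https://example.com/b",
    "https://example.org/gone",
]
crawled = ["https://example.com/b"]
candidates = new_candidates(archived, crawled, lambda u: "gone" not in u)
print(candidates)
```

At the scale of a real web index the sets would not fit in memory, so something like sorted files or a Bloom filter would probably replace the in-memory `set`; the logic would stay the same.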

andreas-hartmann commented 4 months ago

I also think the dataset generated by this project should be made public, or be linked to an existing public web crawler project, instead of creating another walled-garden index.