YGGverse / YGGo

YGGo! Distributed Web Search Engine
MIT License

White list / black list websites, robots.txt pre-sets #5

Closed d47081 closed 1 year ago

d47081 commented 1 year ago

So, trackers that depend on external seeders are a problem inside the network

Nice start..

I mean this subject covers the websites we need to crawl, plus some mirrors we may need to block or limit via crawlPageLimit/CRAWL_HOST_DEFAULT_PAGES_LIMIT

Some ideas here, with a few relevant references: https://github.com/YGGverse/YGGo/issues/1#issuecomment-1498137314

And I would like to ask: do we need to enable the GitHub Discussions page, or should Issues stay focused on resolving problems rather than discussion?

ygguser commented 1 year ago

And I would like to ask: do we need to enable the GitHub Discussions page, or should Issues stay focused on resolving problems rather than discussion?

Perhaps it would be better to chat and discuss development in "Discussions", and use this section to solve problems with what is already implemented :), as well as to consider user requests. I think that would be more traditional for GitHub.

d47081 commented 1 year ago

Well, for this subject I have implemented a new feature that relates to the hostPage.robotsPostfix field in the database, plus a new configuration option is available:

/*
 * Permanent rules appended to the remote robots.txt if it exists, otherwise to CRAWL_ROBOTS_DEFAULT_RULES
 * The crawler does not overwrite these rules
 *
 * Presets
 * yggdrasil: /database/yggdrasil/host.robotsPostfix.md
 *
 */
define('CRAWL_ROBOTS_POSTFIX_RULES', null); // string|null
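For illustration, the option could be set to a rule string like the following (the paths below are hypothetical examples, not a shipped preset):

```php
// Hypothetical example: extra rules appended after the remote robots.txt
// (string|null; null keeps the option disabled)
define('CRAWL_ROBOTS_POSTFIX_RULES', "Disallow: /mirror/\nDisallow: /tracker/");
```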

In a few words: we can append extra robots.txt rules via the hostPage.robotsPostfix field, and this data will not be overwritten by the remote robots.txt on auto-update.
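As a sketch of the intended behavior (the paths are made up for the example), the effective rule set is the remote file plus the postfix, with the postfix part surviving auto-updates:

```
# Remote robots.txt (refreshed on auto-update)
User-agent: *
Disallow: /admin/

# hostPage.robotsPostfix (never overwritten by the crawler)
Disallow: /mirror/

# Effective rules applied by the crawler
User-agent: *
Disallow: /admin/
Disallow: /mirror/
```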

For white/black list needs we don't need any new feature implementation, because we can simply disable crawling and indexing of a specific domain's pages via the host.status field.
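As a sketch, assuming a host table with a boolean status flag (the name column here is a guess for illustration; only the host.status field is confirmed above), blacklisting a domain could look like:

```sql
-- Hypothetical example: stop crawling/indexing pages of one host
UPDATE `host` SET `status` = 0 WHERE `name` = 'example-mirror.ygg';
```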

And finally, to close this subject, I have created a database configuration preset where everyone can contribute proposals. Because I'm using this engine for Yggdrasil network scanning, I have separated this registry into a dedicated folder (the engine could be used for other networks as well):

https://github.com/YGGverse/YGGo/tree/main/database/yggdrasil

d47081 commented 1 year ago

https://github.com/YGGverse/YGGo/tree/main/database/yggdrasil

Just for a note: those data sets depend on the crawler configuration, so I have moved these variables to the manifest API, where each application can grab the data matching its specific requirements.

I'm working on a distributed ecosystem, so for right now it's <meta name="yggo" content="/yggo/api.php?action=manifest" />

This option can be enabled by the node owner with the API_ENABLED + API_MANIFEST_ENABLED settings.
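To make the discovery flow concrete: a node that enables both settings exposes the manifest endpoint, and other nodes find it via the meta tag in the node's HTML head. The JSON fields below are illustrative guesses, not the actual API schema:

```html
<!-- Published in the node's HTML head -->
<meta name="yggo" content="/yggo/api.php?action=manifest" />
```

```json
{
  "status": true,
  "result": {
    "crawlHostDefaultPagesLimit": 1000
  }
}
```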