Closed: d47081 closed this issue 1 year ago
And I would like to ask: should we enable the GitHub Discussions page, or are Issues meant for resolving problems, not for talk?
Perhaps it would be better to chat and discuss development in Discussions, and use this section to solve existing (already implemented :)) problems, as well as to consider user requests. I think that would be more traditional for GitHub.
Well, for this subject I have implemented a new feature that relates to the `hostPage.robotsPostfix` field in the database, plus a new configuration option is now available:
```php
/*
 * Permanent rules appended to the robots.txt if it exists, else to CRAWL_ROBOTS_DEFAULT_RULES
 * The crawler does not overwrite these rules
 *
 * Presets
 * yggdrasil: /database/yggdrasil/host.robotsPostfix.md
 *
 */
define('CRAWL_ROBOTS_POSTFIX_RULES', null); // string|null
```
In a few words, we can append extra robots.txt rules to the `hostPage.robotsPostfix` field, and this data will not be overwritten by the remote robots.txt on auto-update.
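To make the merge order concrete, here is a minimal sketch of how the rules could be combined on auto-update; the function name and the merge logic are my assumptions, not the actual YGGo code:

```php
<?php
// Hypothetical sketch, not the actual YGGo implementation: combine a freshly
// downloaded robots.txt with the permanent local rules. The constants are
// expected to come from the application configuration.
function mergeRobotsRules(?string $remoteRobots, ?string $robotsPostfix): string
{
    // Use the remote robots.txt when present, else fall back to the defaults
    $rules = $remoteRobots ?? CRAWL_ROBOTS_DEFAULT_RULES ?? '';

    // The per-host postfix from hostPage.robotsPostfix is always appended,
    // never overwritten by the auto-update
    if ($robotsPostfix) {
        $rules .= PHP_EOL . $robotsPostfix;
    }

    // The global postfix from CRAWL_ROBOTS_POSTFIX_RULES, if configured
    if (CRAWL_ROBOTS_POSTFIX_RULES) {
        $rules .= PHP_EOL . CRAWL_ROBOTS_POSTFIX_RULES;
    }

    return $rules;
}
```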
For whitelist/blacklist needs we don't need any new feature implementation, because we can simply disable crawling and indexing of a specific domain's pages via the `host.status` field.
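For illustration, a hypothetical query that flips that flag; the table and column names follow the text above, while the connection details, the domain, and the assumption that `status` is a boolean flag are mine:

```php
<?php
// Hypothetical example: disable crawling and indexing for a single host
// by clearing its host.status flag. DSN and credentials are placeholders.
$db = new PDO('mysql:host=127.0.0.1;dbname=yggo', 'user', 'password');

$query = $db->prepare('UPDATE `host` SET `status` = ? WHERE `name` = ?');
$query->execute([0, 'example.ygg']);
```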
And finally, to close this subject, I have created a database configuration preset where everyone can contribute their propositions. Because I use this engine for Yggdrasil network scanning, I have separated this registry into its own folder (because the engine could be used for other networks also):
https://github.com/YGGverse/YGGo/tree/main/database/yggdrasil
Just for a note, those data sets depend on the crawler configuration, so I have moved these variables to the manifest API, where each application is able to grab the data matching its specific requirements.
I work on the distributed ecosystem, so for right now it's:

```html
<meta name="yggo" content="/yggo/api.php?action=manifest" />
```
This option can be enabled by the node owner with the `API_ENABLED` + `API_MANIFEST_ENABLED` settings.
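As a rough sketch of how another node could use this tag, assuming the manifest endpoint returns JSON (the meta name and endpoint path come from the tag above; the host URL and parsing approach are placeholders):

```php
<?php
// Hypothetical sketch of manifest discovery by a remote node. Assumes the
// target node has API_ENABLED and API_MANIFEST_ENABLED set to true, and
// that the manifest endpoint responds with JSON.
$home = 'http://example.ygg';

// Grab the homepage and look for the <meta name="yggo" ...> tag
$html = file_get_contents($home);

if (preg_match('/<meta\s+name="yggo"\s+content="([^"]+)"/i', $html, $match)) {

    // Fetch the advertised manifest and decode it
    $manifest = json_decode(file_get_contents($home . $match[1]), true);

    print_r($manifest);
}
```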
So, trackers with external seeders are shit inside the network.
Nice start..
I mean this subject is for the websites we need to crawl, and maybe some mirrors that we need to block or limit by the `crawlPageLimit` / `CRAWL_HOST_DEFAULT_PAGES_LIMIT` setting.
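For example, a hypothetical per-installation cap; the constant name comes from the text above, and the value is an arbitrary placeholder:

```php
// Hypothetical example: cap how many pages get crawled per host, so mirrors
// are limited instead of blocked outright. The value is a placeholder.
define('CRAWL_HOST_DEFAULT_PAGES_LIMIT', 1000);
```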
Ideas here, just a few relevant relations: https://github.com/YGGverse/YGGo/issues/1#issuecomment-1498137314