Follow robots.txt yes/no

fergiemcdowall / norch-fetch

Fetch pure HTML from a webserver and save it to disk

MIT License


Follow robots.txt yes/no #9

Open eklem opened 10 years ago

eklem commented 10 years ago

Add an option, -f --followrobotstxt <yes/no>, to choose whether your fetcher plays nice or not.

eklem commented 10 years ago

I guess there are two things to check for. 1: The user agent, and whether robots.txt has a group that matches it specifically or only the * group applies. 2: Build an array of the parts of the site not to follow, and check each link that the crawler wants to follow against that array.
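The two checks could be sketched roughly like this. It is a minimal illustration in Python, not norch-fetch code: the function names `parse_disallowed` and `is_allowed` and the agent string "norch-fetch" are made up for the example, and it only handles plain `Disallow` prefixes (no `Allow` lines, no wildcards).

```python
from urllib.parse import urlparse


def parse_disallowed(robots_txt, user_agent):
    """Return the Disallow path prefixes that apply to user_agent."""
    groups = {}          # lowercased agent token -> list of disallowed prefixes
    current_agents = []  # agents named by the group currently being read
    reading_rules = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if reading_rules:  # a User-agent line after rules starts a new group
                current_agents = []
                reading_rules = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        else:
            reading_rules = True
            if field == "disallow" and value:  # empty Disallow blocks nothing
                for agent in current_agents:
                    groups[agent].append(value)
    ua = user_agent.lower()
    # Check 1: prefer a group that names this agent, else fall back to "*".
    for agent, rules in groups.items():
        if agent and agent != "*" and agent in ua:
            return rules
    return groups.get("*", [])


def is_allowed(url, disallowed):
    """Check 2: compare the link's path against the array of blocked prefixes."""
    path = urlparse(url).path or "/"
    return not any(path.startswith(prefix) for prefix in disallowed)
```

The crawler would call `parse_disallowed` once per site and then `is_allowed` for every link before following it.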

eklem commented 10 years ago

And default to "yes". The user-agent string ties into this, but that option isn't needed to develop this one. https://github.com/fergiemcdowall/norch-fetch/issues/10
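For what it's worth, the proposed flag with its "yes" default might look something like the sketch below. It is written in Python with `argparse` purely for illustration; the helper names `build_parser` and `should_follow_robots` are hypothetical and this is not the project's actual option parsing.

```python
import argparse


def build_parser():
    """Build a parser exposing only the proposed robots.txt flag."""
    parser = argparse.ArgumentParser(description="Hypothetical fetcher CLI")
    parser.add_argument(
        "-f", "--followrobotstxt",
        choices=["yes", "no"],
        default="yes",  # play nice unless the user opts out
        help="obey robots.txt rules (default: yes)",
    )
    return parser


def should_follow_robots(argv):
    """Return True when the fetcher should respect robots.txt."""
    args = build_parser().parse_args(argv)
    return args.followrobotstxt == "yes"
```

Restricting the value to "yes" or "no" means a typo such as `-f ye` is rejected up front instead of silently disabling the check.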