R-Sandor / FindFirst

Organizing the information that matters to you and your teams. The knowledge of your world.
https://findfirst.dev
Apache License 2.0

[Server] Before initiating the scrape, check robots.txt on the domain. #236

Closed R-Sandor closed 1 month ago

R-Sandor commented 1 month ago

Details

Resource: https://yoast.com/ultimate-guide-robots-txt/
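Since the issue asks the server to fetch the domain's robots.txt before initiating a scrape, here is a minimal sketch of that first step using Java's built-in `HttpClient`. The class and method names are illustrative only, not FindFirst's actual API; treating a missing or unreachable robots.txt as "no restrictions" is a common convention, but it is an assumption here.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;

// Illustrative sketch: fetch https://<domain>/robots.txt before scraping.
public class RobotsTxtFetcher {

    // Returns the robots.txt body if the domain serves one, empty otherwise.
    static Optional<String> fetchRobotsTxt(String domain) {
        try {
            HttpClient client = HttpClient.newBuilder()
                    .followRedirects(HttpClient.Redirect.NORMAL)
                    .build();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://" + domain + "/robots.txt"))
                    .GET()
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200
                    ? Optional.of(response.body())
                    : Optional.empty();
        } catch (Exception e) {
            // Assumption: no reachable robots.txt means no restrictions apply.
            return Optional.empty();
        }
    }
}
```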

Robots.txt syntax

A robots.txt file consists of one or more blocks of directives, each starting with a user-agent line. The "user-agent" is the name of the specific spider the block addresses. You can have one block for all search engines, using a wildcard for the user-agent, or particular blocks for particular search engines. A search engine spider will always pick the block that best matches its name.

These blocks look like this (don’t be scared, we’ll explain below):

User-agent: * 
Disallow: / 

User-agent: Googlebot 
Disallow: 

User-agent: bingbot 
Disallow: /not-for-bing/ 

Directives like Allow and Disallow are not case-sensitive, so it's up to you whether you write them in lowercase or capitalize them. The values are case-sensitive, however: /photo/ is not the same as /Photo/. We like capitalizing directives because it makes the file easier for humans to read.
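The matching rules above (case-insensitive directives, case-sensitive values, a spider picking the block for its own name and falling back to the `*` block) can be sketched in Java. This is a deliberately minimal illustration, not FindFirst's implementation: it only handles `User-agent` and `Disallow` prefix rules, ignoring `Allow`, wildcards in paths, and longest-match precedence.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of robots.txt block matching (illustrative only).
public class RobotsTxtCheck {

    // Parse robots.txt text into user-agent -> disallowed path prefixes.
    static Map<String, List<String>> parse(String robotsTxt) {
        Map<String, List<String>> blocks = new LinkedHashMap<>();
        List<String> current = null;
        for (String line : robotsTxt.split("\n")) {
            line = line.trim();
            int colon = line.indexOf(':');
            if (line.startsWith("#") || colon < 0) continue;
            // Directives are case-insensitive; values are case-sensitive.
            String directive = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (directive.equals("user-agent")) {
                current = blocks.computeIfAbsent(value.toLowerCase(),
                        k -> new ArrayList<>());
            } else if (directive.equals("disallow")
                    && current != null && !value.isEmpty()) {
                current.add(value);
            }
        }
        return blocks;
    }

    // A spider uses the block matching its name, falling back to "*".
    static boolean isAllowed(Map<String, List<String>> blocks,
                             String userAgent, String path) {
        List<String> rules = blocks.getOrDefault(userAgent.toLowerCase(),
                blocks.getOrDefault("*", List.of()));
        for (String prefix : rules) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```

With the example file above, Googlebot (empty Disallow) may crawl anything, bingbot is blocked only under /not-for-bing/, and every other spider falls back to the `*` block and is blocked entirely.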

joelramilison commented 1 month ago

Please assign me to this, currently working on it :)