bopoda / robots-txt-parser

PHP class for parsing all directives from robots.txt files according to specifications
http://robots.jeka.by
MIT License
44 stars 17 forks

Check valid content #24

Open LeMoussel opened 7 years ago

LeMoussel commented 7 years ago

To optimize parsing, a first step would be to check whether the content is valid at all. For example, Google treats this HTML content as invalid: <!DOCTYPE html PUBLIC "-//w3c//dtd html 4.0 transitional//en">

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
        <meta name="Description" content="Test Google robots.txt">
        <title>Test Google robots.txt...</title>
    </head>
    <body>
        <p>
        User-agent:Cocon.Se Crawler<br />
        Disallow: /<br />
        User-agent:*<br />
        Disallow:<br />
        </p>
    </body>
</html>

Source: http://3.test.cocon.se/robots.txt

One solution would be to check whether the content starts with a robots.txt directive. If not, the content is invalid and can't be parsed.
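A rough sketch of that check might look like the following. This is a hypothetical helper, not part of robots-txt-parser: it skips blank lines and comments, then accepts the content only if the first meaningful line starts with a known robots.txt directive followed by a colon.

```php
<?php
// Hypothetical validity check (illustration only, not the library's API):
// content is considered a plausible robots.txt file only if its first
// non-blank, non-comment line is "<known-directive>: ...".
function looksLikeRobotsTxt(string $content): bool
{
    $directives = ['user-agent', 'disallow', 'allow', 'sitemap',
                   'crawl-delay', 'host', 'clean-param'];

    foreach (preg_split('/\r\n|\r|\n/', $content) as $line) {
        $line = trim($line);
        if ($line === '' || $line[0] === '#') {
            continue; // skip blank lines and comments
        }
        // The first meaningful line decides: it must contain a colon
        // and its left-hand side must be a known directive name.
        $colon = strpos($line, ':');
        if ($colon === false) {
            return false;
        }
        $name = strtolower(trim(substr($line, 0, $colon)));
        return in_array($name, $directives, true);
    }

    return false; // empty file: nothing to parse
}
```

With this check, the HTML page above would be rejected (its first line is `<!DOCTYPE ...>`, not a directive), while a file starting with `User-agent:` would pass.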

bopoda commented 7 years ago

Just a small remark: http://joxi.net/a2Xan7bC1Zvx5A If we trust Google's robots.txt Tester, Google parses those robots.txt files :)

LeMoussel commented 7 years ago

For my test site (http://3.test.cocon.se/robots.txt), Google doesn't parse this robots.txt. In Search Console's robots.txt Tester, I get a 404 error.


I agree with you if you copy the HTML code into the Tester directly.