chrisakroyd / robots-txt-parser

A lightweight robots.txt parser for Node.js with support for wildcards, caching and promises.
MIT License

Sitemap links are not returned for some websites (e.g. https://www.sainsburys.co.uk) #17

Open kvamsij opened 1 month ago

kvamsij commented 1 month ago

When you try to get the sitemaps for a website like https://www.sainsburys.co.uk, an empty array is returned. However, I checked https://www.sainsburys.co.uk/robots.txt, and the sitemap URL is present there.

So I did a little digging and found out the server was denying the request. The response was this.

```
https://www.sainsburys.co.uk/robots.txt

Access Denied

You don't have permission to access "http://www.sainsburys.co.uk/robots.txt" on this server.

Reference #18.878f7b5c.1723733159.22628e9

https://errors.edgesuite.net/18.878f7b5c.1723733159.22628e9
```

I can see that no headers are added when requesting the robots.txt URL. So I added the following headers in the get.concat call, and it worked for me:

```js
headers: {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
  'Accept-Language': 'en-US,en;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br'
}
```

I'll be happy to contribute, as it is a small change.

Regards,
Vamse.
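For anyone who wants to reproduce this outside the library, here is a minimal sketch of the same idea using Node's built-in `https` module rather than the library's own request code. The helper names (`buildHeaders`, `fetchRobotsTxt`, `extractSitemaps`) are hypothetical and not part of robots-txt-parser. `Accept-Encoding` is deliberately left out here, on the assumption that you do not want to decompress the body yourself, since the built-in `https` module does not do that automatically.

```javascript
const https = require('https');

// Browser-like headers taken from the report above. Accept-Encoding is omitted
// because Node's https module does not decompress response bodies on its own.
const BROWSER_HEADERS = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
  'Accept-Language': 'en-US,en;q=0.9',
};

// Hypothetical helper: merge the defaults with caller-supplied overrides.
function buildHeaders(overrides = {}) {
  return { ...BROWSER_HEADERS, ...overrides };
}

// Hypothetical helper: fetch a robots.txt URL and resolve with status and body.
function fetchRobotsTxt(url, overrides = {}) {
  return new Promise((resolve, reject) => {
    const req = https.get(url, { headers: buildHeaders(overrides) }, (res) => {
      const chunks = [];
      res.on('data', (chunk) => chunks.push(chunk));
      res.on('end', () => resolve({
        statusCode: res.statusCode,
        body: Buffer.concat(chunks).toString('utf8'),
      }));
    });
    req.on('error', reject);
  });
}

// Hypothetical helper: collect the URLs from "Sitemap:" lines in robots.txt text.
function extractSitemaps(body) {
  return body
    .split(/\r?\n/)
    .map((line) => line.match(/^\s*sitemap\s*:\s*(\S+)/i))
    .filter(Boolean)
    .map((match) => match[1]);
}
```

Calling `fetchRobotsTxt('https://www.sainsburys.co.uk/robots.txt')` and passing the resulting `body` to `extractSitemaps` should show whether the headers make a difference. Whether a given server accepts the request still depends on its bot-protection rules, so checking `statusCode` before parsing is worthwhile.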