chrisakroyd / robots-txt-parser

A lightweight robots.txt parser for Node.js with support for wildcards, caching and promises.

Bug (in `useRobotsFor()`?) causes `canCrawl()` to sometimes return incorrect result #5

Open · Trott opened this issue 2 years ago

Trott commented 2 years ago

In some situations, canCrawl() wrongly returns true no matter what the robots.txt rules say. Example:

const robotsParser = require('robots-txt-parser')
const robots = robotsParser()

const test = async () => {
  await robots.useRobotsFor('https://google.com/')

  console.log(await robots.canCrawl('https://google.com/search'))
}

(async () => {
  await test() // Logs true, the wrong answer
  await test() // Logs false, the right answer
})()

I haven't dug into what's causing this, but my guess is that the active robots.txt only gets set when the link is already in the cache, and not when it is freshly fetched, here: https://github.com/chrisakroyd/robots-txt-parser/blob/a510f1a265b6dfa5901d4400fdc77ae8dde152d6/src/robots.js#L74-L92
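
For illustration, here is a minimal sketch of the flow I suspect. Everything below is hypothetical: cache, activeRobots, and fetchRobots are stand-ins for whatever the real implementation uses, not the library's actual internals.

const cache = new Map()
let activeRobots = null

// Stand-in for fetching and parsing a robots.txt file.
async function fetchRobots(url) {
  const res = await fetch(new URL('/robots.txt', url))
  return res.text() // the real parser would turn this into a rules object
}

async function useRobotsFor(url) {
  const domain = new URL(url).hostname
  if (cache.has(domain)) {
    // Cache hit: the cached rules become the active ones...
    activeRobots = cache.get(domain)
    return
  }
  // ...but on a fresh fetch the result is only cached;
  // activeRobots is never updated, which is the suspected bug.
  const robots = await fetchRobots(url)
  cache.set(domain, robots)
  // A fix would also activate the freshly fetched rules:
  // activeRobots = robots
}

If that's what's happening, it would explain the behaviour above: the first test() call fetches and caches the rules but never makes them active, so canCrawl() falls back to allowing everything, while the second call hits the cache, activates the rules, and answers correctly.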