bda-research / node-crawler

Web Crawler/Spider for NodeJS + server-side jQuery ;-)
MIT License

queue method should be a promise #467

Closed CristianMR closed 7 months ago

CristianMR commented 8 months ago

When skipDuplicates is set to true, the 'drain' event can be emitted before the asynchronous check of whether a queued URI has already been seen has resolved. The fix is quite straightforward: have queue(uri) return a promise that resolves once the URI has been processed. That would let us await c.queue(uri) inside the callback function before calling done().

Crawler.prototype.queue = function queue(options) {
  var self = this;

  // Did you get a single object or string? Make it compatible.
  options = _.isArray(options) ? options : [options];

  options = _.flattenDeep(options);

  const promises = options.map((option) => {
    if (self.isIllegal(option)) {
      log('warn', 'Illegal queue option: ', JSON.stringify(option));
      // Illegal options map to undefined, which Promise.all treats
      // as an already-resolved value.
      return;
    }
    return self._pushToQueue(
      _.isString(option) ? { uri: option } : option
    );
  });

  // Resolves once every option has been pushed to the queue.
  return Promise.all(promises);
};
Crawler.prototype._pushToQueue = function _pushToQueue(options) {
  var self = this;
  // ...
  // Just return the seen-check promise so queue() can await it.
  return self.seen.exists(options, options.seenreq).then(rst => {
    if (!rst) {
      // Not seen before: schedule the request.
      self._schedule(options);
    }
  }).catch(e => log('error', e));
};
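
For illustration, here is a minimal sketch of how the proposed promise-returning queue could be used. The crawler options, callback signature, and 'drain' event follow node-crawler's existing API; awaiting c.queue() is the behavior this patch would add, and the URLs are placeholders.

const Crawler = require('crawler');

const c = new Crawler({
  skipDuplicates: true,
  callback: async (error, res, done) => {
    if (error) {
      console.error(error);
      return done();
    }
    // With the proposed change, wait until the new URI has passed the
    // seen-check before signalling completion, so 'drain' cannot fire
    // before the queue actually knows about the URI.
    await c.queue('http://example.com/next');
    done();
  },
});

c.on('drain', () => console.log('all queued URIs processed'));
c.queue('http://example.com/');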
mike442144 commented 7 months ago

Look, I totally understand, but it would break current API usage. Also, if one does not await queue, an 'unhandled promise' warning would always be there. What's worse, it would make the API confusing to provide a promise and a callback at the same time. To be honest, it is better to deduplicate outside the crawler, which means it should be handled by the developer. That keeps the flexibility and consistency. Hope this helps.
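
For anyone who hits the same problem, here is a minimal sketch of the deduplicate-outside-the-crawler approach suggested above, using a plain Set. The seen set and the enqueue helper are illustrative and not part of the node-crawler API:

const Crawler = require('crawler');

const seen = new Set();
const c = new Crawler({
  callback: (error, res, done) => {
    // ... handle the response ...
    done();
  },
});

// Hypothetical helper: drop duplicates before they ever reach the
// crawler, so skipDuplicates (and its early 'drain' issue) is not needed.
function enqueue(uri) {
  if (seen.has(uri)) return;
  seen.add(uri);
  c.queue(uri);
}

enqueue('http://example.com/');
enqueue('http://example.com/'); // ignored: already seen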

CristianMR commented 7 months ago

Thanks for your answer, Mike. I already did it that way. It took me some hours to track down this issue, so others will probably run into it too. Have a nice year btw ✨

mike442144 commented 7 months ago

Sorry to hear that you spent hours on this issue; let's keep the details here to help others. Thanks, and the same to you, have a nice year.