bda-research / node-crawler

Web Crawler/Spider for NodeJS + server-side jQuery ;-)
MIT License

Pluggable cache #33

Open sylvinus opened 11 years ago

sylvinus commented 11 years ago

Should be very easy to plug in a MongoDB / memcached cache

jksdua commented 11 years ago

Hey mate,

Are you by any chance working on this? I need this AWESOME library for a work gig where I need resume support.

l0co commented 5 years ago

For me there's also a lack of a decent cache in this lib, which would be especially useful during crawler development, where we make multiple requests to the same URLs while implementing DOM parsing functions.

An easy implementation for the cache could be to allow passing a custom request object instead of the default request = require('request'). If this were pluggable, we could use cached-request instead of the default one.
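For illustration, a minimal sketch of that idea. The `request` option on the crawler below is hypothetical (no such injection point exists today); the cached-request calls follow that library's README as far as I can tell:

```js
const request = require('request');
// Wrap the stock request module with a caching layer (cached-request).
const cachedRequest = require('cached-request')(request);
cachedRequest.setCacheDirectory('/tmp/crawler-cache');

const Crawler = require('crawler');
const crawler = new Crawler({
  // Hypothetical option: let the crawler issue requests through any
  // request-compatible function instead of the built-in require('request').
  request: (options, callback) =>
    cachedRequest({ ...options, ttl: 3600 * 1000 }, callback),
  callback: (error, res, done) => {
    if (!error) console.log(res.$('title').text()); // server-side jQuery
    done();
  },
});

crawler.queue('https://example.com/');
```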

mike442144 commented 5 years ago

Good point! I keep thinking about this: how can I debug easily during development without requesting the target site again and again? When I dive deep into the code, I find it is complicated. Some of the problems are below:

  1. In what format should the request and response be stored?
  2. Where to store it: database or file system?
  3. When should the cached data be cleaned up?
  4. What should be used as the identity of a request?
  5. What if the URL contains a timestamp or a similar query string?
  6. How to turn the cache on/off gracefully?
  7. How to manage the options for the cache?
  8. How to structure the code to keep it clean and simple?

Do you have any ideas about these problems?

l0co commented 5 years ago

Well, firstly I think these problems shouldn't be directly addressed by the crawler. When I think about each point on your list, and the options that would be nice to have for each of them, my first thought is that this would make the lib twice as big and twice as complicated.

In other words, every specialized caching lib will do this job better than any custom implementation made here, in the same way that the crawler itself is a better choice for any app that needs crawling than a custom implementation :)

Instead, I'd think of some simple layer that can be injected into the crawler to manage all these problems. My first idea was cached-request, because this lib provides the same request interface as the original Node.js request and is transparently replaceable, probably with one line of code. But if you want something more flexible, I'd think about providing a Cache interface (sorry, I think mostly in Java) which is injectable into the crawler and can have different implementations.
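To make this concrete, here's a rough sketch of such an interface; every name in it is made up for illustration, nothing like it exists in the crawler yet:

```js
// Hypothetical Cache interface (synchronous for brevity; a real one
// would likely be async via callbacks or promises).
class Cache {
  // Return the cached raw response for a request, or null on a miss.
  get(requestOptions) { return null; }
  // Store the raw response body for a request.
  set(requestOptions, responseBody) {}
}

// Default implementation: no caching at all, inherits the no-op methods.
class NoOpCache extends Cache {}
```

The crawler would call cache.get() before issuing a request and cache.set() after a successful response, without knowing anything about the storage behind it.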

So, if the crawler doesn't provide as many configurable options as someone requires, they can always implement this interface and replace the Cache object. Other developers could also build different implementations (MongoDB storage, filesystem storage, etc.) and publish them as separate libs.

With that done, I'd provide two implementations by default in the crawler lib. One would be a NoOpCache, which doesn't do any caching and is the default; the second would be a simple filesystem cache with minimal options (most of my crawler usages were one-time, maybe up-to-three-times crawling: get the data and forget about the code. I'd like to be able to use a cache there, but I wouldn't want to configure MongoDB for it, for example). For this simple implementation I'd choose the following (a rough sketch follows the list):

  1. In what format should the request and response be stored? Raw binary data as received from the server.
  2. Where to store it: database or file system? File system.
  3. When should the cached data be cleaned up? Never. Put it in /tmp instead.
  4. What should be used as the identity of a request? Check what cached-request does. I'd probably use MD5(URL + METHOD + POST BODY).
  5. What if the URL contains a timestamp or a similar query string? Same cache entry.
  6. How to turn the cache on/off gracefully? By choosing the cache implementation during crawler setup.
  7. How to manage the options for the cache? The crawler config should consume an already created and configured Cache object.
  8. How to structure the code to keep it clean and simple? In the way I've described above.
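Here is a minimal sketch of that filesystem implementation under the assumptions above, reusing the hypothetical Cache shape from my previous comment (only Node built-ins involved):

```js
const crypto = require('crypto');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Filesystem cache: raw bytes keyed by MD5(URL + METHOD + POST BODY),
// stored in the OS temp directory and never cleaned up.
class FsCache {
  constructor(dir = path.join(os.tmpdir(), 'crawler-cache')) {
    this.dir = dir;
    fs.mkdirSync(this.dir, { recursive: true });
  }

  key(options) {
    const id = [options.url, options.method || 'GET', options.body || ''].join('|');
    return crypto.createHash('md5').update(id).digest('hex');
  }

  get(options) {
    const file = path.join(this.dir, this.key(options));
    return fs.existsSync(file) ? fs.readFileSync(file) : null; // raw binary data
  }

  set(options, responseBody) {
    fs.writeFileSync(path.join(this.dir, this.key(options)), responseBody);
  }
}
```

One caveat: a timestamp in the query string changes the URL and therefore the MD5, so treating such requests as the same cache entry (point 5) would need the key function to strip volatile query parameters; I've left that out for brevity.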

Hopefully this will be helpful :)

mike442144 commented 5 years ago

Nice job! A few quick questions:

  1. A binary data file is not human-friendly, but it's fine
  2. Fine
  3. How about on Windows, where there's no /tmp?

I'm looking forward to your pull request :)

l0co commented 5 years ago
  3. In os.tmpdir()
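For reference, Node's built-in os module resolves the platform-specific temp directory:

```js
const os = require('os');
// /tmp on Linux, something like C:\Users\<name>\AppData\Local\Temp on Windows
console.log(os.tmpdir());
```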

Indeed, I do spend some time on open source contributions; I can add this one to my list :)

mike442144 commented 5 years ago

Aha, thank you so much!