sylvinus opened this issue 11 years ago
Hey mate,
Are you by any chance working on this? I need this AWESOME library for a work gig where I need resume support.
For me there's also a lack of a decent cache in this lib, which would be useful especially during crawler development, where we make multiple requests to the same URLs while implementing DOM parsing functions.
An easy implementation for the cache could be to allow building a custom request object instead of the default `request = require('request')`. If this were pluggable, we could use `cached-request` instead of the default one.
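For illustration, a sketch of what that swap could look like, assuming the crawler grew a hypothetical `request` option that accepts any request-compatible function (that option does not exist today; the `cached-request` calls follow its documented API):

```js
const request = require('request');
const cachedRequest = require('cached-request')(request);
cachedRequest.setCacheDirectory('/tmp/crawler-cache');

const Crawler = require('crawler');
const crawler = new Crawler({
  // Hypothetical injection point: not a real node-crawler option.
  request: (options, callback) =>
    cachedRequest({ ttl: 60 * 1000, ...options }, callback),
  callback: (error, res, done) => {
    if (!error) console.log(res.statusCode);
    done();
  }
});
crawler.queue('https://example.com');
```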
Good point! I've been thinking about this too: how can I debug easily during development without requesting the target site over and over? When I dove into the code, I found it complicated. Some of the problems are below:

Do you have any ideas about these problems?
Well, firstly, I think these problems shouldn't be directly addressed by the crawler. When I think about each point on your list, and the options that would be nice to have for each of them, my first thought is that this would make the lib twice as big and complicated.
In other words: every specialized caching lib will do this job better than any custom implementation made here, in the same way that the crawler will always be a better fit for any app that requires crawling than a custom implementation :)
I'd think, instead, of some simple layer that can be injected into the crawler to manage all these problems. My first idea was `cached-request`, because this lib provides the same interface as the original Node.js `request` and is transparently replaceable, probably with one line of code. But if you want something more flexible, I'd think about providing a `Cache` interface (sorry, I think mostly in Java) which is injectable into the crawler and can have different implementations.
So, if the crawler doesn't provide as many configurable options as someone requires, they can always implement this interface and replace the `Cache` object. Other developers could also build different implementations (mongodb storage, filesystem storage, etc.) and publish them as separate libs.
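For illustration, a duck-typed sketch of such a contract in JavaScript; everything here is hypothetical, nothing the crawler actually exposes today:

```js
// Illustrative only: a Cache contract the crawler could accept via injection.
// Any object implementing these two async methods would qualify.
class Cache {
  // Return the cached response for `key`, or null on a miss.
  async get(key) { throw new Error('not implemented'); }
  // Store `response` under `key`; `ttlMs` would be optional.
  async set(key, response, ttlMs) { throw new Error('not implemented'); }
}

// Hypothetical wiring: the crawler would consult the cache before issuing
// a real request and fill it afterwards, e.g.
// const crawler = new Crawler({ cache: new FilesystemCache() });
```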
Having this done, I'd provide two default implementations in the crawler lib. One would be a `NoOpCache`, which doesn't do any caching at all and is the default; the second would be a simple filesystem cache with minimal options (most of my crawler usages were one-time, maybe up to three-time crawls: get the data and forget about the code. I'd like to be able to use a cache, but I wouldn't want to configure mongodb just for that, for example). For this simple implementation I'd choose:

- storing the cached responses in `/tmp` instead of a configurable directory,
- expiring entries with a TTL, like `cached-request` does,
- and I'd probably use `MD5(URL + METHOD + POST BODY)` as the key of the `Cache` object.
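A minimal sketch of those two defaults, under the assumptions above (the class names, the `get`/`set` contract, and the options shape are all illustrative):

```js
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

class NoOpCache {
  async get() { return null; } // always a miss: caching disabled
  async set() {}               // store nothing
}

class FilesystemCache {
  constructor(dir = '/tmp/crawler-cache') {
    this.dir = dir;
    fs.mkdirSync(this.dir, { recursive: true });
  }
  // MD5(URL + METHOD + POST BODY), as suggested above
  key(options) {
    const raw = options.url + (options.method || 'GET') + (options.body || '');
    return crypto.createHash('md5').update(raw).digest('hex');
  }
  async get(options) {
    try {
      return JSON.parse(
        fs.readFileSync(path.join(this.dir, this.key(options)), 'utf8')
      );
    } catch (e) {
      return null; // treat missing or unreadable files as a cache miss
    }
  }
  async set(options, response) {
    fs.writeFileSync(
      path.join(this.dir, this.key(options)),
      JSON.stringify(response)
    );
  }
}
```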
Hopefully this will be helpful :)
Nice job! A few quick questions:

- Maybe use `os.tmpdir()` instead of a hardcoded `/tmp`?

I'm looking forward to your pull request :)
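For reference, `os.tmpdir()` resolves the platform-appropriate temp directory, so the hardcoded `/tmp` above could become:

```js
const os = require('os');
const path = require('path');

// Portable default: '/tmp' on Linux, '/var/folders/...' on macOS,
// something under '...\\AppData\\Local\\Temp' on Windows.
const cacheDir = path.join(os.tmpdir(), 'crawler-cache');
```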
Indeed, I spend some of my time doing open source contributions; I can add this to my list :)
Aha, thank you so much!
It should be very easy to plug in a mongodb / memcached cache.
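As a rough sketch of that idea, here is the same hypothetical `get`/`set` contract as above backed by MongoDB; the database and collection names are made up:

```js
const { MongoClient } = require('mongodb');

// Illustrative MongoDB-backed implementation of the sketched Cache contract.
class MongoCache {
  constructor(uri = 'mongodb://localhost:27017', dbName = 'crawler') {
    this.client = new MongoClient(uri);
    this.dbName = dbName;
  }
  async connect() {
    await this.client.connect(); // call once before get/set
    this.collection = this.client.db(this.dbName).collection('cache');
  }
  async get(key) {
    const doc = await this.collection.findOne({ _id: key });
    return doc ? doc.response : null;
  }
  async set(key, response) {
    await this.collection.updateOne(
      { _id: key },
      { $set: { response, createdAt: new Date() } },
      { upsert: true }
    );
  }
}

// Usage (hypothetical): const cache = new MongoCache(); await cache.connect();
```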