Norconex / collector-core

Collector-related code shared between different collector implementations
http://www.norconex.com/collectors/collector-core/
Apache License 2.0

Question - design review on custom IJobStatusStore implementations #14

Closed danizen closed 6 years ago

danizen commented 6 years ago

I wish to write a MongoJobStatusStore that saves the current status to a MongoDB collection, which it creates as necessary. This facilitates a crawler that runs on a different system from a web interface that shows crawler status. In future, I may also want to write an S3ObjectStatusStore so that the crawler can run on a spot requested AWS EC2 instance.

It looks to me like it is brittle, and not future-proof, to monkey-patch the JobSuiteConfig in an ICollectorLifeCycleListener. It will work as the code is presently implemented, but it doesn't provide much leverage for other changes.

The best way to customize the IJobStatusStore object is to implement a sub-class of HttpCollector, and possibly FileCollector if that becomes relevant. The implementation would:

Is this the right way to do things, or is it overkill? For example, faced with similar requirements to use MongoDB (or an RDBMS) for the data store and progress store, which way would you do it?

Not urgent - I'll stick with what I have for now.
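
For readers following along, the kind of pluggable status store discussed above can be sketched roughly as follows. This is a minimal illustration only: `StatusRecord`, `StatusStore`, and `InMemoryStatusStore` are hypothetical, simplified names, not JEF's actual `IJobStatus` / `IJobStatusStore` types, and a Mongo- or S3-backed store would replace the in-memory map with calls to the relevant client.

```java
import java.util.Date;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical, simplified stand-in for a job status (not JEF's IJobStatus). */
final class StatusRecord {
    final String jobId;
    final double progress;
    final String note;
    final Date lastActivity;

    StatusRecord(String jobId, double progress, String note, Date lastActivity) {
        this.jobId = jobId;
        this.progress = progress;
        this.note = note;
        this.lastActivity = lastActivity;
    }
}

/** Hypothetical store contract; the real IJobStatusStore defines its own methods. */
interface StatusStore {
    void write(String suiteName, StatusRecord status);
    StatusRecord read(String suiteName, String jobId);
    void remove(String suiteName, String jobId);
}

/** In-memory implementation; other backends would implement the same contract. */
final class InMemoryStatusStore implements StatusStore {
    private final Map<String, StatusRecord> records = new ConcurrentHashMap<>();

    // Statuses are addressed by suite name plus job id.
    private static String key(String suiteName, String jobId) {
        return suiteName + "/" + jobId;
    }

    @Override
    public void write(String suiteName, StatusRecord status) {
        records.put(key(suiteName, status.jobId), status);
    }

    @Override
    public StatusRecord read(String suiteName, String jobId) {
        return records.get(key(suiteName, jobId));
    }

    @Override
    public void remove(String suiteName, String jobId) {
        records.remove(key(suiteName, jobId));
    }
}
```

Keeping the contract this small means the crawler side only depends on the interface, so swapping the backend does not require touching the code that reports status.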

essiembre commented 6 years ago

Marking it as a feature request. In the meantime, you can also consider installing JEF Monitor and using its REST API (http://.../jefmon/suites/json) to pull statuses to a central location (like JEF Monitor is doing). You can also create a dedicated web service yourself that can report on status, using com.norconex.jef4.status.JobSuiteStatusSnapshot as the class which reads any job status.

danizen commented 6 years ago

Pascal, since submitting the pull request on the other issue, I've gone ahead and implemented a part of this based on an extension of the current MongoConnectionDetails that I call in my code ActiveMongoConnectionDetails. I don't think I'm up to a pull request on this one, but I am certainly up to actually implementing a working MongoJobStatusStore. Feel free to use the code in this gist in any way you need to. I've implemented it this weekend, so I think it is fair game.

We also bought a Canadian board game called Gaia and broke it out with the kids. Thought of you, since I'd been implementing this just before.

danizen commented 6 years ago

Updated gist to correct tests, implement an index on the Mongo collection, and implement support for status.getProperties()

danizen commented 6 years ago

I think I can write a service that uses jef to poll a number of file locations for statuses and places them in Mongo, but I may stop fighting the framework and just start using jef monitor as the monitoring application.

To do that, I'd have to update my collector life cycle listeners so that after stopping the collector, the Mongo collections are backed up, but that may be easier to do than what I've been trying to do.

Sigh.

danizen commented 6 years ago

I'll try something like this in a collector life cycle listener:

    if (otherStatusStore != null) {
        JobSuite jobSuite = collector.getJobSuite();
        for (ICrawlerConfig crawlerConfig : collector.getCollectorConfig().getCrawlerConfigs()) {
            IJobStatus jobStatus = jobSuite.getJobStatus(crawlerConfig.getId());
            // surround with try catch
            otherStatusStore.write(collector.getId(), jobStatus);
        }
    }
    // surround with try catch
    Thread.sleep(someInterval);

danizen commented 6 years ago

That worked:

In [2]: db.status.find_one()
Out[2]:
{'_id': ObjectId('5a6124dfa0eac007ac51a015'),
 'job_id': 'MedlinePlus Monitor',
 'last_activity': datetime.datetime(2018, 1, 18, 22, 51, 11, 24000),
 'note': None,
 'progress': 0.0,
 'resume_attempts': 0,
 'status_type': 'latest',
 'suite': 'MedlinePlus Test HTTP Collector'}

In [3]: db.status.find_one()
Out[3]:
{'_id': ObjectId('5a6124dfa0eac007ac51a015'),
 'job_id': 'MedlinePlus Monitor',
 'last_activity': datetime.datetime(2018, 1, 18, 22, 52, 11, 47000),
 'note': None,
 'progress': 0.0,
 'resume_attempts': 0,
 'start_time': datetime.datetime(2018, 1, 18, 22, 51, 10, 488000),
 'status_type': 'latest',
 'suite': 'MedlinePlus Test HTTP Collector'}

So now I have contributed back a MongoJobStatusStore. I can use it as it is today, and I can wait as long as needed for it to become easy to use.

danizen commented 6 years ago

@essiembre, I've updated to use IJobLifeCycleListener so that the listener also gets notified that a job is stopping, as I had some races otherwise. I've also made some changes to the MongoJobStatusStore implementation I shared so that it will also clear the stop_requested property.

I'll update my GitHub gist with more later on.
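
A minimal sketch of the polling-plus-stop idea, for anyone trying the same approach. `StatusMirror` and its members are hypothetical names and not part of JEF or collector-core; the source and sink are plain functional interfaces standing in for "read the current job statuses" and "write to the other store". The point it illustrates is the race mentioned above: on stop, polling is shut down and any in-flight pass is awaited before one final write, so the last state is not lost or overwritten.

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;
import java.util.function.Supplier;

/** Hypothetical helper: periodically copies job progress to another store. */
final class StatusMirror {
    private final Supplier<Map<String, Double>> source; // job id -> progress
    private final BiConsumer<String, Double> sink;      // e.g. a Mongo-backed writer
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    StatusMirror(Supplier<Map<String, Double>> source, BiConsumer<String, Double> sink) {
        this.source = source;
        this.sink = sink;
    }

    /** Copies every current status; one failing write does not block the rest. */
    void mirrorOnce() {
        for (Map.Entry<String, Double> e : source.get().entrySet()) {
            try {
                sink.accept(e.getKey(), e.getValue());
            } catch (RuntimeException ex) {
                System.err.println("Could not mirror status of " + e.getKey() + ": " + ex);
            }
        }
    }

    /** Starts polling at a fixed delay between passes. */
    void start(long intervalMillis) {
        scheduler.scheduleWithFixedDelay(
                this::mirrorOnce, 0, intervalMillis, TimeUnit.MILLISECONDS);
    }

    /** To be called when the job is stopping: end polling, then write the final state. */
    void stop() {
        scheduler.shutdown();
        try {
            // Wait for a pass that may still be running before the final write.
            scheduler.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        mirrorOnce();
    }
}
```

In a real listener, `start` would be called when the collector or job starts and `stop` from the stopping notification; the five-second wait is an arbitrary value chosen for the sketch.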

essiembre commented 6 years ago

Thanks for sharing. Would you say you've addressed your feature request then? Given that the number of ways status can be stored is infinite, and in the spirit of limiting dependencies, I would close this ticket unless there are still items you would like addressed.

essiembre commented 6 years ago

Oh.. and about that board game you mentioned, I never heard of it before, but I will look into it. Looks interesting!

danizen commented 6 years ago

Yes, I think it can be closed. If there were a feature that would be of interest to the norconex-jef package, it would be an S3StatusStore implementation. That would be more general purpose, given that almost all cloud storage implementations are S3-compatible, and it would allow running a norconex crawler on on-demand instances while monitoring the progress with jef-monitor (or similar) running on-premise.

S3 also follows the conventions of file system paths, so that the relationship between job suite and job follows the principle of least surprise.
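
As a rough illustration of that mapping, object keys could be derived from the suite and job names so that listing by prefix returns all jobs of a suite. The `status/` prefix, the `.json` suffix, and the `StatusKeys` class below are assumptions made up for this sketch; nothing in JEF or any S3 SDK defines them.

```java
/** Hypothetical key scheme for status objects in an S3-compatible bucket. */
final class StatusKeys {
    private StatusKeys() {
    }

    /** Replaces runs of characters that are awkward in object keys with "_". */
    static String safe(String name) {
        return name.trim().replaceAll("[^A-Za-z0-9._-]+", "_");
    }

    /** Prefix under which all job statuses of one suite would live. */
    static String suitePrefix(String suiteName) {
        return "status/" + safe(suiteName) + "/";
    }

    /** Key of a single job's status object, nested under its suite prefix. */
    static String jobKey(String suiteName, String jobId) {
        return suitePrefix(suiteName) + safe(jobId) + ".json";
    }
}
```

With the suite and job names from the earlier Mongo output, `StatusKeys.jobKey("MedlinePlus Test HTTP Collector", "MedlinePlus Monitor")` yields `status/MedlinePlus_Test_HTTP_Collector/MedlinePlus_Monitor.json`.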

Do you think I should file that as twin feature requests on jef and jef-monitor?

My need for Mongo is rather specific to my team's requirements.