Closed danizen closed 6 years ago
Marking it as a feature request. In the meantime, you can also consider installing JEF Monitor and using its REST API (http://.../jefmon/suites/json) to pull statuses to a central location (as JEF Monitor itself does). You can also create a dedicated web service yourself that reports on status: com.norconex.jef4.status.JobSuiteStatusSnapshot is the class that reads any job status.
Pascal, since submitting the pull request on the other issue, I've gone ahead and implemented a part of this based on an extension of the current MongoConnectionDetails, which I call ActiveMongoConnectionDetails in my code. I don't think I'm up to a pull request on this one, but I am certainly up to actually implementing a working MongoJobStatusStore. Feel free to use the code in this gist in any way you need to. I've implemented it this weekend, so I think it is fair game.
We also bought a Canadian board game called Gaia and broke it out with the kids. Thought of you, since I'd been implementing this just before.
Updated gist to correct tests, implement an index on the Mongo collection, and implement support for status.getProperties()
I think I can write a service that uses jef to poll a number of file locations for statuses and places them in Mongo, but I may stop fighting the framework and just start using jef monitor as the monitoring application.
To do that, I'd have to update my collector life cycle listeners so that the Mongo collections are backed up after the collector stops, but that may be easier than what I've been trying to do.
Sigh.
I'll try something like this in a collector life cycle listener:
    if (otherStatusStore != null) {
        JobSuite jobSuite = collector.getJobSuite();
        for (ICrawlerConfig crawlerConfig : collector.getCollectorConfig().getCrawlerConfigs()) {
            IJobStatus jobStatus = jobSuite.getJobStatus(crawlerConfig.getId());
            // surround with try/catch
            otherStatusStore.write(collector.getId(), jobStatus);
        }
    }
    // surround with try/catch
    Thread.sleep(someInterval);
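The same polling idea can be sketched in isolation as a scheduled task rather than a sleep loop. This is only a sketch under stated assumptions: `StatusSource`, `StatusSink`, and `StatusMirror` are hypothetical stand-ins, not JEF classes. In a real listener the source would be the `JobSuite` and the sink would be the other status store.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-ins (not part of JEF): where statuses come from and where they go.
interface StatusSource {
    Map<String, Object> statusOf(String jobId);
}

interface StatusSink {
    void write(String suiteName, Map<String, Object> status) throws Exception;
}

final class StatusMirror {
    private final String suiteName;
    private final List<String> jobIds;
    private final StatusSource source;
    private final StatusSink sink;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    StatusMirror(String suiteName, List<String> jobIds,
            StatusSource source, StatusSink sink) {
        this.suiteName = suiteName;
        this.jobIds = jobIds;
        this.source = source;
        this.sink = sink;
    }

    // Copy every job's current status once. A failure on one job is logged
    // and does not prevent the remaining jobs from being written.
    int copyOnce() {
        int written = 0;
        for (String jobId : jobIds) {
            try {
                sink.write(suiteName, source.statusOf(jobId));
                written++;
            } catch (Exception e) {
                System.err.println("Could not write status for " + jobId + ": " + e);
            }
        }
        return written;
    }

    // Start copying statuses with a fixed pause between runs.
    void start(long intervalMillis) {
        scheduler.scheduleWithFixedDelay(
                this::copyOnce, 0, intervalMillis, TimeUnit.MILLISECONDS);
    }

    // Stop polling, wait for any run in progress, then copy one last time
    // so the final status is not lost.
    void stop() {
        scheduler.shutdown();
        try {
            scheduler.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        copyOnce();
    }
}
```

Calling `start(60_000)` when the collector starts and `stop()` when it stops would give roughly one update per minute. The final copy in `stop()` is a design choice intended to make sure the last status lands in the sink.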
That worked:
In [2]: db.status.find_one()
Out[2]:
{'_id': ObjectId('5a6124dfa0eac007ac51a015'),
 'job_id': 'MedlinePlus Monitor',
 'last_activity': datetime.datetime(2018, 1, 18, 22, 51, 11, 24000),
 'note': None,
 'progress': 0.0,
 'resume_attempts': 0,
 'status_type': 'latest',
 'suite': 'MedlinePlus Test HTTP Collector'}
In [3]: db.status.find_one()
Out[3]:
{'_id': ObjectId('5a6124dfa0eac007ac51a015'),
 'job_id': 'MedlinePlus Monitor',
 'last_activity': datetime.datetime(2018, 1, 18, 22, 52, 11, 47000),
 'note': None,
 'progress': 0.0,
 'resume_attempts': 0,
 'start_time': datetime.datetime(2018, 1, 18, 22, 51, 10, 488000),
 'status_type': 'latest',
 'suite': 'MedlinePlus Test HTTP Collector'}
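For reference, a document of the shape shown above could be assembled roughly as follows. This is a sketch using only `java.util` and `java.time`: `StatusDocument` and its parameters are hypothetical names, and leaving out `start_time` when it is not yet known is an assumption that happens to reproduce the first output, not necessarily what the gist does. A real store would hand the resulting map to the MongoDB driver, which is not shown here.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds a field map mirroring the documents shown above.
final class StatusDocument {
    static Map<String, Object> of(String suite, String jobId, double progress,
            String note, int resumeAttempts, Instant startTime, Instant lastActivity) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("suite", suite);
        doc.put("job_id", jobId);
        doc.put("status_type", "latest");
        doc.put("progress", progress);
        doc.put("note", note);                    // stored even when null
        doc.put("resume_attempts", resumeAttempts);
        if (startTime != null) {
            doc.put("start_time", startTime);     // assumption: omitted until known
        }
        doc.put("last_activity", lastActivity);
        return doc;
    }
}
```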
So, now I have contributed back a MongoJobStatusStore, I can use it now, and can wait for it to be easy to use as long as needed.
@essiembre, I've updated to use IJobLifeCycleListener so that the listener also gets notified that a job is stopping, as I had some races otherwise. I've also made some changes to the MongoJobStatusStore implementation I shared so that it will also clear the stop_requested property.
I'll update my GitHub gist to provide more later on.
Thanks for sharing. Would you say you've addressed your feature request then? Given the number of ways status can be stored is infinite, and in the spirit of limiting dependencies, I would close this ticket unless there are still items you would like addressed?
Oh.. and about that board game you mentioned, I never heard of it before, but I will look into it. Looks interesting!
Yes, I think it can be closed. If there were a feature that would be of interest to the norconex-jef package, it would be an S3StatusStore implementation. That would be more general purpose, given that almost all cloud storage implementations are S3 compatible, and it would allow running a Norconex crawler on on-demand instances while monitoring its progress with jef-monitor (or similar) running on premises.
S3 also follows the conventions of file system paths, so that the relationship between job suite and job follows the principle of least surprise.
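To illustrate the path-like convention, here is a minimal sketch of how object keys for such a store might be derived. Everything in it is an assumption made for illustration: the `StatusKeys` class, the `status/` prefix, and the `.json` suffix do not exist in norconex-jef. Names are URL-encoded so that a `/` inside a suite or job name cannot introduce an extra path level.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical key layout: status/<suite>/<job>.json
final class StatusKeys {
    private static final String PREFIX = "status/";

    // Encode a name so it is safe to use as a single path segment.
    private static String segment(String name) {
        return URLEncoder.encode(name, StandardCharsets.UTF_8);
    }

    // Prefix under which all job statuses of one suite would live.
    static String suitePrefix(String suiteName) {
        return PREFIX + segment(suiteName) + "/";
    }

    // Full object key for one job's status within a suite.
    static String jobKey(String suiteName, String jobId) {
        return suitePrefix(suiteName) + segment(jobId) + ".json";
    }
}
```

With this layout, listing objects under `suitePrefix(...)` would return every job status of a suite, much like listing a directory.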
Do you think I should file that as twin feature requests on jef and jef-monitor?
My need for Mongo is rather specific to my team's requirements.
I wish to write a MongoJobStatusStore that saves the current status to a MongoDB collection, which it creates as necessary. This facilitates a crawler that runs on a different system from a web interface that shows crawler status. In future, I may also want to write an S3ObjectStatusStore so that the crawler can run on a spot-requested AWS EC2 instance.
It looks to me that it is brittle, and not future-proof, to monkey-patch the JobSuiteConfig in an ICollectorLifeCycleListener. It will work as the code is presently implemented, but it doesn't provide much leverage for other changes.
The best way to customize the IJobStatusStore object is to implement a sub-class of HttpCollector, and possibly FileCollector if that becomes relevant. The implementation would:
- override the createJobSuite method
- call the createJobSuite currently in AbstractCollector
- set the IJobStatusStore instance on the JobSuiteConfig
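The subclassing approach described above might look roughly like the sketch below. It deliberately uses hypothetical stand-in types (`StatusStore`, `SuiteConfig`, `Suite`, `BaseCollector`, `MongoStatusCollector`) rather than the real Norconex classes, whose exact signatures are not assumed here. It only shows the pattern: let the base class build the suite, then swap the status store on its configuration.

```java
// Hypothetical stand-ins that mimic the shape described above;
// these are NOT the real Norconex/JEF classes.
interface StatusStore {
    String name();
}

final class SuiteConfig {
    private StatusStore statusStore = () -> "file"; // default store
    StatusStore getStatusStore() { return statusStore; }
    void setStatusStore(StatusStore store) { this.statusStore = store; }
}

final class Suite {
    private final SuiteConfig config;
    Suite(SuiteConfig config) { this.config = config; }
    SuiteConfig getConfig() { return config; }
}

class BaseCollector {
    // Builds the suite with default settings, as a base collector would.
    protected Suite createJobSuite() {
        return new Suite(new SuiteConfig());
    }
}

class MongoStatusCollector extends BaseCollector {
    private final StatusStore store;

    MongoStatusCollector(StatusStore store) {
        this.store = store;
    }

    @Override
    protected Suite createJobSuite() {
        Suite suite = super.createJobSuite();     // reuse the base construction
        suite.getConfig().setStatusStore(store);  // then plug in the custom store
        return suite;
    }
}
```

Compared with patching the configuration from a listener, the override keeps the customization in one place and does not depend on when listeners are invoked.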
Is this the right way to do things, or overkill? For example, faced with similar requirements to use MongoDB (or an RDBMS) for the data store and progress store, which way would you do it?
Not urgent - I'll stick with what I have for now.