danizen closed this issue 4 years ago
You are welcome!
As you may have guessed, master now holds code for the next major release.
About Mongo + RDBMS: likely so. We are questioning whether to make these implementations separate from the main package (similar to committers) to reduce dependencies not needed by most users and also to encourage people to share their own implementations.
About snapshot/recovery... yes, we plan to keep having this. The current implementation does not require editing a file. There is a "stop file", but that's part of its internals. Launching the -a stop command should stop it. You can then perform a -a resume.
I think my entire way of integrating Norconex needs a rewrite as well. I think it will be better if, instead of mucking with the Collector store and status store, I use something like rclone to move this to something that gets collected to a central location. Is that what your company does on big jobs? You must have a way to centralize the status monitoring...
Thank you, BTW.
So, my plan is to change how I do this in future. As I indicated, in a Cloud world, positing a central RDBMS or Mongo is not optimal. I am being asked to scale up my solution, to see whether we can stop purchasing IBM Watson Discovery or LucidWorks Fusion (or another Insight Engine), and simply use my knowledge of crawling and indexing. It is somewhat unclear what will happen ... continuing with an "Insight Engine" will get more expensive.
My plan is to simply assign a persistent EBS volume per crawl, and use an Apache Camel daemon to watch for file events on the JEF status files and move them to S3. On the other side, another Camel process can pull them down and populate a local directory.
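To make the "watch and move" half of that plan concrete, here is a minimal sketch in Python rather than Camel. It polls a directory instead of reacting to file events, and it copies to a local directory as a stand-in for S3. The names sync_changed and copy_to are hypothetical, and nothing here assumes anything about the JEF status file layout beyond "files under a directory that change over time".

```python
import shutil
from pathlib import Path
from typing import Callable, Dict, List


def sync_changed(
    src: Path,
    seen: Dict[Path, int],
    upload: Callable[[Path, str], None],
) -> List[str]:
    """Upload every file under src whose mtime differs from the last pass.

    `seen` maps each file to the mtime (in ns) it had when last uploaded;
    the caller keeps it between passes. Returns the keys uploaded this pass.
    """
    uploaded = []
    for path in sorted(src.rglob("*")):
        if not path.is_file():
            continue
        mtime = path.stat().st_mtime_ns
        if seen.get(path) != mtime:
            # Key is the path relative to the watched root, with "/" separators.
            key = path.relative_to(src).as_posix()
            upload(path, key)
            seen[path] = mtime
            uploaded.append(key)
    return uploaded


def copy_to(dest_root: Path) -> Callable[[Path, str], None]:
    """Local stand-in for an S3 upload: copy the file to dest_root/key."""
    def upload(path: Path, key: str) -> None:
        target = dest_root / key
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
    return upload
```

A daemon would call sync_changed in a loop with a short sleep between passes, and would swap copy_to for a function that uploads to a bucket. The pull side could presumably be the mirror image: list remote keys and write them into a local directory.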
My other uses of Norconex Collectors were purpose-built, and it made sense to build my own distribution. Now, if I am to scale up, I will need to stay as generic as possible so that the Norconex Collectors web documentation can be as helpful as possible.
I notice a lot of difference between master and the 1.10 release tag. It looks like someone, perhaps Google (he speculates), is keeping you pretty busy. I have some questions about your roadmap:
Thanks - and congratulations on Google donating the Google Cloud Search committer.