Norconex / collector-core

Collector-related code shared between different collector implementations
http://www.norconex.com/collectors/collector-core/
Apache License 2.0

Question - development roadmap #29

Closed · danizen closed this issue 4 years ago

danizen commented 4 years ago

I notice a lot of differences between master and the 1.10 release tag. It looks like someone, perhaps Google (just speculating), is keeping you pretty busy. I have some questions about your roadmap:

Thanks - and congratulations on Google donating the Google Cloud Search committer.

essiembre commented 4 years ago

You are welcome!

As you may have guessed, master now holds code for the next major release.

About Mongo + RDBMS: likely so. We are debating whether to make these implementations separate from the main package (similar to committers), both to reduce dependencies most users do not need and to encourage people to share their own implementations.

About snapshot/recovery... yes, we plan to keep supporting this. The current implementation does not require editing a file. There is a "stop file", but that is part of its internals. Running the collector with the -a stop action should stop it gracefully; you can then perform a -a resume.
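For reference, a stop-then-resume cycle would look roughly like the snippet below. Only the -a stop and -a resume actions come from the comment above; the launch script name and the -c config flag are assumptions based on a typical collector distribution, so adjust them to your own installation.

```sh
# Ask the running collector to stop gracefully (it manages its own "stop file" internally).
./collector-http.sh -a stop -c my-crawl-config.xml

# Later, pick the crawl back up where it left off.
./collector-http.sh -a resume -c my-crawl-config.xml
```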

danizen commented 4 years ago

I think my entire way of integrating Norconex needs a rewrite as well. Instead of mucking with the Collector store and status store directly, it would be better to use something like rclone to ship those files somewhere that gets collected to a central location. Is that what your company does on big jobs? You must have a way to centralize status monitoring...

danizen commented 4 years ago

Thank you, BTW.

danizen commented 4 years ago

So, my plan is to change how I do this in the future. As I indicated, in a cloud world, positing a central RDBMS or Mongo is not optimal. I am being asked to scale up my solution, to see whether we can stop purchasing IBM Watson Discovery or LucidWorks Fusion (or another insight engine) and simply use my knowledge of crawling and indexing. It is somewhat unclear what will happen ... continuing with an "Insight Engine" will only get more expensive.

My plan is to simply assign a persistent EBS volume per crawl, and use an Apache Camel daemon to watch for file events on the JEF status files and move them to S3. On the other side, another Camel process can pull them down and populate a local directory.
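A minimal sketch of what those two Camel routes could look like, assuming Camel 3 with the camel-file, camel-aws2-s3, and camel-main modules on the classpath; the local JEF progress directory, bucket name, and region are placeholders, and credentials are taken from the default AWS provider chain. It is only an illustration of the idea described above, not a tested setup.

```java
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.main.Main;

public class JefStatusShipper extends RouteBuilder {

    @Override
    public void configure() {
        // Crawl side: watch the JEF status directory on the crawl's EBS volume and
        // push every new or changed file to S3. noop=true leaves the originals in
        // place so the collector can keep updating them.
        from("file:/data/crawl/progress?noop=true&idempotentKey=${file:name}-${file:modified}")
            .setHeader("CamelAwsS3Key", header("CamelFileName"))
            .to("aws2-s3://my-jef-status-bucket?region=us-east-1&useDefaultCredentialsProvider=true");

        // Monitoring side (would normally run as a separate Camel process on the
        // central host): pull the status files down into a local directory.
        from("aws2-s3://my-jef-status-bucket?region=us-east-1&useDefaultCredentialsProvider=true&deleteAfterRead=false")
            .setHeader("CamelFileName", header("CamelAwsS3Key"))
            .to("file:/data/monitoring/jef-status");
    }

    public static void main(String[] args) throws Exception {
        Main main = new Main();
        main.configure().addRoutesBuilder(new JefStatusShipper());
        main.run(args);
    }
}
```

In practice the two routes would live in separate daemons, one per crawl volume and one on the monitoring host, which is the split described above.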

My other uses of Norconex Collectors were purpose-built, and it made sense to build my own distribution. Now, if I am to scale up, I will need to stay as generic as possible so that the Norconex Collectors web documentation can be as helpful as possible.