Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystems into various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Administration GUI for collector-http crawler config #183

Open leonardsaers opened 8 years ago

leonardsaers commented 8 years ago

Are there any plans for creating an administration GUI for crawler configuration?

essiembre commented 8 years ago

Many plans, little time! ;-) Seriously, our internal wish list for our open-source offering is quite big but a crawler GUI is currently low on that list. I am marking this as a feature request.

leonardsaers commented 8 years ago

Yes, this is not a core feature. Maybe there are other open-source projects that could provide a GUI given a .xsd or .dtd file.

I found this project on GitHub which may solve part of the problem: https://github.com/davidmoten/xsd-forms

Maybe there are other projects as well that could be of interest here.

essiembre commented 8 years ago

You can give it a try and report the kind of success you get, but the reason this cannot be an all-purpose solution is that the XML definition for the collector is not static. We cannot release a one-size-fits-all XSD or DTD: people can add their own classes with their own custom configurable XML, and we want to keep that flexibility. There is also the support for Velocity directives, which would not work well with that approach in some cases (it would break any XML parser if the file has not been interpreted by Velocity first).
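To illustrate the Velocity point, here is a hypothetical config fragment (the element names and variable are made up for the example): the `#if`/`#end` directives are not XML, so the file is not well-formed until Velocity has processed it, which is why schema-driven form generators would choke on it.

```xml
<crawler id="example">
  #if ($environment == "production")
  <maxDepth>10</maxDepth>
  #else
  <maxDepth>2</maxDepth>
  #end
</crawler>
```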

We could look into changing how configuration is implemented, or maybe have each configurable class provide its own DTD or something like that, but that's not planned. We want to keep adding your own classes as simple as possible, with few requirements.

One day maybe... :-) But anything you find that can help in the meantime, please share.

leonardsaers commented 8 years ago

Creating a GUI that solves the entire configuration challenge in a usable way is of course a really big task. But providing a usable GUI that solves part of the configuration challenge may be possible by building on other projects. I may take a deeper look at it.

danizen commented 7 years ago

I can envision an application that solves this by having 2-3 tables:

This is all linked with JEF Monitor so that all crawls are integrated.

danizen commented 7 years ago

So, I've thought more about this, and I'm thinking that a GUI is not the right way to go, at least not initially. It would be better to do this as a microservice embedding collector-http. APIs would allow manipulating configurations, running crawls based on those configurations, and getting status. This could then be integrated into the admin section of a GUI that uses the crawl results. It also allows scaling, as multiple collectors can be distributed across multiple hosts by a front-end.
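A minimal sketch of what the service core behind those APIs might look like, assuming an in-memory store; the names `CrawlService`, `saveConfig`, `startCrawl`, `getStatus`, and `CrawlStatus` are all hypothetical, not part of any Norconex API, and a real service would expose them over HTTP and delegate to an embedded collector-http instance:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical core of the proposed microservice: stores crawler
 * configurations and tracks crawls started from them. All names here
 * are illustrative placeholders, not Norconex classes.
 */
public class CrawlService {

    public enum CrawlStatus { QUEUED, RUNNING, DONE }

    private final Map<String, String> configs = new ConcurrentHashMap<>();
    private final Map<String, CrawlStatus> crawls = new ConcurrentHashMap<>();

    /** Store a crawler configuration (e.g. its XML) and return its id. */
    public String saveConfig(String configXml) {
        String id = UUID.randomUUID().toString();
        configs.put(id, configXml);
        return id;
    }

    /** Start a crawl from a stored configuration; returns a crawl id. */
    public String startCrawl(String configId) {
        if (!configs.containsKey(configId)) {
            throw new IllegalArgumentException("Unknown config: " + configId);
        }
        String crawlId = UUID.randomUUID().toString();
        // A real implementation would launch an embedded collector-http
        // crawler here, possibly on another host chosen by a front-end.
        crawls.put(crawlId, CrawlStatus.QUEUED);
        return crawlId;
    }

    /** Report the current status of a crawl, or null if unknown. */
    public CrawlStatus getStatus(String crawlId) {
        return crawls.get(crawlId);
    }
}
```

Keeping the service stateless apart from this store is what would let a front-end fan crawls out to multiple hosts, as suggested above.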