DigitalPebble / behemoth

Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.

Elasticsearch module #50

Open lewismc opened 9 years ago

lewismc commented 9 years ago

Hi @jnioche I'm working on an ES search module as part of using Behemoth in an ongoing project. I'll send you a PR ASAP.

jnioche commented 9 years ago

Hi. You know about [https://github.com/DigitalPebble/behemoth-elasticsearch]? It is probably in need of an update but should be a good starting point.

lewismc commented 9 years ago

Nope I didn't even see this Julien. I've nearly finished this patch as well :( You want me to send a PR to add the elasticsearch module? Any reason you want to (or don't want to) have the ES module as part of the main codebase?

jnioche commented 9 years ago

I've added a link to [https://github.com/DigitalPebble/behemoth/wiki/Behemoth-Modules].

You want me to send a PR to add the elasticsearch module?

yep, would be the right place for it and the easiest way to compare your version with the existing one.

Any reason you want to (or don't want to) have the ES module as part of the main codebase?

I decided not to have an ever-expanding list of modules in Behemoth per se and to be as decoupled and modular as possible. This also serves as an example of how to build a new resource for Behemoth with the Maven pom etc...

BTW would be interesting to hear about your project and how Behemoth fits in. There's a page for use cases which you're welcome to contribute a short blurb to if you feel like it [https://github.com/DigitalPebble/behemoth/wiki/Users].

lewismc commented 9 years ago

I've added a link to [https://github.com/DigitalPebble/behemoth/wiki/Behemoth-Modules].

Great.

I decided not to have an ever-expanding list of modules in Behemoth per se and to be as decoupled and modular as possible. This also serves as an example of how to build a new resource for Behemoth with the Maven pom etc...

I just went through around 4 hours of debugging network issues and upgrading various dependencies in order to get that ElasticSearch component to work with the master Behemoth codebase. I am however able to persist data into the most recent release of ES now and want to push this into the codebase, so I will send you a PR. The issue I see here is that it is clear the Behemoth-elastic module is not being maintained as much (and not being released and/or synced with master) and therefore it is difficult to pick it up and hit the ground running. It is up to you, however I would make an argument to you that, as one of many users of Behemoth, it would be great to see the ES module make it into the core codebase. I very much take the point about how it can serve as an example module though.

BTW would be interesting to hear about your project and how Behemoth fits in. There's a page for use cases which you're welcome to contribute a short blurb to if you feel like it [https://github.com/DigitalPebble/behemoth/wiki/Users].

Yes I'll send you something right now. Please reply here with any thoughts on the above. Thanks Julien.

jnioche commented 9 years ago

I am however able to persist data into the most recent release of ES now and want to push this into the codebase, so I will send you a PR.

great

The issue I see here is that it is clear the Behemoth-elastic module is not being maintained as much (and not being released and/or synced with master)

it was kept separate also because it was less mature than the other components. It should be in sync with core - otherwise it would not compile at all. I take your point about having it released alongside the other modules though.

It is up to you, however I would make an argument to you that, as one of many users of Behemoth, it would be great to see the ES module make it into the core codebase.

the many users of Behemoth have been very quiet in the last couple of years ;-) ES and SOLR are the main tools for search; I also use ES a lot on my various projects so yes, it would make sense to have it in the main repo alongside the other components.

I'll look at your PR before moving the code. BTW do you leverage [https://github.com/elastic/elasticsearch-hadoop] at all?