
Carrot2: Text Clustering Algorithms and Applications
http://www.carrot2.org

Upgrade Nutch plugin to use the 3.x release [CARROT-443] #648

Closed by dweiss 13 years ago

dweiss commented 15 years ago

Related issue on Apache JIRA: https://issues.apache.org/jira/browse/NUTCH-673


Issue: CARROT-443 (migrated from JIRA), created by Stanisław Osiński (@stanislawosinski), 2 votes, resolved Jun 21 2011. Attachments: Clusterer.java, HitsClusterAdapter.java, TestClusterer.java.

dweiss commented 14 years ago

Comment by Stanisław Osiński (@stanislawosinski) (migrated from JIRA)

While we're waiting for Lucene 2.9.1 to come out, maybe we would be able to handle this for 3.1.1?

dweiss commented 14 years ago

Comment by Dawid Weiss (@dweiss) (migrated from JIRA)

Investigated the possibilities here.

Nutch still uses Lucene 2.9.x, whereas we use Lucene 3.0.0. Also, adding Carrot2 3.0+ to Nutch would require a bunch of other libraries, some of them heavy (Mahout, Google Collections, etc.). I don't know if the Nutch folks will appreciate this much.

What do you think – should we try, or leave Nutch on the 2.x line?

dweiss commented 14 years ago

Comment by Stanisław Osiński (@stanislawosinski) (migrated from JIRA)

I think the extra libraries wouldn't be more than 1 or 2 MB together, right? So the biggest problem seems to be Lucene. Maybe we could schedule this for the point when Lucene is upgraded in Nutch? After all, upgrading from 2.9.x to 3.0.0 is only a matter of fixing deprecations. I don't see a relevant issue in Nutch's JIRA, though.

dweiss commented 14 years ago

Comment by Dawid Weiss (@dweiss) (migrated from JIRA)

Older Lucene (2.9) is a show-stopper for this, unfortunately. There are API incompatibilities that cause exceptions at runtime. I'll file an issue with Nutch, perhaps they'll wish to upgrade and then we can proceed.

dweiss commented 14 years ago

Comment by Dawid Weiss (@dweiss) (migrated from JIRA)

Equivalent issue in Nutch: https://issues.apache.org/jira/browse/NUTCH-673

dweiss commented 14 years ago

Comment by Stanisław Osiński (@stanislawosinski) (migrated from JIRA)

We need to wait until Nutch upgrades to Lucene 3.0. Moving to 3.3.0 for the time being.

dweiss commented 14 years ago

Comment by Dawid Weiss (@dweiss) (migrated from JIRA)

Will upgrade after we release 3.4.0.

dweiss commented 13 years ago

Comment by Stanisław Osiński (@stanislawosinski) (migrated from JIRA)

Some rough-cut Nutch integration code for Carrot2 3.x that I once prepared for a client.

dweiss commented 13 years ago

Comment by Dawid Weiss (@dweiss) (migrated from JIRA)

Nutch doesn't come with a frontend anymore. The clustering plugin has been removed (and Solr exists, which can be used as the sink for Nutch's crawls).