fangfangli / cleartk

Automatically exported from code.google.com/p/cleartk

Update to opennlp tools 1.5.0 and opennlp maxent 3.0.0 #180

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
Can we update to the latest OpenNLP tools and maxent? Some reasons to do this:

* maxent 3.0.0 doesn't seem to depend on trove anymore, so we won't have to 
rebuild it to make our trove versions match.

* tools 1.5.0 allows segmenter, pos, etc. models to be loaded from an 
InputStream instead of just a File. This means we could actually support 
loading the models as resources from a jar file (I believe).

* maxent and tools are both available through maven, so we can stop hosting 
them ourselves: http://opennlp.sourceforge.net/README.html#maven
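
As a rough illustration of the InputStream point above: a resource packed inside a jar can be opened as a stream but not as a File, so a stream-based constructor is what makes jar-hosted models possible. The sketch below uses only the JDK. `ResourceModelLoader`, `ModelFactory`, and any resource path passed to it are hypothetical names, and the factory is a stand-in for whichever OpenNLP model constructor accepts an InputStream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper (not part of ClearTK or OpenNLP) for stream-based model loading. */
final class ResourceModelLoader {

  /** Stand-in for a model constructor that takes an InputStream. */
  interface ModelFactory<M> {
    M create(InputStream in) throws IOException;
  }

  /** Builds a model from an already opened stream and closes the stream afterwards. */
  static <M> M load(InputStream in, ModelFactory<M> factory) throws IOException {
    try (InputStream stream = in) {
      return factory.create(stream);
    }
  }

  /** Opens a classpath resource (which may live inside a jar) and builds a model from it. */
  static <M> M loadFromClasspath(String resourcePath, ModelFactory<M> factory)
      throws IOException {
    InputStream in = ResourceModelLoader.class.getResourceAsStream(resourcePath);
    if (in == null) {
      throw new IOException("Resource not found on classpath: " + resourcePath);
    }
    return load(in, factory);
  }

  /** Reads a stream fully into memory; handy as a trivial factory when testing. */
  static byte[] readAll(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    int n;
    while ((n = in.read(buffer)) != -1) {
      out.write(buffer, 0, n);
    }
    return out.toByteArray();
  }
}
```

In real use the factory would presumably be a method reference to an OpenNLP model constructor; that binding is an assumption and has not been checked against the 1.5.0 release here.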

Thoughts?

Original issue reported on code.google.com by steven.b...@gmail.com on 5 Jan 2011 at 7:07

GoogleCodeExporter commented 9 years ago
Yes, you beat me to creating this ticket.  The license has also changed to the 
ASL, which is much better.

Long term, I still want to see work done on #40 and #41 to further reduce the 
amount of OpenNLP related code that we support.  

Also, I have a mind to submit a bug report about proper closing of files by 
RealValueFileEventStream (or whatever is not doing its job) to the new apache 
incarnation of the project.  

Re: trove - that needs to go anyway since it's LGPL.  I believe cleartk-ml has 
a dependency on it that is unrelated to the OpenNLP stuff.  I will file a 
separate issue.

Original comment by pvogren@gmail.com on 5 Jan 2011 at 5:32

GoogleCodeExporter commented 9 years ago
Re: trove - never mind, this was removed in issue #163.  Thanks Philipp!

Original comment by pvogren@gmail.com on 5 Jan 2011 at 6:21

GoogleCodeExporter commented 9 years ago
I realize that I was slightly confused about this issue before.  You may be 
aware that OpenNLP is now an Apache Incubator project, and I had assumed these 
version numbers were the ones used for the new incarnation.  Regardless, 
upgrading to the latest version should make it easier to migrate to the Apache 
version when it comes out.

Original comment by pvogren@gmail.com on 5 Jan 2011 at 10:58

GoogleCodeExporter commented 9 years ago
Yeah, I believe they have the source up at the apache incubator, but I don't 
think they've made a release there yet. They were only accepted on 24-Dec-2010, 
so presumably it'll be a little while before they have a release. In the 
meantime, I have to imagine that porting to the newest versions can only help 
when they finally produce an incubator release.

Original comment by steven.b...@gmail.com on 6 Jan 2011 at 12:14

GoogleCodeExporter commented 9 years ago
The APIs for OpenNLP have changed considerably from the previous version we 
were using.  They are actually much simpler now, so the upgrade should 
simplify our code as well.

I would like to propose that we consolidate OpenNLPTreebankParser and 
OpenNLPTaggerParser into a single class.  They seem very similar and have a lot 
of repeated code in them.  The only difference that I can tell is that one uses 
part-of-speech tags obtained from the CAS and the other lets the parser do the 
tagging.  I think it would be easy enough to provide a flag that allows for 
either option.  Does this make sense?
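
A minimal sketch of what such a flag could look like, using only the JDK. `ParserSketch`, `Tagger`, and `useTagsFromCas` are hypothetical names chosen for illustration, and plain string lists stand in for CAS annotations; this is not the actual code of either existing class.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: one annotator with a flag that selects the source of POS tags. */
final class ParserSketch {

  /** Stand-in for the tagging the parser would otherwise do itself. */
  interface Tagger {
    List<String> tag(List<String> tokens);
  }

  private final boolean useTagsFromCas;
  private final Tagger parserTagger;

  ParserSketch(boolean useTagsFromCas, Tagger parserTagger) {
    this.useTagsFromCas = useTagsFromCas;
    this.parserTagger = parserTagger;
  }

  /** Returns the tags to parse with: the supplied (CAS) tags, or freshly computed ones. */
  List<String> resolveTags(List<String> tokens, List<String> casTags) {
    if (useTagsFromCas) {
      if (casTags == null || casTags.size() != tokens.size()) {
        throw new IllegalArgumentException("Expected exactly one CAS tag per token");
      }
      return new ArrayList<>(casTags);
    }
    // Tags from the CAS, if any, are ignored in this mode.
    return parserTagger.tag(tokens);
  }
}
```

Everything after tag resolution would be shared, which is where the duplicated code between the two classes would collapse.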

Original comment by pvogren@gmail.com on 13 Jan 2011 at 10:39

GoogleCodeExporter commented 9 years ago

Original comment by pvogren@gmail.com on 13 Jan 2011 at 10:40

GoogleCodeExporter commented 9 years ago
Yes, a flag that allows either CAS POS tags or parser POS tags sounds great to 
me.

(And yeah, I also noticed that the new OpenNLP APIs make things *a lot* 
cleaner.)

Original comment by steven.b...@gmail.com on 13 Jan 2011 at 10:52

GoogleCodeExporter commented 9 years ago
Sounds reasonable to me.

Original comment by phwetz...@google.com on 14 Jan 2011 at 1:16

GoogleCodeExporter commented 9 years ago
OK - while refactoring the OpenNLP wrappers to work with the latest versions, 
I took the liberty of cleaning up the code, renaming every class, and 
repackaging some of the helper classes for the parser annotator.  For example, 
I renamed OpenNLPTreebankParser to ParserAnnotator.

I also fixed the broken "TokenRetriever" and "SentenceRetriever".  The code for 
both is now in InputTypesHelper, which actually allows you to use input 
sentence and token types from a different type system, as was originally 
intended.  I got a little confused by a generics issue that came up, so 
there's a hackish workaround which deserves its own issue.
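
For readers unfamiliar with the idea, here is a rough JDK-only sketch of a helper that hides the concrete token and sentence types behind type parameters. `TypesHelperSketch`, `WhitespaceTypesHelper`, and their methods are hypothetical and are not the signatures of the real InputTypesHelper; the sketch also does not reproduce the generics workaround mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a helper parameterized by token and sentence types. */
abstract class TypesHelperSketch<TOKEN, SENTENCE> {

  /** Returns the tokens covered by a sentence in the caller's type system. */
  abstract List<TOKEN> getTokens(SENTENCE sentence);

  /** Returns the text of a single token. */
  abstract String getText(TOKEN token);

  /** Type-system-independent code only goes through the two accessors above. */
  final List<String> tokenTexts(SENTENCE sentence) {
    List<String> texts = new ArrayList<>();
    for (TOKEN token : getTokens(sentence)) {
      texts.add(getText(token));
    }
    return texts;
  }
}

/** Binding for a made-up type system: a sentence is a String, tokens are its words. */
final class WhitespaceTypesHelper extends TypesHelperSketch<String, String> {

  @Override
  List<String> getTokens(String sentence) {
    return Arrays.asList(sentence.trim().split("\\s+"));
  }

  @Override
  String getText(String token) {
    return token;
  }
}
```

Supporting another type system would then mean writing one more small subclass rather than touching the annotator itself.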

I also built a new test model and updated the tests to work with that instead.  

fixed in r2319

Original comment by pvogren@gmail.com on 14 Jan 2011 at 7:07