ClearTK / cleartk

Machine learning components for Apache UIMA
http://cleartk.github.io/cleartk/

restructure ClearTK-toolkit into smaller sub-projects by dependencies and/or functionality #158

Closed bethard closed 9 years ago

bethard commented 9 years ago

Original issue 160 created by ClearTK on 2010-09-09T15:18:28.000Z:

Philip:

I would like to propose a pretty significant change to the way our code is organized. Here are my motivations:

In yesterday's email I gave a quick summary of how I would like to see the framework project split up (see below). I think this will go a long way towards easing people's fears of open source license entanglements as well as improve our ability to communicate what the implications are of using the various wrappers. This change seems like a slam dunk to me - so I won't spend time justifying why we need to do this unless objections are raised.

Similarly, I would like to see the ClearTK toolkit project split up into meaningful subprojects. I have a use case where I would like to use just the tokenizer and part-of-speech tagger in ClearTK. Should I really have to import all of ClearTK along with the mountain of dependencies to do this? No. Here's a sketch of how the projects might be split up:

-ClearTK-temporal

I think some of the code that is currently in the toolkit project should not survive the restructuring. The code in the ne package is crap and barely does anything, for example. All of the OpenNLP wrappers need to disappear or go into a separate project(s). Certainly, we could write our own sentence segmenter. The parser wrapper could be put in a subproject that depends on ClearTK-syntax. I have a wrapper for the Berkeley Parser - but it is GPL and needs to be a separate project that also depends on ClearTK-syntax. Jinho has a dependency parser that will almost certainly get a ClearTK/UIMA wrapper - if it is not refactored to actually use ClearTK. So, ClearTK-dependency-parsing would be a logical subproject.

So, there are a lot of details I've glossed over. But does the rough sketch make sense?

Philipp:

We talked about this before, so you know that I basically agree with you. I think splitting up the framework by dependencies (especially on the ML libraries) is the obvious thing to do. I'm not entirely sure how the toolkit should be split up. On the one hand, I think we also want to isolate major dependencies from each other where we can, but on the other hand it also makes sense to group things by task. I'm not entirely convinced by your proposed split below; for example, it would seem that the CoNLL'05 collection reader should then go into "corpora", but it could just as well go into "semantic roles", because it's so tightly linked to that task. And if we group collection readers in "corpora", why is there another set of collection readers in "util"?

As an alternative / modification, how about a "collection readers" package, which contains only the generic collection readers (like FileSystem), which are general purpose and don't depend on any specific types; separate "tokens", "semantic roles", "syntax", "temporal", ... packages for the separate tasks, each of which may contain type system additions and collection readers that are either specific to the task (e.g. CoNLL'05) or require the task's type system to function (e.g. Treebank in "syntax"), but these packages only contain ClearTK implementations; and then packages for each of the major dependencies (such as OpenNLP), which may contain components for multiple tasks (and thus may depend on the task-specific ClearTK packages).

Philip: There are two ways that dependencies can be a pain - license incompatibilities and version incompatibilities. If I download a package that depends on the universe, and all of its dependencies are out of date relative to the ones I have already added to my package, this is really annoying. But this is less of a show-stopper than having licensing issues with dependencies. I want it to be easy for people to download and use ClearTK without worrying about dependencies with other licenses. Code that depends on an LGPL library, for example, will have to be moved into its own library. However, if we simply structure sub-projects based on dependencies, we will end up with something rather unintuitive and difficult to navigate, I think. So, there will be some art to how we restructure everything and we will just have to discuss it when we get started on it.

bethard commented 9 years ago

Comment #1 originally posted by ClearTK on 2010-12-30T21:00:38.000Z:

I have made good progress on this issue this week. I have a first pass of the toolkit project reorg complete now. All the projects are in place, the code is compiling, and the tests are passing. I still have a laundry list of things I need to take care of before I merge this branch back into trunk. So, now is a good time to take a look at this branch and give me some feedback. There will still be opportunities to rearrange things after it's moved back to trunk too. I would like to do the merge and create a new ClearTK release next week.

bethard commented 9 years ago

Comment #2 originally posted by ClearTK on 2010-12-31T00:06:11.000Z:

Seems to be some problems still. Loading into a new workspace, I get some errors like:

Missing artifact edu.umass.cs.mallet:grmm-mod:jar:0.1.3:compile

Project 'cleartk-util' is missing required source folder: 'target/generated-sources/jcasgen'

I think the former may be an issue of not listing the cleartk repository? The latter goes away if I add the specified folder, but probably we should do something so it works out of the box.

Aside from that, everything looks basically okay to me.

One request though: could you change cleartk-temporal to cleartk-timeml? That package now includes stuff for both events and temporal relations so "temporal" is kind of a misnomer. I plan to reorganize those classes a bit when you're done with the reorg - all the Event* classes should be in a package org.cleartk.timeml.event, and all the VerbClause* classes should be in a package org.cleartk.timeml.tlink.

bethard commented 9 years ago

Comment #3 originally posted by ClearTK on 2010-12-31T04:14:39.000Z:

I'm not sure how to make the "missing artifact" problem go away. Here is another way to get it to compile that may be helpful (from Marshall Schor):

Using m2eclipse Eclipse plugin for maven - by default it will "miss" these generated directories the first time you import a project as a Maven project. However, the recovery is simple, and only needs doing once: right click the project and select Maven -> update project configuration.

It would be nice if one didn't have to do it once - this is a pretty tricky detail that is easy to miss.

I can change the project name.

Also, it just occurred to me that it might make sense to merge cleartk-test-util and cleartk-util. I can't think of any good reason why we should maintain them separately. I think every other project depends on both.

bethard commented 9 years ago

Comment #4 originally posted by ClearTK on 2010-12-31T05:00:55.000Z:

Update project configuration doesn't help. Again, I think the problem is that the pom for cleartk-ml-grmm is missing the cleartk repo declaration.

One reason to keep cleartk-test-util separate would be that as it is, we can declare it as a "test" dependency. If we merge the two, then we have to declare it as a "compile" dependency, and since it depends on "junit", we'll have to pull in "junit" as a "compile" dependency too.
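
To illustrate the point about scopes, here is a minimal sketch of how a sub-project's pom might declare cleartk-test-util as a "test" dependency. The groupId and the version expression are assumptions made for illustration only; they are not taken from the actual ClearTK poms.

```xml
<!-- Sketch only: groupId and version below are assumed, not copied from ClearTK. -->
<dependencies>
  <dependency>
    <groupId>org.cleartk</groupId>
    <artifactId>cleartk-test-util</artifactId>
    <version>${project.version}</version>
    <!-- "test" scope: available when compiling and running tests only -->
    <scope>test</scope>
  </dependency>
</dependencies>
```

With this scope, Maven puts cleartk-test-util and whatever it pulls in transitively (such as junit) on the test classpath only, so neither ends up as a compile-time dependency of the project that declares it.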

bethard commented 9 years ago

Comment #5 originally posted by ClearTK on 2010-12-31T21:19:56.000Z:

Ok. The merge is complete. All the code is compiling without warnings and the tests are running both from Maven on the command line and from within Eclipse. I renamed cleartk-temporal to cleartk-timeml but otherwise didn't make any additional changes to that project. I didn't make time to consider your point about merging cleartk-test-util with cleartk-util. More next week, er next year.

bethard commented 9 years ago

Comment #6 originally posted by ClearTK on 2011-01-05T18:31:44.000Z:

I am going to close this issue. The dangling issue with possibly merging cleartk-test-util and cleartk-util can be raised in another issue if we decide this is a useful thing to do (and it seems that it is not). Anything else related to the restructuring of the former project ClearTK-toolkit can be addressed in a separate issue.

bethard commented 9 years ago

Comment #7 originally posted by ClearTK on 2011-01-05T18:34:51.000Z:

Issue 54 has been merged into this issue.

bethard commented 9 years ago

Comment #8 originally posted by ClearTK on 2011-01-14T22:33:02.000Z:

<empty>