Packaging refactor - Githubissues

briantopping commented 9 years ago

Hiyas, I'd like to create a PR for packaging and wanted to see if it would be accepted before spending time on it. I'm still stuck on e0238ceb given epic-parser-en-span_2.11/2015.2.19 being incompatible with anything newer, so I could just as easily make the changes I need locally and remain forked.

What I'm after at the minimum is to create two modules, one for "core" and one for "tools". Goal here is to get Tika out of the core dependencies, used only in epic.preprocess.TextExtractor, which is really a command-line tool. I don't know how many other tools there are like this or what other effects it might have on the dependency closure, but I think it will be significant.

The reason I am even doing that is Tika depends on Apache POI and a kitchen sink of other detritus. POI has a split-package problem. Once everything is cleaned up, I should at least be able to make the core module into an OSGi bundle.

dlwh commented 9 years ago

That'd be great! On Jul 25, 2015 11:12 AM, "Brian Topping" notifications@github.com wrote:

Hiyas, I'd like to create a PR for packaging and wanted to see if it would be accepted before spending time on it. I'm still working from e0238ce https://github.com/dlwh/epic/commit/e0238ceb16fc9adb9511240638357e8c44200a2f given a lack of epic-parser-en-span_2.11 being regenerated since then, so I could just as easily make the changes I need locally.

What I'm after at the minimum is to create two modules, one for "core" and one for "tools". Goal here is to get Tika out of the core dependencies, used only in epic.preprocess.TextExtractor, which is really a command-line tool. I don't know how many other tools there are like this or what other effects it might have on the dependency closure, but I think it will be significant.

The reason I am even doing that is Tika depends on Apache POI and a kitchen sink of other detritus. POI has a split-package http://wiki.osgi.org/wiki/Split_Packages problem. Once everything is cleaned up, I should at least be able to make the core module into an OSGi bundle.

— Reply to this email directly or view it on GitHub https://github.com/dlwh/epic/issues/38.

briantopping commented 9 years ago

Awesome! Any chance I could twist your arm to schedule a rebuild of the parsers?

dlwh commented 9 years ago

The current parser (2015.2.19) works against master. On Jul 25, 2015 12:18 PM, "Brian Topping" notifications@github.com wrote:

Awesome! Any chance I could twist your arm to schedule a rebuild of the parsers?

— Reply to this email directly or view it on GitHub https://github.com/dlwh/epic/issues/38#issuecomment-124874728.

briantopping commented 9 years ago

Ohhh, interesting, ok I'll try it again, thanks

dlwh commented 9 years ago

I'd prefer anything with "Train" or "Pipeline" be left in core, along with anything in corpora. Basically I consider "training" code to be core.

Actually, ideally, I'd rather set it up in such a way that all of the main classes are in core, and Tika is an optional dependency that enables content extraction from things other than plain text.

That seems like the best of all worlds?

On Sat, Jul 25, 2015 at 1:11 PM, Brian Topping notifications@github.com wrote:

I think these are all most of the main methods. Are there any in this list that shouldn't move to 'epic-tools'?

epic.corpora  (2 usages found)
    CONLLSequenceReader.scala  (1 usage found)
        (70: 3) def main(args: Array[String]) {
    MascUtil.scala  (1 usage found)
        (31: 3) def main(args: Array[String]) {
epic.features  (1 usage found)
    HackyHeadFinderTest.scala  (1 usage found)
        (15: 3) def main(args: Array[String]) {
epic.parser  (2 usages found)
    ParserPipeline.scala  (1 usage found)
        (102: 3) def main(args: Array[String]) {
    ParserTester.scala  (1 usage found)
        (45: 3) def main(args: Array[String]) {
epic.parser.kbest  (1 usage found)
    KBestParseTreebank.scala  (1 usage found)
        (34: 3) def main(args: Array[String]) = {
epic.parser.models  (1 usage found)
    ParserTrainer.scala  (1 usage found)
        (174: 3) def main(args: Array[String]):Unit = {
epic.parser.projections  (2 usages found)
    ConstraintAnchoring.scala  (1 usage found)
        (376: 3) def main(args: Array[String]) {
    OracleParser.scala  (1 usage found)
        (185: 3) def main(args: Array[String]) {
epic.preprocess  (5 usages found)
    MLSentenceSegmenter.scala  (1 usage found)
        (371: 3) def main(args: Array[String]):Unit = {
    SentenceSegmenter.scala  (1 usage found)
        (24: 3) def main(args: Array[String]):Unit = {
    StreamSentenceSegmenter.scala  (1 usage found)
        (53: 3) def main(args: Array[String]) {
    Textify.scala  (1 usage found)
        (12: 3) def main(args: Array[String]): Unit= {
    TreebankTokenizer.scala  (1 usage found)
        (62: 3) def main(args: Array[String]) = {
epic.sentiment  (2 usages found)
    SentimentEvaluator.scala  (1 usage found)
        (129: 3) def main(args: Array[String]) {
    SentimentTreebankPipeline.scala  (1 usage found)
        (38: 3) def main(args: Array[String]):Unit = {
epic.sequences  (4 usages found)
    SemiNERPipeline.scala  (2 usages found)
        (34: 3) def main(args: Array[String]) {
        (124: 3) def main(args: Array[String]) {
    TrainPosTagger.scala  (2 usages found)
        (17: 3) def main(args: Array[String]) {
        (42: 3) def main(args: Array[String]) {
epic.trees  (1 usage found)
    TraceToSlashCategoryConverter.scala  (1 usage found)
        (62: 3) def main(args: Array[String]):Unit = {
epic.trees.util  (1 usage found)
    FilterTreesByLength.scala  (1 usage found)
        (17: 3) def main(args: Array[String]) = {
epic.util  (1 usage found)
    ProcessTextMain.scala  (1 usage found)
        (29: 3) def main(args: Array[String]) = {

— Reply to this email directly or view it on GitHub https://github.com/dlwh/epic/issues/38#issuecomment-124883064.

briantopping commented 9 years ago

Ok, I'm pretty far along with this already, so let me take a look at it. I think you'd be surprised that the workflow is not disrupted by these kinds of changes, if you are importing via sbt, you'll get both projects as a transitive. If you are working in the IDE, you can still just right-click on the main method and run it, same results. Both projects will load into the IDE transparently. Maybe you could try it and give specific feedback?

dlwh commented 9 years ago

I mostly run as part of an assembly jar. So as long as there's a straightforward way to build an assembly that will be ok.

In addition, the train and pipeline main methods being in the core jar is important. Everything else is negotiable.

Thanks!

-- David

On Sat, Jul 25, 2015 at 2:22 PM, Brian Topping notifications@github.com wrote:

Ok, I'm pretty far along with this already, so let me take a look at it. I think you'd be surprised that the workflow is not disrupted by these kinds of changes, if you are importing via sbt, you'll get both projects as a transitive. If you are working in the IDE, you can still just right-click on the main method and run it, same results. Both projects will load into the IDE transparently. Maybe you could try it and give specific feedback?

— Reply to this email directly or view it on GitHub https://github.com/dlwh/epic/issues/38#issuecomment-124896814.

briantopping commented 9 years ago

Ok thanks for the feedback.

Part of decomposing modules into a DAG of modules means users who are trying to work with the code can get a faster map of what they are doing and where to focus their energy (it's a lot easier to dismiss whole swaths of code inaccessible to a given module). That said, it's never a good idea to create multiple modules from code that is always used together. Developers have to make arbitrary choices where code should go and it becomes a PITA.

I don't imagine generating models in production but building them in CI and deploying them wholesale in an upgrade process. Are you training in production? I'm starting to believe "tools" should be called "training", since that's what it seems most of the command line tools are oriented toward. But if all the training functionality needs to be a part of core... ?

briantopping commented 9 years ago

Okay, been nonstop on this, was useful to learn the code a bit more. Generated three separate trees, the last was as you suggested just kind of punted on the refactor. For the second one, modules were broken out as preprocess->client->core. Core started with everything in the existing monolith with elements gradually lifted into higher modules with the intent of being able to abstract out code that isn't necessary to understand the lower levels.

There's some issues that are common for a code base like this, for instance JavaSentenceSegmenter being used by SlabTest, TreebankTokenizer being used by Tree. Both of these are more arguably more client UI oriented than core functionality, but the dependencies in tests and core code mean they can't be lifted. Runtime injection helps a lot here as I see you found with the ServiceLoader, but that becomes very clumsy very quickly without a global DI framework or use of implicits (both of which bring their own headaches). I've checked both of them in on my repo for reference if you are curious.

The other option is API changes, but that's usually not well received without discussion, if even then.

In any event the PR that's generated has the build bits that will make it easier to start extracting modules when there's a better idea where to go with this.

dlwh commented 9 years ago

awesome thanks! I'll look now.

I'd be curious to hear your recommendations about API refactoring.

It's important to me that "core" always have the training code and everything needed for it. It helps researchers (esp. my labmates) get up and running quickly. And, especially because it doesn't add too much dependency weight, there's relatively little harm.

Tika is a beast and it definitely needs to split out as much as possible.

briantopping commented 9 years ago

Hey sorry for the delay to get back to you. This week has been a blur.

Totally understand what you're saying about the dependencies. Just curious though, is your crew using a build tool that deals with transitive dependencies well?

If so, there shouldn't ever be a problem that a user needs to include a dependency in their build except one they have direct client linkages to. The build tool will work out the classpath such that the indirect dependencies are included automatically.

In exchange for needing to know which library to link to initially, there's some positive tradeoffs:

The user isn't overwhelmed by a huge API, just the stuff they need to work with for the task at hand. This pays dividends because it's much easier to read the APIs for the more focused library and have a better chance that it was worth reading (instead of finding everything under the sun and giving up).
Releases are more focused. For instance, if the Parser is a separate module, the models can be bound to the parser. The build tool figures out the best closure for the parser to be both compatible and latest release, not require the user to sort that out.
More focused releases can lead to better testing. There's just less to test.
Iterations on modules show people with version numbers where the action in the code is.

As an occasional project owner, the reason I like modules is it keeps developers from making weird dependencies that are hard to unwind. It might be very useful that a debugging library knows how to print a tree (even though the tree printer imports a quarter of the universe), but it's far better if that's left to the deployer. It's the same problem as Tika dependencies in microcosm.

I totally appreciate such a refactor is much easier to say "yes" to after it's done, but I can almost assure you that you'll choke on your lunch if you saw one of these PRs. It's just really a leap of faith to approve it and a bigger deal for everyone to update their builds. But if you plan on continuing to improve the library (and I see that with the recent Neural commits), the sooner you start on this path, the better.

What can I do to help?

dlwh commented 9 years ago

hey, first, publish-signed went away and it doesn't ask for my p/w for my key anymore. what can i do?

briantopping commented 9 years ago

Oof, just got this. let me get on it. How much time do I have?

On Jul 29, 2015, at 8:57 PM, David Hall notifications@github.com wrote:

hey, first, publish-signed went away and it doesn't ask for my p/w for my key anymore. what can i do?

— Reply to this email directly or view it on GitHub https://github.com/dlwh/epic/issues/38#issuecomment-126167551.

dlwh commented 9 years ago

I mean, you replied in 10 min, that's pretty good. :)

I just figured it out

On Wed, Jul 29, 2015 at 8:09 PM, Brian Topping notifications@github.com wrote:

Oof, just got this. let me get on it. How much time do I have?

On Jul 29, 2015, at 8:57 PM, David Hall notifications@github.com wrote:

hey, first, publish-signed went away and it doesn't ask for my p/w for my key anymore. what can i do?

— Reply to this email directly or view it on GitHub < https://github.com/dlwh/epic/issues/38#issuecomment-126167551>.

— Reply to this email directly or view it on GitHub https://github.com/dlwh/epic/issues/38#issuecomment-126168630.

dlwh commented 9 years ago

(I just added the sbt-pgp plugin again)

On Wed, Jul 29, 2015 at 8:10 PM, David Hall david.lw.hall@gmail.com wrote:

I mean, you replied in 10 min, that's pretty good. :)

I just figured it out

On Wed, Jul 29, 2015 at 8:09 PM, Brian Topping notifications@github.com wrote:

Oof, just got this. let me get on it. How much time do I have?

On Jul 29, 2015, at 8:57 PM, David Hall notifications@github.com wrote:

hey, first, publish-signed went away and it doesn't ask for my p/w for my key anymore. what can i do?

— Reply to this email directly or view it on GitHub < https://github.com/dlwh/epic/issues/38#issuecomment-126167551>.

— Reply to this email directly or view it on GitHub https://github.com/dlwh/epic/issues/38#issuecomment-126168630.

briantopping commented 9 years ago

Whew! I was away from the computer for a while.

In the future, if you have an iPhone, you can reach me via messages on my email address...

dlwh commented 9 years ago

i'm one of those pesky android people :)

On Wed, Jul 29, 2015 at 8:12 PM, Brian Topping notifications@github.com wrote:

Whew! I was away from the computer for a while.

In the future, if you have an iPhone, you can reach me via messages on my email address...

— Reply to this email directly or view it on GitHub https://github.com/dlwh/epic/issues/38#issuecomment-126168835.

briantopping commented 9 years ago

emailed...

briantopping commented 9 years ago

Should the epic-parser-en-span_2.11.2015.7.29-SNAPSHOT work with master branch? I'm getting the following exception:

  Cause: java.io.InvalidClassException: epic.util.NotProvided$; local class incompatible: stream classdesc serialVersionUID = -649101350749082174, local class serialVersionUID = -4684693011277973677
  at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:621)
  at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1623)
  at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)

briantopping commented 9 years ago

Meh, my bad. sbt publish-m2 didn't do the right thing the first time I ran it. All clear. Thanks for publishing the model snapshots!!

dlwh commented 9 years ago

you're welcome!

dlwh commented 9 years ago

On Wed, Jul 29, 2015 at 12:09 PM, Brian Topping notifications@github.com wrote:

Hey sorry for the delay to get back to you. This week has been a blur.

Totally understand what you're saying about the dependencies. Just curious though, is your crew using a build tool that deals with transitive dependencies well?

If so, there shouldn't ever be a problem that a user needs to include a dependency in their build except one they have direct client linkages to. The build tool will work out the classpath such that the indirect dependencies are included automatically.

Except they need to develop Epic, and so it's useful if everything they could need is in one project. Ideally one module. Research code bases (and epic is one) benefit from being able to quickly iterate, even if it degrades overall software engineering quality some.

In exchange for needing to know which library to link to initially, there's some positive tradeoffs:

The user isn't overwhelmed by a huge API, just the stuff they need to work with for the task at hand. This pays dividends because it's much easier to read the APIs for the more focused library and have a better chance that it was worth reading (instead of finding everything under the sun and giving up).

I'm not entirely unsympathetic to this, but isn't this more an argument for putting pure client stuff in a leaf project and everything needed for training and decoding (and researching) in the core upstream library?

Releases are more focused. For instance, if the Parser is a separate module, the models can be bound to the parser. The build tool figures out the best closure for the parser to be both compatible and latest release, not require the user to sort that out.

It's already the case that the parser model files depend on a particular version of epic? Sometimes (as with 2015.2.19 for a while) it works for a bit longer, but...

More focused releases can lead to better testing. There's just less to test.

I don't have much to say on this one.

Iterations on modules show people with version numbers where the action in the code is.

As an occasional project owner, the reason I like modules is it keeps developers from making weird dependencies that are hard to unwind. It might be very useful that a debugging library knows how to print a tree (even though the tree printer imports a quarter of the universe), but it's far better if that's left to the deployer. It's the same problem as Tika dependencies in microcosm.

Splitting out Tika was clearly a good idea and I had kind of wanted to do it anyway.

I go back and forth between splitting and merging. If you split too much it's hard to decide where things go and which modules are needed to just "parse stuff", or whatever. Breeze's ancestors used to be heavily split up, and it was annoying to deal with. (Where'd I put that file? Ah crap I need to move this test harness code into main because this other module needs to use it in its tests.) I now mostly segment things based on dependencies needed, which gives breeze-core, breeze-viz, and breeze-natives. The tika split works nicely with that (especially since epic doesn't have all that many dependencies besides Tika and whatever Breeze brings in.)

I totally appreciate such a refactor is much easier to say "yes" to after it's done, but I can almost assure you that you'll choke on your lunch if you saw one of these PRs. It's just really a leap of faith to approve it and a bigger deal for everyone to update their builds. But if you plan on continuing to improve the library (and I see that with the recent Neural commits), the sooner you start on this path, the better.

Anyway, I'm actually totally happy with the preprocess -> client -> core version of the pipeline, except that the corpus utils (MascUtil) need to be in core. (Clients don't need to deal with corpora like MASC directly...)

Clients can know that they just need to look at the classes in either preprocess or client. The researchers only have to look at one library. That seems ideal to me.

-- David

What can I do to help?

— Reply to this email directly or view it on GitHub https://github.com/dlwh/epic/issues/38#issuecomment-126061738.

dlwh / epic

Packaging refactor #38