dkpro / dkpro-tc

UIMA-based text classification framework built on top of DKPro Core and DKPro Lab.
https://dkpro.github.io/dkpro-tc/

CRFSuite: Creating standalone runnable jars with self-trained models #234

Closed daxenberger closed 9 years ago

daxenberger commented 9 years ago

Originally reported on Google Code with ID 234

Hi, I looked into the tasks necessary to build a standalone runnable jar from a trained
model.

Here are the main TODOs I see:

TODO:
Placing created files/resources within the project folder so that they are available when the uber-jar
is executed
The main effort is copying all resources to a location in the project folder (model-related,
but also feature-related resources)
At the moment it is unknown which parameters are "resource paths" (e.g. use resources)
and which are just constants

 + Introducing a convention for marking such resource parameters would be helpful to
automatically collect the resources

 + Adding a dedicated "uber-jar preparation Task" which does exactly this copying/preparing
before the jar is built by "mvn install" would be convenient

 Saving:
 + SaveModelCRFSuiteBatchTask allows a free choice of where the serialized model files
shall be stored
 How about enforcing (!) that these files are stored within the project? That would make loading
in the uber-jar scenario easier (e.g. everything has to be in the project (src/main/resources?))

 + parameters.txt
 This file contains the parameters of the features, including paths to needed resources
 These are only absolute paths that are scattered over the user's HDD
 The "lucenceDir" parameter, for instance, points to a folder relative to DKPRO_HOME
 Other resources might lie wherever the user placed them

 Loading:
 + The TcAnnotatorSequence class requires a parameter PARAM_TC_MODEL_LOCATION
 This parameter should point to a folder in the project that contains all resources

 + All resources have to be copied to temporary files before they can be used (e.g.
ClassLoader -> getResourceAsStream or something)
 + Loading routines have to be extended accordingly to support the copy operation
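
A minimal sketch of the copy step mentioned above, assuming the resource is reachable through a class loader. The class name `ResourceToTempFile` and the method `copyToTemp` are made up for illustration and are not part of TC:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not an existing TC class.
final class ResourceToTempFile {

    /**
     * Copies a classpath resource to a temporary file and returns its path.
     * The lookup goes through the class loader, so it behaves the same
     * whether the resource sits in a folder or inside a jar.
     */
    static Path copyToTemp(ClassLoader loader, String resourceName) {
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IllegalArgumentException(
                        "Resource not found on classpath: " + resourceName);
            }
            Path tmp = Files.createTempFile("tc-resource-", ".tmp");
            tmp.toFile().deleteOnExit();
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // java/lang/Object.class is used only because it exists on every JVM;
        // a real call would name a model or feature resource instead.
        Path copy = copyToTemp(ClassLoader.getSystemClassLoader(),
                "java/lang/Object.class");
        System.out.println("Copied to " + copy);
    }
}
```

The temporary file is only needed for tools that insist on a file path; code that accepts an `InputStream` can skip the copy.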

 Does anyone have thoughts or comments on this? Especially on "marking" resource paths and/or
adding conventions for where to store files so that they are loadable from the final jar.

Reported by Tobias.Horsmann on 2015-03-07 13:24:42

daxenberger commented 9 years ago
Why do we need an uber-jar?

Reported by richard.eckart on 2015-03-07 17:24:21

daxenberger commented 9 years ago
It would be nice to turn some things into an actual "product" and provide a runnable tool
(aka uber-jar) that just works if you give it some input.

We would like to release our own PoS tagger for social media in the future - without
expecting the user to be a programmer who knows Eclipse, TC and all that stuff.

It is a bit of work, but if you can release your research as a standalone runnable product
with all dependencies and stuff being taken care of - it would be quite an advantage
for TC.

Reported by Tobias.Horsmann on 2015-03-07 17:30:18

daxenberger commented 9 years ago
As a developer of DKPro Core, I'm interested in tools that can be integrated easily
into UIMA pipelines. Uber-JARs are notoriously problematic because they are highly
likely to conflict with other classes on the classpath. Thus, I'd be more interested
in something a step short of an uber-jar: a model and (if necessary) a "light" JAR that
I can wrap and integrate as a UIMA component - or that already is a DKPro-compatible
UIMA component and can be added to a pipeline directly.

I suppose based on that, I could always use the Maven shade plugin to create an uber-jar
if I wanted. We did that already with DKPro Core pipelines. Alternatively, I could
build a Groovy script that downloads the stuff from repositories.

Reported by richard.eckart on 2015-03-07 17:35:13

daxenberger commented 9 years ago
Yes, we already have a model-loading pipeline feature for created model files. If you
create an uber-jar from such a pipeline that loads a model, it rains "file-path not
found" exceptions because all paths are absolute, which is the key problem if you try
to use them from within the jar file.

I have only used uber-jars so far; I thought they were the only way to make something
"runnable" outside of Eclipse.
Including the resources by downloading them is a possibility, but this sounds like
more effort to me. I thought about something where you provide all the stuff once, prepare
things for wrapping into the uber-jar and then are done.
I am not sure how severe the dependency clash problems for uber-jars are.

Reported by Tobias.Horsmann on 2015-03-07 18:01:04

daxenberger commented 9 years ago
Uber-jar clashes are hell. E.g. we could not make use of the TWSI library that was provided
as an uber-jar in any pipeline together with the Stanford tools, because TWSI included
a copy of the Stanford classes.

Uber-jars are nice for stand-alone applications - but they are completely unusable
within larger pipelines.

Of course models should not use absolute paths, at least not absolute paths with respect
to the file system. You can use absolute paths within the classpath - that is what
DKPro Core is doing all the time. It just needs to be made sure that a proper package
structure is used - i.e. that models are not simply stored e.g. as "models/en.bin"
in a JAR because that will also cause clashes.
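
To illustrate the point about package structure, here is a rough sketch that builds an absolute classpath location from a package name instead of a flat "models/en.bin". The package `org.example.tc.model`, the naming pattern and the class `ModelLocations` are assumptions made for this example only:

```java
import java.net.URL;

// Hypothetical helper; the naming scheme is an assumption, not a TC convention.
final class ModelLocations {

    /**
     * Builds an absolute classpath location such as
     * /org/example/tc/model/tagger-en-default.bin so that models from
     * different projects cannot collide under a generic name.
     */
    static String classpathLocation(String basePackage, String tool,
            String language, String variant) {
        return "/" + basePackage.replace('.', '/')
                + "/" + tool + "-" + language + "-" + variant + ".bin";
    }

    public static void main(String[] args) {
        String location = classpathLocation(
                "org.example.tc.model", "tagger", "en", "default");
        // A leading slash makes Class.getResource treat the name as absolute.
        URL url = ModelLocations.class.getResource(location);
        System.out.println(location + " -> "
                + (url == null ? "not on classpath" : url.toString()));
    }
}
```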

The "export as JAR" in Eclipse never worked well for me because of the way that uimaFIT
handles type detection.
For this reason, I am building runnable JARs using the maven-shade-plugin:

http://uima.apache.org/d/uimafit-current/tools.uimafit.book.html#ugr.tools.uimafit.packaging

Also nice are the Groovy scripts that we have in DKPro Core. They use Maven dependencies
for components and the DKPro Core model auto-loading mechanism for loading models (although
models could equally be added as Maven dependencies to the scripts - we just do not
do it because it saves some lines of code):

https://code.google.com/p/dkpro-core-asl/wiki/DraftGroovyIntro

Such scripts are not as fully standalone as an uber-jar because they still require
access to a Maven repository, but on the other hand they are really nice and
short and serve as good examples.

I think it would be good to solve the file loading problems before taking the next
step of creating an uber-jar - and for creating an uber-jar I would strongly recommend
the approach described in the uimaFIT documentation mentioned above.

What do you think about setting up a wiki page to write up a specification mentioning
the requirements and envisioned solutions? While discussing here, we might easily lose
track of what we actually want, why, and how we imagine solving it.

Reported by richard.eckart on 2015-03-08 08:32:27

daxenberger commented 9 years ago
I think I meant the maven-shade way when I was talking about an uber-jar; apparently
that is not the same?

The other wiki pages are all of the kind "how to use TC". I see no discussion pages
for work in progress.

Reported by Tobias.Horsmann on 2015-03-08 12:18:58

daxenberger commented 9 years ago
There are various ways of building uber-jars. The maven-shade-plugin is special in the
sense that it can be configured to properly handle (merge) certain configuration files
that reside in well-known places in the classpath. Other uber-jar builders tend to
fail to handle such files properly.

Just create a new wiki page ;) Back in the olden days, when I started implementing
the resource resolving mechanism in DKPro Core, I set up such a page in the DKPro Core
wiki [1]. This page eventually turned into the seed for documentation on resource packaging
in DKPro Core, but the requirements collected are still clearly visible.

[1] https://code.google.com/p/dkpro-core-asl/wiki/ResourceProviderAPI

Reported by richard.eckart on 2015-03-08 20:32:54

daxenberger commented 9 years ago
I wrote things up here: https://code.google.com/p/dkpro-tc/wiki/ReusableTCModels

The page is not linked; I didn't know where to place it...

Reported by Tobias.Horsmann on 2015-03-09 07:32:47

daxenberger commented 9 years ago
Status:
I now have a minimal example of how to install and load a TC-created model as a DKPro
component. New things/thoughts that came up:

A machine learning product might depend on tailored preprocessing, e.g. ArkTagger
should be used with ArkTokenize. It should be possible to ship this 'recommended' preprocessing
with the product and offer a switch to turn it on/off. We would thus need to collect/ship
preprocessing information in a separate file.

The SaveModelTask should be extended to offer a switch to automatically install a
model into the local Maven repository. Is it possible to create an Ant script during
runtime of the task (like the ones currently used for installing various models by
hand, e.g. Treetagger etc.), or is another way more advisable?
Maybe a sample code snippet should be provided showing how to load and use this model in
an annotator, as a minimal working sample.

Reported by Tobias.Horsmann on 2015-04-09 07:01:10

daxenberger commented 9 years ago
I'm not sure if models should be automatically installed as Maven Artifacts. I think
instead, it should be possible to manually set a PARAM_MODEL_LOCATION on the annotator
component to point it to an undeployed model.
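
As a rough sketch of what "point it to an undeployed model" could mean in code: a location string is resolved either against the classpath or against the file system. The `classpath:` prefix handling and the class `ModelLocationResolver` are assumptions for illustration, not the actual TC or DKPro Core implementation:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.file.Paths;

// Hypothetical resolver; names and prefix handling are illustrative only.
final class ModelLocationResolver {

    static final String CLASSPATH_PREFIX = "classpath:";

    /** Resolves "classpath:/a/b/model" via the class loader, anything else as a file path. */
    static URL resolve(String location) {
        if (location.startsWith(CLASSPATH_PREFIX)) {
            String name = location.substring(CLASSPATH_PREFIX.length());
            // ClassLoader lookups expect names without a leading slash.
            if (name.startsWith("/")) {
                name = name.substring(1);
            }
            ClassLoader loader = Thread.currentThread().getContextClassLoader();
            if (loader == null) {
                loader = ClassLoader.getSystemClassLoader();
            }
            URL url = loader.getResource(name);
            if (url == null) {
                throw new IllegalArgumentException("Not on classpath: " + location);
            }
            return url;
        }
        try {
            // Plain paths are treated as files; this covers undeployed models.
            return Paths.get(location).toUri().toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Invalid location: " + location, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("my-model-folder"));
        System.out.println(resolve("classpath:/java/lang/Object.class"));
    }
}
```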

Reported by richard.eckart on 2015-04-12 19:58:10

daxenberger commented 9 years ago
I also prefer Richard's suggestion. As compared to POS tagging models and the like,
models trained with TC are very specific.
Once we figure out that a single TC model is really used by several users/components,
we can think about a (semi-automatic) procedure to deploy them into a local repository.

Reported by daxenberger.j on 2015-04-14 13:12:51

daxenberger commented 9 years ago
I also think that a model should be pretty simple. 

It should not contain any code to do pre-processing.

It should contain metadata that describes the provenance of the model, e.g. an XML
descriptor for the pipeline that was used to create it, information about the data
it was created from, who created it, when it was created, what tagset was used, etc.

However, in practice we find that most models out there do not come with such information.
I think this can be added iteratively after an initial proof-of-concept.
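
For a first iteration, such provenance metadata could be as simple as a properties file stored next to the model. The sketch below uses plain `java.util.Properties`; the class `ModelMetadata` and the key names are invented for this example:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper; the key names are illustrative, not a TC format.
final class ModelMetadata {

    /** Serializes provenance information in the java.util.Properties text format. */
    static String write(String creator, String created,
            String trainingData, String tagset) {
        Properties props = new Properties();
        props.setProperty("model.creator", creator);
        props.setProperty("model.created", created);
        props.setProperty("model.trainingData", trainingData);
        props.setProperty("model.tagset", tagset);
        StringWriter out = new StringWriter();
        try {
            props.store(out, "Model provenance (illustrative keys)");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    /** Parses metadata written by write(). */
    static Properties read(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String text = write("jane", "2015-04-15", "example-corpus", "example-tagset");
        System.out.println(read(text).getProperty("model.tagset"));
    }
}
```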

If it is possible to implement a tool that can - based on this metadata - reassemble
the preprocessing pipeline and run it -- or that can verify if an alternative pre-processing
pipeline is compatible with the present model - then this is very cool. But I think
this should be left to a separate to-be-implemented tool and not be a property of the
present proof-of-concept, of the model, or of the core model saving/loading code. 

Reported by richard.eckart on 2015-04-15 08:21:51

daxenberger commented 9 years ago
Agree. I would not mix preprocessing (code) and machine learning model in the same file.
Metadata is necessary of course (and, as Richard suggested, might in a later stage
be used to verify "compatibility" of model and user-specified preprocessing).

Reported by daxenberger.j on 2015-04-15 08:31:50

daxenberger commented 9 years ago
Hi,

I have used this model training/storing a few times now for creating models in Core for
FlexTag. The more often I use it, the less I like the idea of providing user-defined
features by defining them in the component that uses the model.
The user-defined features, especially the package naming, make the whole thing extremely
fragile.
Once we start providing models for languages other than German/English, this will lead
to a cluttered structure of features where no one will know exactly which model uses
which features.
FlexTag will become a user-feature graveyard where no one dares to touch anything, knowing
they will break something.
If someone nevertheless decides to change anything, he/she will have to retrain or at least fix by
hand the features.txt of N models (no one will do that...)

I know it will be a pain with a lot of work, and I do not know yet exactly what this
should look like, but we should ship user-defined features as .class files together
with the model.
There is no point in copying the user-defined features over and over again. This will
lead to N similar features which probably won't be reusable anyway because no one
but the original creator knows what a particular feature was intended for.
(If we manage to implement this for the user-defined features, we should add the TC
features to it, too.)
The model should not need anything else but the classifier module (e.g. Weka, CRFsuite).
Features and their configuration should be part of the model file. This would also
be useful from the viewpoint of versioning. Altering features later would have no
effect on any models created before the change.
I would rather suffer through this how-to-provide-features question now than resolve the chaos in
a year.

Reported by Tobias.Horsmann on 2015-05-30 08:44:29

daxenberger commented 9 years ago
I think your argumentation is reasonable.

Instead of adding the FEs as class files, I see two more alternatives:

* shipping the FEs as mini-scripts, e.g. in Groovy - in that way they stay human-readable
even in the model
* not shipping the FEs at all. In all other DKPro Core wrappers, it is the job of the
wrapper to extract features from the CAS and prepare them in a way that the wrapped
tool/model expects them. The same could be done here - but it would limit the flexibility
of the model of course.

Reported by richard.eckart on 2015-05-30 08:49:41

daxenberger commented 9 years ago
The Groovy way sounds interesting.
I have never used Groovy, but something human-readable is preferable. What exactly
would the Groovy method look like? Is it a kind of Groovy serialization, or what would
a script representation of an FE look like?

Reported by Tobias.Horsmann on 2015-05-31 09:43:41

daxenberger commented 9 years ago

Reported by Tobias.Horsmann on 2015-05-31 09:43:59

daxenberger commented 9 years ago
There are basically two approaches:

1) using a basic Groovy script - to do so, you would set up an environment with some
well-known variables, e.g. some variable for the CAS and the script would access those,
possibly writing results to another well-known variable. 

2) using a dynamically compiled Groovy-based class - here, you'd define an interface
in Java, create a class in a Groovy file which is loaded and compiled at runtime

Having tried both in the past, I'd go for the second option because it makes it more
obvious which API is provided by the interface. I don't like the "well known" variables
too much. But both approaches do have their merits.

Documentation can be found here: http://docs.codehaus.org/display/GROOVY/Embedding+Groovy
- for the second approach in particular in the section "Dynamically loading and running
Groovy code inside Java".

Reported by richard.eckart on 2015-05-31 09:55:12

daxenberger commented 9 years ago
I looked into it and method 2) looks really elegant.
Do you see a way to automatically copy the source files of the features during the
"save model" task?

Currently we write the package path of a feature (e.g. de.tudarmstadt.ukp.dkpro.tc.features.length.NrOfCharsUFE)

We would now need the Java source file. Training/storing of WSJ models is something you
want to do on a server, thus it has to work from within a jar file, too.
Do you see a way to achieve this, or will this require manual action, i.e. copying feature
source files by hand into a folder with a fixed convention in which the model loader
will look for features?

Reported by Tobias.Horsmann on 2015-05-31 14:16:23

daxenberger commented 9 years ago
I think it is reasonable to ask the one creating a model to copy the source files over
by hand, or does anyone see an automated way of providing the source files of a class
as part of the TC output?

Reported by Tobias.Horsmann on 2015-06-01 18:42:24

daxenberger commented 9 years ago
My suggestion was kind of going in the direction of not having class files to start
with and always using only "feature extraction scripts".

Reported by richard.eckart on 2015-06-01 18:45:06

daxenberger commented 9 years ago
Oh, so you would use the source files as resources and load them via Groovy during the
TrainStore-Task?

This would complicate the training process a bit, as one would no longer be able to
just reference them by name.
We would have to alter the feature extraction task for this, right? We would no longer
have a list of names, e.g. Arrays.asList(MyFeature.class.getName())

Reported by Tobias.Horsmann on 2015-06-01 18:51:36

daxenberger commented 9 years ago
Well, there's no such thing as a free beer ;) You asked for human readability - this
is one way to get it. The easier road would probably be one where you include class
files in the model JARs and use a custom class-loader to load them. 

Btw. in both cases you'd need some way to trace imports or assume that e.g. all files
within a certain package are self-contained or that the model packager explicitly says
which classes are transitively required by the FEs

Reported by richard.eckart on 2015-06-01 18:54:53

daxenberger commented 9 years ago
I assume that all FE dependencies (super classes, too) exist in the module that will
use the model.  

I think I still like the approach, although I am not sure exactly how things would have
to look.
Do you have a pointer for where to start? Do I have to change DKPro Lab for this, too, or
is this still only in TC?

Reported by Tobias.Horsmann on 2015-06-01 19:23:37

daxenberger commented 9 years ago
Let's leave Groovy out of the way for a moment because I think it makes the whole thing
a little more complicated.

So if you want to embed the FEs in the model, I'd suggest to use Class.getResource()
to locate the class files and then dump them into the model JAR into their original
packages. If the model JAR is on the classpath of the module using the model, that
should simply work.
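
A minimal sketch of this idea, assuming the FE classes are visible to the class loader of the saving code. The class `FeatureClassPackager` is hypothetical; it reads each class file as a stream, so it does not matter whether the class lives in a directory or inside another jar:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;

// Hypothetical helper, not the actual TC SaveModelUtils code.
final class FeatureClassPackager {

    /** Maps a class to its resource name, e.g. java.lang.String -> java/lang/String.class */
    static String classResourceName(Class<?> type) {
        return type.getName().replace('.', '/') + ".class";
    }

    /** Writes the class files into a jar, keeping their original packages. Closes the target. */
    static void writeClasses(OutputStream target, Class<?>... featureClasses) {
        try (JarOutputStream jar = new JarOutputStream(target)) {
            byte[] buffer = new byte[8192];
            for (Class<?> type : featureClasses) {
                String name = classResourceName(type);
                // The leading slash makes the lookup absolute.
                try (InputStream in = type.getResourceAsStream("/" + name)) {
                    if (in == null) {
                        throw new IllegalStateException("Class file not found: " + name);
                    }
                    jar.putNextEntry(new JarEntry(name));
                    for (int n = in.read(buffer); n != -1; n = in.read(buffer)) {
                        jar.write(buffer, 0, n);
                    }
                    jar.closeEntry();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Lists the entry names of a jar held in memory (used here only to check the result). */
    static List<String> listEntries(byte[] jarBytes) {
        List<String> names = new ArrayList<>();
        try (JarInputStream jar = new JarInputStream(new ByteArrayInputStream(jarBytes))) {
            for (JarEntry entry = jar.getNextJarEntry(); entry != null;
                    entry = jar.getNextJarEntry()) {
                names.add(entry.getName());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return names;
    }

    public static void main(String[] args) {
        // String.class stands in for a feature extractor class.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeClasses(out, String.class);
        System.out.println(listEntries(out.toByteArray()));
    }
}
```

Superclasses and other classes that an FE depends on are not collected by this sketch; they would have to be passed in explicitly or be present in the module that uses the model.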

The next step would IMHO be to use a custom classloader in TC to make sure the FE classes
are only loaded from the model JAR and not e.g. accidentally from some other model
JAR that might just include an FE with the same name.
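
One possible shape for such a classloader, as a sketch only: classes under a given FE package prefix are looked up exclusively in the model JAR, everything else is delegated to the parent as usual. The class name `ModelFirstClassLoader` and the prefix parameter are assumptions for this example:

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical loader; not part of TC.
final class ModelFirstClassLoader extends URLClassLoader {

    private final String fePackagePrefix;

    ModelFirstClassLoader(URL[] modelJars, ClassLoader parent, String fePackagePrefix) {
        super(modelJars, parent);
        this.fePackagePrefix = fePackagePrefix;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            if (name.startsWith(fePackagePrefix)) {
                Class<?> loaded = findLoadedClass(name);
                if (loaded == null) {
                    // Search only the model JAR(s); there is no fallback to the
                    // parent, so a same-named FE elsewhere is never picked up.
                    loaded = findClass(name);
                }
                if (resolve) {
                    resolveClass(loaded);
                }
                return loaded;
            }
            // Everything else (JDK, TC API, superclasses) comes from the parent.
            return super.loadClass(name, resolve);
        }
    }

    public static void main(String[] args) throws Exception {
        try (ModelFirstClassLoader loader = new ModelFirstClassLoader(
                new URL[0], ClassLoader.getSystemClassLoader(), "org.example.fe.")) {
            System.out.println(loader.loadClass("java.lang.String"));
        }
    }
}
```

This only works if the prefix covers the FE classes but not the base classes and interfaces they extend, which still have to come from the module using the model.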

DKPro Lab should not be affected. I think I mentioned it before - flextag should not
depend on DKPro Lab in any way.

Reported by richard.eckart on 2015-06-01 19:30:17

daxenberger commented 9 years ago
I implemented a version that stores/loads features as .class files. I ran into an issue
with Jenkins. The TC features are not found.

I am not really sure what the issue is here. The .class files should be in the .jar, no?

de.tudarmstadt.ukp.dkpro.lab.engine.ExecutionException: java.io.FileNotFoundException:
Source '/home/svc_jenkins/workspace/DKPro%20Text%20Classification%20Framework%20(Google%20Code)/dkpro-tc-features/target/dkpro-tc-features-0.8.0-SNAPSHOT.jarde/tudarmstadt/ukp/dkpro/tc/features/length/NrOfCharsUFE.class'
does not exist
    at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:767)
    at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:731)
    at de.tudarmstadt.ukp.dkpro.tc.core.util.SaveModelUtils.writeFeatureClassFiles(SaveModelUtils.java:176)
    at de.tudarmstadt.ukp.dkpro.tc.crfsuite.task.serialization.ModelSerializationDescription.execute(SaveModelCRFSuiteBatchTask.java:213)
    at de.tudarmstadt.ukp.dkpro.lab.engine.impl.ExecutableTaskEngine.run(ExecutableTaskEngine.java:55)

Reported by Tobias.Horsmann on 2015-06-02 16:30:49

daxenberger commented 9 years ago
The error is reproducible in the local workspace if one does "Maven install".
It seems like there is a "/" missing between SNAPSHOT.jar and the feature path.
I tried adding this "/", but it did not help.
The round-trip fails because the class files are seemingly not found in the jar :(

Does anyone have an idea?

Reported by Tobias.Horsmann on 2015-06-02 17:32:24

daxenberger commented 9 years ago
The copy operation of the .class files fails in .jar mode. Dunno what the problem is:
ERROR DefaultLoggingService:41 - [ModelSerializationDescription-TestSaveModel-9d830b77-0951-11e5-ac5e-b32249b941da]
Task failed [de.tudarmstadt.ukp.dkpro.tc.crfsuite.task.serialization.ModelSerializationDescription-TestSaveModel](caused
by FileNotFoundException: Source '/Users/toobee/.m2/repository/de/tudarmstadt/ukp/dkpro/tc/dkpro-tc-features/0.8.0-SNAPSHOT/dkpro-tc-features-0.8.0-SNAPSHOT.jar/de/tudarmstadt/ukp/dkpro/tc/features/length/NrOfCharsUFE.class'
does not exist)
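
For reference, two general Java facts that match the symptoms in these traces (this is not necessarily what the eventual fix looked like): a path that continues past a .jar file does not exist as a file on disk, and a URL path keeps escapes such as %20 until it is decoded. A small sketch with a hypothetical helper class `ResourceUrlCheck`:

```java
import java.net.URI;
import java.net.URL;

// Hypothetical helper for illustration only.
final class ResourceUrlCheck {

    /** Returns the URL of the class file that defines the given class. */
    static URL classFileUrl(Class<?> type) {
        return type.getResource("/" + type.getName().replace('.', '/') + ".class");
    }

    /** True only if the resource is a plain file on disk (protocol "file"). */
    static boolean isPlainFile(URL resource) {
        return "file".equals(resource.getProtocol());
    }

    /** Decodes percent-escapes such as %20 in a file: URL given as text. */
    static String decodedPath(String fileUrl) {
        return URI.create(fileUrl).getPath();
    }

    public static void main(String[] args) {
        // String.class is packaged inside the JDK, so its URL is not a file: URL;
        // the same holds for a class that sits inside a jar on the classpath.
        URL url = classFileUrl(String.class);
        System.out.println(url + " plain file? " + isPlainFile(url));
        System.out.println(decodedPath("file:/home/user/My%20Workspace/a.jar"));
    }
}
```

When `isPlainFile` is false, the content has to be read through `URL.openStream()` or `getResourceAsStream` rather than through file-based copy routines.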

Reported by Tobias.Horsmann on 2015-06-02 18:09:34

daxenberger commented 9 years ago
I left some code comments on some of the commits you did. You should have received corresponding
mails from GoogleCode.

Reported by richard.eckart on 2015-06-03 08:02:20

daxenberger commented 9 years ago
I fixed it, thx for the hints.

These are enough commits under this issue.

Reported by Tobias.Horsmann on 2015-06-03 19:22:32