dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

Integrate FlexTag #611

Closed reckart closed 8 years ago

reckart commented 9 years ago
Add a flexible PoS Tagger that enables easy change of models including creating/using
self-trained ones.

Original issue reported on code.google.com by Tobias.Horsmann on 2015-04-20 15:43:17

reckart commented 9 years ago
As far as I can see, the flextag module introduces a possible circular dependency between
DKPro Core and DKPro TC. The only reason this circular dependency does not kick in
right now is that the module uses a release version of DKPro TC.

Are there any ideas if/how we can avoid having DKPro TC depend on Core depend on TC?

Original issue reported on code.google.com by richard.eckart on 2015-04-21 11:57:14

reckart commented 9 years ago
I should ask differently: does anybody think this circular dependency is (or could be)
a problem?

Original issue reported on code.google.com by richard.eckart on 2015-04-21 12:03:08

reckart commented 9 years ago
I just checked: getting rid of DKPro Core dependencies in DKPro TC might not be an option,
there are several core modules (incl. de.tudarmstadt.ukp.dkpro.core.io.bincas-asl,
de.tudarmstadt.ukp.dkpro.core.api.metadata-asl) which are hard to avoid.

Original issue reported on code.google.com by daxenberger.j on 2015-04-21 13:50:39

reckart commented 9 years ago
BinCas classes could be copied over to TC. But anyway, they are not required when *using*
a model, they are part of the experimentation stuff. So they shouldn't be needed anyway.
Same for metadata-asl.

Original issue reported on code.google.com by richard.eckart on 2015-04-21 14:03:01

reckart commented 9 years ago
Ok, if we continue that way, maybe there is a chance to split DKPro TC Core into
two modules, one of which doesn't have any DKPro Core dependencies and can thus be reused
in DKPro Core itself.

Original issue reported on code.google.com by daxenberger.j on 2015-04-21 14:18:13

reckart commented 9 years ago
The first dependency issue I noticed is that the SaveModelTask in TC uses Lucene 4.4.0
while DKPro still uses 3.0.3. Unless one sets Lucene 4.4 explicitly, an exception is
thrown when the saved model is loaded (it originates in this dependency mismatch).

A further thing is that I have to use the TC 0.8 snapshot now due to bugfixes. Is
there a bugfix release of TC scheduled?

Original issue reported on code.google.com by Tobias.Horsmann on 2015-04-22 14:34:16

reckart commented 9 years ago
Regarding the DKPro TC release: the 0.8.0 release probably won't happen before we have moved to GitHub.
A bugfix release can be done anytime - if somebody volunteers to do so :)

Original issue reported on code.google.com by daxenberger.j on 2015-04-22 14:44:08

reckart commented 9 years ago
I am running some time-performance analysis and the TC backend is really slow. I used
JCas documents with 500 German sentences each and measured the time spent on feature
extraction, adding the features to the feature store and writing the feature file to disc
(averaged over several 500-sentence documents).

Feat Extract: ~13.5 seconds
Add to FeatSt: ~5.8 seconds
Write to disc: 23 seconds

While we reach good accuracy, we disappoint in speed. The biggest performance
drain is the IO of the training files to call CRFsuite. The actual tagging itself doesn't
take that long.
Are there any ideas how to get beyond snail-speed?
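
For anyone who wants to reproduce per-stage numbers like the ones above, here is a minimal sketch of a stage timer. The class name `StageTimer` and the stage labels are hypothetical and not part of DKPro TC; the sketch only assumes that each stage can be wrapped in a `Runnable` and that averages over several documents are wanted.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a DKPro TC class): collects elapsed times per named stage.
class StageTimer {
    private final Map<String, List<Long>> nanosByStage = new LinkedHashMap<>();

    // Stores one measurement (in nanoseconds) for the given stage.
    void record(String stage, long elapsedNanos) {
        nanosByStage.computeIfAbsent(stage, k -> new ArrayList<>()).add(elapsedNanos);
    }

    // Runs the work once and records how long it took.
    void time(String stage, Runnable work) {
        long start = System.nanoTime();
        work.run();
        record(stage, System.nanoTime() - start);
    }

    // Average over all measurements of a stage, in seconds; 0.0 if nothing was recorded.
    double averageSeconds(String stage) {
        List<Long> samples = nanosByStage.get(stage);
        if (samples == null || samples.isEmpty()) {
            return 0.0;
        }
        long total = 0;
        for (long nanos : samples) {
            total += nanos;
        }
        return total / 1e9 / samples.size();
    }
}
```

Calling `timer.time("Write to disc", () -> ...)` once per document and reading `averageSeconds("Write to disc")` at the end would give one averaged figure per stage.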

Original issue reported on code.google.com by Tobias.Horsmann on 2015-04-24 13:29:11

reckart commented 9 years ago
Well, TC is not optimized for speed, but for flexibility.

Your specific problem also seems to be caused by the peculiarities of the CRFsuite
wrapper and not TC itself.

Original issue reported on code.google.com by torsten.zesch on 2015-04-24 13:33:04

reckart commented 9 years ago
Check if crfsuite can be run in a streaming mode where it reads data from stdin instead
of from a file, like we do for TreeTagger in DKPro Core.

Switch to a Java-based CRF implementation, e.g. the Stanford CRF implementation.
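
The streaming idea could be sketched roughly as follows. This is not the existing CRFsuite wrapper or the TreeTagger wrapper code: the class `StreamingTaggerSketch` is hypothetical, the command line is left to the caller, and whether crfsuite accepts its data on stdin is exactly what would have to be checked first. The item layout (tab-separated fields, a blank line between sequences) is likewise an assumption about the input format.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: feed feature data to an external tagger via stdin instead of a file.
class StreamingTaggerSketch {

    // Formats one sequence: one tab-separated item per line, then a blank separator line.
    static String formatSequence(List<List<String>> items) {
        StringBuilder sb = new StringBuilder();
        for (List<String> item : items) {
            sb.append(String.join("\t", item)).append('\n');
        }
        return sb.append('\n').toString();
    }

    // Starts the given command, streams all sequences to its stdin, returns its stdout lines.
    static List<String> tag(List<String> command, List<List<List<String>>> sequences)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        // Write from a separate thread so that reading stdout below cannot deadlock
        // once the pipe buffers fill up.
        Thread feeder = new Thread(() -> {
            try (Writer writer = new BufferedWriter(new OutputStreamWriter(
                    process.getOutputStream(), StandardCharsets.UTF_8))) {
                for (List<List<String>> sequence : sequences) {
                    writer.write(formatSequence(sequence));
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        feeder.start();

        List<String> output = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.add(line);
            }
        }
        feeder.join();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Tagger process exited with code " + exitCode);
        }
        return output;
    }
}
```

Compared with writing a feature file first, nothing touches the disc here; the cost of starting the process for each call remains, so keeping one process alive across documents would be a further step.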

Original issue reported on code.google.com by richard.eckart on 2015-04-24 15:08:01

reckart commented 9 years ago
I have uploaded the model JAR for this project to our repository because otherwise the
build fails.

Doing so, I noticed that the model is quite sizeable for a POS tagger and appears to
include a Lucene index in addition to the crfsuite model. I wonder if these two parts
are so tightly coupled that they should be in the same artifact, or if they should
be two different artifacts. Mind, this is a rather philosophical question, because
right now, having them in one artifact facilitates the setup quite a bit. When implementing
the OpennlpSegmenter, I had to learn that having two different model artifacts for
a single analysis engine is a bit uncommon and uncomfortable at times (here: tokenizer
model / sentence splitter model).

Original issue reported on code.google.com by richard.eckart on 2015-04-27 08:03:31

reckart commented 9 years ago
Thanks for uploading the model.

About the size of the model:
This is a classical trade-off between convenience/flexibility and size.
We could make the model smaller by making it more specialized (e.g. not storing the
lucene index but only the part of the information that is actually used by the feature
extractors used).
At the moment I think flexibility beats size constraints, but we should keep an eye
on that.

Original issue reported on code.google.com by torsten.zesch on 2015-04-27 18:36:55

reckart commented 9 years ago
(No text was entered with this change)

Original issue reported on code.google.com by richard.eckart on 2015-04-30 09:36:27

reckart commented 8 years ago

I feel this module should stay out of the upcoming 1.8.0 release (#702). I don't feel comfortable about the dependency on DKPro TC - and currently it is even a SNAPSHOT dependency.

Horsmann commented 8 years ago

Agreed. TC 0.8 should be released before we use FlexTag.

Horsmann commented 8 years ago

@reckart What exactly is Jenkins trying to tell me with this dependency remark in the log file?

reckart commented 8 years ago
[WARNING] Unused declared dependencies found:
[WARNING]    org.dkpro.tc:dkpro-tc-features:jar:0.8.0:compile
[WARNING]    de.tudarmstadt.ukp.dkpro.core:de.tudarmstadt.ukp.dkpro.core.io.text-asl:jar:1.9.0-SNAPSHOT:compile
[WARNING]    org.dkpro.lab:dkpro-lab-core:jar:0.12.0:compile
[WARNING]    org.dkpro.tc:dkpro-tc-core:jar:0.8.0:compile

These dependencies are declared in the POM but they are not actually used in the code. To avoid extensive clean-up actions, we configured the Jenkins build to not let modules pass that declare too many or too few dependencies.

Horsmann commented 8 years ago

So, do I add a dependency exclusion? Or how do I solve this?

zesch commented 8 years ago

Or you remove the unused dependencies.

Horsmann commented 8 years ago

The dependencies are in the flextag release. I can't just remove them.

reckart commented 8 years ago

I assume what you want to say is that the dependencies are required by your model file. As far as I know, your model is a Maven artifact itself.

How about moving the TC dependencies from the DKPro Core module into your model's POM?

DKPro Lab shouldn't actually be required. Can this be removed?

Also, I don't think that your model should require a DKPro Core TextReader/TextWriter?

Horsmann commented 8 years ago

Yes, if there are models available you probably would need those dependencies.

I think the core dependency can be removed, but I still don't understand how to react to the failing Jenkins build. The dependencies are not needed at the moment, or rather the need would only come with a user-defined model. But they are part of the flextag-core/feature dependency. What do I do then?

reckart commented 8 years ago

If the dependencies are used in the code of the DKPro Core flextag module, then the dependency checking plugin should detect that. There are only a few cases where the plugin does not correctly detect that code is actually being used. One such example is if you use only constants (e.g. final static String) from a class. If you believe that the plugin is wrongly detecting a required dependency as unused, please tell us where this dependency is used. In such a case, an override for the dependency plugin can be set up.
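
The constants case can be illustrated with a small sketch. The class and field names below are made up for illustration and do not come from flextag or DKPro TC. The point is that javac copies compile-time constants into the class that uses them, so the compiled user keeps no field reference to the class that declares the constant. An analysis that works on compiled classes would therefore presumably see no usage. The sketch makes the effect observable at runtime: reading the constant does not even initialize the declaring class.

```java
// Hypothetical names throughout; only the Java language behaviour is the point.
class ConstantInlining {
    static boolean holderInitialized = false;

    static class Holder {
        // Compile-time constant: javac inlines the value at every use site.
        static final String PARAM_NAME = "modelLocation";
        // Not a constant expression: reading it requires a real field access.
        static final String RUNTIME_VALUE = String.valueOf("resolved at runtime");

        static {
            holderInitialized = true;
        }
    }

    // Compiles to a plain string load; Holder is not touched.
    static String readConstant() {
        return Holder.PARAM_NAME;
    }

    // Compiles to a field access on Holder, which triggers its initialization.
    static String readRuntimeValue() {
        return Holder.RUNTIME_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(readConstant() + " / initialized: " + holderInitialized);
        System.out.println(readRuntimeValue() + " / initialized: " + holderInitialized);
    }
}
```

If `Holder` lived in another Maven module and only `PARAM_NAME` were used, that module would still be needed to compile, which is the situation where an override for the dependency check is appropriate.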