Extract a gate library that only does NLP

bamthomas commented 5 years ago

We are using Gate in our project http://github.com/ICIJ/datashare among other NLP pipelines (OpenNLP, CoreNLP, IxaPipe and Mitie).

We already have a library, http://github.com/ICIJ/extract, that extracts text from various files and also uses Tika (pdfbox...). Its Tika version conflicts with the one GATE brings in.

Would it be possible to extract a GATE library that does only NLP annotation (text -> annotated text), without the text extraction parts?

I can try to make a PR but before that I wanted to know what you guys are thinking about this?

Thank you for your answer(s).

greenwoodma commented 5 years ago

We use Tika for loading PDFs and some of the other document formats. If you don't need those formats (i.e. you just load plain text, HTML or XML) then you might be able to just exclude Tika from GATE's dependencies.
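For illustration, a Maven exclusion along these lines is one way to try that. This is a sketch under assumptions: it uses the `uk.ac.gate:gate-core` coordinates and assumes Tika arrives as `org.apache.tika:tika-core` and `org.apache.tika:tika-parsers`. Check `mvn dependency:tree` for the artifacts actually pulled in. Whether GATE initializes cleanly without Tika on the classpath is not confirmed here.

```xml
<dependency>
  <groupId>uk.ac.gate</groupId>
  <artifactId>gate-core</artifactId>
  <version>8.6-SNAPSHOT</version>
  <exclusions>
    <!-- Assumed artifact names; verify them with mvn dependency:tree -->
    <exclusion>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```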

johann-petrak commented 5 years ago

Is there a reason why we need to depend on such an old version of Tika? I think we depend on 1.7 from January 2015 and the current version is 1.20.

BTW @bamthomas what kind of conflict do you get exactly?

greenwoodma commented 5 years ago

> Is there a reason why we need to depend on such an old version of Tika? I think we depend on 1.7 from January 2015 and the current version is 1.20.
>
> BTW @bamthomas what kind of conflict do you get exactly?

I think when I last tried to update it I got some odd errors from the unit tests and I didn't have the time to investigate them properly so left it on 1.7. It would make sense to upgrade it if we can.

johann-petrak commented 5 years ago

While waiting for a long download I tried Tika version 1.20 with the latest 8.6-SNAPSHOT code of gate-core. It turns out that two of the libraries which are currently excluded now need to be included in the pom dependencies: com.adobe.xmp/xmpcore and com.drewnoakes/metadata-extractor.

When I remove those from the excludes, the compile and the unit tests work fine. (Not including them gives class-not-found exceptions when running the unit tests. After including them, there is still a warning about Xerial's sqlite-jdbc being missing, but the tests pass.)

Using that GATE version from some of my LF tests and pipelines did not show any obvious bugs or problems.

greenwoodma commented 5 years ago

Odd. I thought both of those were only used by the image formats which we don't need, but if not excluding them works, then I say make the changes and let's update to Tika 1.20.

bamthomas commented 5 years ago

The conflict seems to be on pdfbox. When I include Gate, I get errors like:

java.io.IOException:
    at org.apache.tika.parser.ParsingReader.read(ParsingReader.java:274)
    at java.io.Reader.read(Reader.java:140)
    at org.icij.spewer.Spewer.copy(Spewer.java:104)
    at org.icij.spewer.Spewer.toString(Spewer.java:114)
    at org.icij.datashare.text.indexing.elasticsearch.ElasticsearchSpewer.getMap(ElasticsearchSpewer.java:115)
    at org.icij.datashare.text.indexing.elasticsearch.ElasticsearchSpewer.prepareRequest(ElasticsearchSpewer.java:81)
    at org.icij.datashare.text.indexing.elasticsearch.ElasticsearchSpewer.indexDocument(ElasticsearchSpewer.java:134)
    at org.icij.datashare.text.indexing.elasticsearch.ElasticsearchSpewer.write(ElasticsearchSpewer.java:72)
    at org.icij.extract.extractor.Extractor.extract(Extractor.java:272)
    at org.icij.extract.extractor.DocumentConsumer.lambda$accept$0(DocumentConsumer.java:125)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.tika.parser.rtf.TextExtractor
    at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:97)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)
    at org.apache.tika.parser.ParsingReader$ParsingTask.run(ParsingReader.java:235)

The same happens for other file types. I don't get these errors when I'm not including the Gate jar. I have relocated the tika/lucene/pdfbox/fontbox/james libraries with the maven shade plugin.

But yes I could try to remove the dependencies when packaging my jar, even though I don't find this elegant.
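To narrow down a conflict like this, it can help to print which jar each suspect class is actually loaded from. In general, "Could not initialize class" means the class was found but its static initializer failed earlier, which is a common symptom of mixed library versions. The following is a minimal sketch using only the JDK. The class name `WhichJar` and the helper `locate` are made up for illustration, and the list of classes to check is an assumption based on the stack trace above.

```java
import java.security.CodeSource;

class WhichJar {
    /** Returns the location a class is loaded from, or a note if it cannot be determined. */
    static String locate(String className) {
        try {
            // initialize=false: load the class without running its static initializer
            Class<?> c = Class.forName(className, false, WhichJar.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            return src == null
                    ? "(no code source: JDK/bootstrap class)"
                    : src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not loadable: " + e + ")";
        }
    }

    public static void main(String[] args) {
        // Hypothetical list of suspects; adjust to the classes in your own trace
        String[] names = {
            "org.apache.tika.parser.rtf.TextExtractor",
            "org.apache.tika.Tika",
            "org.apache.pdfbox.pdmodel.PDDocument"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

Run it with the same classpath as the failing application. If the Tika and pdfbox classes resolve to jars of different releases than expected, that points at the version clash.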

johann-petrak commented 5 years ago

I do not really understand that exception, but in the meantime we have upgraded the Tika dependency of gate-core version 8.6-SNAPSHOT to Tika version 1.20. You may want to check whether using that version improves anything (the SNAPSHOT is staged on our own repo at http://repo.gate.ac.uk/content/groups/public/).

Alternatively, you could try declaring a dependency on your preferred Tika version in your top-level pom, which might then override the version inherited through the GATE dependency, but I am not sure exactly how the shade plugin handles this.
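As a sketch of that idea: Maven applies versions declared in `dependencyManagement` to transitive dependencies as well, so pinning Tika there should force a single version across the build. The artifact names and the version below are assumptions. Substitute whatever your own extraction library needs, and note that how this interacts with shade relocation is untested here.

```xml
<dependencyManagement>
  <dependencies>
    <!-- Assumed artifacts; 1.20 stands in for your preferred Tika version -->
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>1.20</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>1.20</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Running `mvn dependency:tree` afterwards shows which version was finally selected.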

greenwoodma commented 5 years ago

There is a trick to do this by replacing the default creole.xml that gets loaded at Gate.init() but it requires knowing a lot about the internals of GATE and it needs to be an actual File. We're considering including a minimal version of the file inside gate-core.jar so that a single method call before initialization would allow you to switch to this version. You could then exclude Tika (and any other libs needed by the default resources) which would solve the original issue.

GateNLP / gate-core

Extract a gate library that only does NLP #65