codeaudit / dkpro-core-asl

Automatically exported from code.google.com/p/dkpro-core-asl
0 stars 0 forks source link

Upgrade to clearnlp-2.0.2 #213

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Upgrade to clearnlp-1.4.2

Original issue reported on code.google.com by richard.eckart on 1 Sep 2013 at 10:32

GoogleCodeExporter commented 9 years ago
Here is a list of changes: 
https://groups.google.com/forum/?fromgroups#!topic/clearnlp/0uhP_VDJf9c

Mainly changes to the SRL part, including new models.

Original comment by richard.eckart on 1 Sep 2013 at 12:50

GoogleCodeExporter commented 9 years ago

Original comment by pedrobss...@gmail.com on 10 Sep 2013 at 9:25

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 11 Sep 2013 at 12:28

GoogleCodeExporter commented 9 years ago
We will *not* upgrade to 1.4.x at this time. The parser models for the new 
version are hilariously large. The ontonotes model is ~1.4 GB! Not even xz can 
squeeze this properly. The release notes of version 1.4.0 state the parser 
models would be 6x larger (they had been ~60 MB, so that would be ~360 MB), but 
its more like 23x larger!

Original comment by richard.eckart on 13 Sep 2013 at 8:24

GoogleCodeExporter commented 9 years ago
The upgrade should be done to 2.0.0, since this version will be released soon.

Original comment by pedrobss...@gmail.com on 13 Oct 2013 at 12:57

GoogleCodeExporter commented 9 years ago
Folks,

The ClearNLP 2.0.1 is released with minor bug-fixes.  All general models are 
also updated (see the following details).
There were some errors in the training data for the general models, which 
caused to produce too many unclassified dependencies. All general models are 
updated (see how to add models).
Constructors are added to EnglishMPAnalyzer and EnglishTokenizer; they can 
accept the dictionary file as an InputStream.
new EnglishMPAnalyzer(new BufferedInputStream(new 
FileInputStream("dictionary_filename")));
new EnglishTokenizer(new BufferedInputStream(new 
FileInputStream("dictionary_filename")));
There was a bug in semantic role labeler causing a null point exception (see 
post). This is fixed now.

Many of you have asked a more thorough documentation, which I don't have time 
to work on at the moment, but this will be the first thing to do shortly after 
the ACL deadline.  Thanks and happy Thanksgiving!

best,

Jinho

Original comment by richard.eckart on 28 Nov 2013 at 6:13

GoogleCodeExporter commented 9 years ago
The new models from ClearNLP 2.0.1 are differently compressed. For instance, 
the dependency parser model was compressed like:

parse-en-ontonotes
|--dep_CONFIGURATION
|--dep_FEATURE0
|--dep_LEXICA
|--dep_MODEL0

but now it is compressed like:

parse-en-ontonotes
|--general-en
|   |--dep
|--META-INF
|   |--maven
|   |   |--com.clearnlp
|   |       |--clearnlp-general-en-dep
|   |           |--pom.properties
|   |           |--pom.xml
|   |--MANIFEST.MF

Besides that, the NLPGetter (former EngineGetter in ClearNLP 1.3.1) is working 
now with ObjectInputStream instead of InputStream. So, to make it work, the 
code which was before like:

is = aUrl.openStream();
CDEPPassParser parser = (CDEPPassParser) 
EngineGetter.getComponent(is,getAggregatedProperties().
                            getProperty(LANGUAGE), NLPLib.MODE_DEP);

Is now like:

is = aUrl.openStream();
String language = getAggregatedProperties().getProperty(LANGUAGE);
zis = new ZipInputStream(is);
ZipEntry ze = zis.getNextEntry();
String zipEntryName = 
getAggregatedProperties().getProperty(VARIANT).equals("mayo") ? 
"medical-en/dep" 
                        : "general-en/dep";
while(ze != null){
    if(zipEntryName.equals(ze.getName())){
        break;
    }
    ze = zis.getNextEntry();
}
gis = new GZIPInputStream(zis);
bis = new BufferedInputStream(gis);
ois = new ObjectInputStream(bis);
AbstractDEPParser parser = NLPGetter.getDEPParser(ois, language);

I really don't know if the model should be searched by a loop inside the java 
code, or if we should do such a thing in the build.xml.

Please let me know your thoughts.

Original comment by pedrobss...@gmail.com on 15 Jan 2014 at 7:55

GoogleCodeExporter commented 9 years ago
I tried to convince Jinho to put the models into 1) separate Maven artifacts 
and 2) into properly named packages in these artifacts. I succeeded with 1, but 
not with 2.

I think we should unpack the stuff in the build.xml and pack it again with the 
usual package structure that we have. That should at least remove the ugly line 
where you have to check the VARIANT for "mayo" or "medical".

As an alternative, we could also change the variant names. to "general" and 
"medical". Then we could wrap the official ClearNLP model JAR in our JAR as we 
always did and still get rid of the line.

There has been some discussion on retaining constructors using simple 
InputStreams here: 
https://groups.google.com/forum/?fromgroups#!topic/clearnlp/6ffHP8tMjU8
Whatever we do, I think we should follow up on that thread and tell them our 
current dilemma. I can take care of that.

Original comment by richard.eckart on 15 Jan 2014 at 10:40

GoogleCodeExporter commented 9 years ago
Posted this to the ClearNLP list, but as a continuation of a different thread:

https://groups.google.com/d/msg/clearnlp/3qynXATMhrQ/QAj3lgIRmfcJ

Original comment by richard.eckart on 15 Jan 2014 at 11:00

GoogleCodeExporter commented 9 years ago
Btw. if the models were placed into proper packages in the official JARs, we 
would have done the following:

<install-model-stub 
  groupId="de.tudarmstadt.ukp.dkpro.core" 
  artifactIdBase="de.tudarmstadt.ukp.dkpro.core.stanfordnlp"
  tool="parser" 
  language="en" 
  variant="ontonotes" 
  version="${corenlp.version}.1" 
  targetGroupId="com.clearnlp" 
  targetArtifactId="clearnlp-general-en-dep" 
  targetVersion="1.2" 
  targetClassifier=""
  targetLocation="classpath:/general/dep">      
  <metadata>
    ...                       
  </metadata>
</install-model-stub>   

The model provider would then automatically redirect a resolving of 
"classpath:/${package}/lib/parser-${language}-${variant}.properties" to the 
actual model at "classpath:/general/dep".

Original comment by richard.eckart on 15 Jan 2014 at 11:26

GoogleCodeExporter commented 9 years ago
Agree, I will make the proper changes to the build.xml to avoid these lines in 
the code. Regarding the constructors with a InputStream, as far I remember, 
only EnglishMPAnalyzer has a constructor with InputStream in ClearNLP 2.0.1.

Another problem is regarding the ClearNlpPosTagger. EnglishPOSTagger creates an 
instance of the EnglishMPAnalyzer, by calling a constructor from this class 
without parameters, and since it does not have parameters, it tries to read the 
dictionary by a path hard-coded in the class, which corresponds to the path of 
the dictionary in the classpath when using the maven dependency for this 
dictionary. To summarize it, to use the EnglishPOSTagger, it is necessary to 
add the clearnlp dictionary dependency to the pom file. The thing is: This 
dictionary is already wrapped as a resource and added to the classpath, so we 
have the same file duplicated in the classpath. Any suggestions about this?

Original comment by pedrobss...@gmail.com on 15 Jan 2014 at 2:22

GoogleCodeExporter commented 9 years ago
Complain to Jinho about the hard-coded path. Maybe he can do a ClearNLP 3.0.2

Original comment by richard.eckart on 15 Jan 2014 at 2:24

GoogleCodeExporter commented 9 years ago
Can you tell the ClearNlpPosTagger *not* to initialize and call the 
EnglishMPAnalyzer?

Original comment by richard.eckart on 15 Jan 2014 at 2:25

GoogleCodeExporter commented 9 years ago
Sorry about the late response, I was having a hard time to make the Semantic 
Role Labeler work and forgot about this thread. I tried to find a way for not 
calling this constructor without parameters from EnglishMPAnalyzer but there is 
not.

Regarding the models, are they going to be deployed in artifactory? They are 
huge...

Original comment by pedrobss...@gmail.com on 24 Jan 2014 at 10:54

GoogleCodeExporter commented 9 years ago
How large is huge? 

Did you report the problem about the EnglishMPAnalyzer in the ClearNLP mailing 
list already?

Original comment by richard.eckart on 25 Jan 2014 at 12:19

GoogleCodeExporter commented 9 years ago
About 700mb.
Not yet, Iam going to post that soon.

Original comment by pedrobss...@gmail.com on 25 Jan 2014 at 12:24

GoogleCodeExporter commented 9 years ago
Let's wait for the reaction from ClearNLP then. ClearNLP provides their models 
also via Maven Central. We could always decide to use those - even though they 
are not optimally packaged in my opinion.

Original comment by richard.eckart on 25 Jan 2014 at 12:53

GoogleCodeExporter commented 9 years ago
Regarding the problem with the dictionary being duplicated in the classpath, I 
was able to solve it by extending EnglishPOSTagger and overriding the 
initializer method which calls the EnglishMPAnalyzer constructor without 
parameters. I've overwritten it to instantiate EnglishMPAnalyzer passing the 
InputStream for our wrapped dictionary.

Original comment by pedrobss...@gmail.com on 27 Jan 2014 at 2:02

GoogleCodeExporter commented 9 years ago
Nice. I think it would still be good to bring this up on the ClearNLP list 
though. Fixing such stuff via inheritance is not the best solution in the long 
term.

Original comment by richard.eckart on 27 Jan 2014 at 2:17

GoogleCodeExporter commented 9 years ago

Original comment by pedrobss...@gmail.com on 1 Feb 2014 at 3:19

GoogleCodeExporter commented 9 years ago
Could someone please review r2188? :)
I am pretty sure it is not the most elegant solution, since the packaging from 
ClearNLP models at maven central is different from our packaging.

Original comment by pedrobss...@gmail.com on 3 Feb 2014 at 3:16

GoogleCodeExporter commented 9 years ago
This issue was updated by revision r2212.

Resolve model via properties file - removed condition.

Original comment by richard.eckart on 12 Feb 2014 at 9:47

GoogleCodeExporter commented 9 years ago
This issue was updated by revision r2213.

Resolve model via properties file - removed condition.

Original comment by richard.eckart on 12 Feb 2014 at 10:13

GoogleCodeExporter commented 9 years ago
This issue was updated by revision r2214.

Resolve model via properties file - removed condition.

Original comment by richard.eckart on 12 Feb 2014 at 10:18

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 18 Feb 2014 at 2:17