Closed GoogleCodeExporter closed 9 years ago
Here is a list of changes:
https://groups.google.com/forum/?fromgroups#!topic/clearnlp/0uhP_VDJf9c
Mainly changes to the SRL part, including new models.
Original comment by richard.eckart
on 1 Sep 2013 at 12:50
Original comment by pedrobss...@gmail.com
on 10 Sep 2013 at 9:25
Original comment by richard.eckart
on 11 Sep 2013 at 12:28
We will *not* upgrade to 1.4.x at this time. The parser models for the new
version are hilariously large. The ontonotes model is ~1.4 GB! Not even xz can
squeeze this properly. The release notes of version 1.4.0 state the parser
models would be 6x larger (they had been ~60 MB, so that would be ~360 MB), but
its more like 23x larger!
Original comment by richard.eckart
on 13 Sep 2013 at 8:24
The upgrade should be done to 2.0.0, since this version will be released soon.
Original comment by pedrobss...@gmail.com
on 13 Oct 2013 at 12:57
Folks,
The ClearNLP 2.0.1 is released with minor bug-fixes. All general models are
also updated (see the following details).
There were some errors in the training data for the general models, which
caused to produce too many unclassified dependencies. All general models are
updated (see how to add models).
Constructors are added to EnglishMPAnalyzer and EnglishTokenizer; they can
accept the dictionary file as an InputStream.
new EnglishMPAnalyzer(new BufferedInputStream(new
FileInputStream("dictionary_filename")));
new EnglishTokenizer(new BufferedInputStream(new
FileInputStream("dictionary_filename")));
There was a bug in semantic role labeler causing a null point exception (see
post). This is fixed now.
Many of you have asked a more thorough documentation, which I don't have time
to work on at the moment, but this will be the first thing to do shortly after
the ACL deadline. Thanks and happy Thanksgiving!
best,
Jinho
Original comment by richard.eckart
on 28 Nov 2013 at 6:13
The new models from ClearNLP 2.0.1 are differently compressed. For instance,
the dependency parser model was compressed like:
parse-en-ontonotes
|--dep_CONFIGURATION
|--dep_FEATURE0
|--dep_LEXICA
|--dep_MODEL0
but now it is compressed like:
parse-en-ontonotes
|--general-en
| |--dep
|--META-INF
| |--maven
| | |--com.clearnlp
| | |--clearnlp-general-en-dep
| | |--pom.properties
| | |--pom.xml
| |--MANIFEST.MF
Besides that, the NLPGetter (former EngineGetter in ClearNLP 1.3.1) is working
now with ObjectInputStream instead of InputStream. So, to make it work, the
code which was before like:
is = aUrl.openStream();
CDEPPassParser parser = (CDEPPassParser)
EngineGetter.getComponent(is,getAggregatedProperties().
getProperty(LANGUAGE), NLPLib.MODE_DEP);
Is now like:
is = aUrl.openStream();
String language = getAggregatedProperties().getProperty(LANGUAGE);
zis = new ZipInputStream(is);
ZipEntry ze = zis.getNextEntry();
String zipEntryName =
getAggregatedProperties().getProperty(VARIANT).equals("mayo") ?
"medical-en/dep"
: "general-en/dep";
while(ze != null){
if(zipEntryName.equals(ze.getName())){
break;
}
ze = zis.getNextEntry();
}
gis = new GZIPInputStream(zis);
bis = new BufferedInputStream(gis);
ois = new ObjectInputStream(bis);
AbstractDEPParser parser = NLPGetter.getDEPParser(ois, language);
I really don't know if the model should be searched by a loop inside the java
code, or if we should do such a thing in the build.xml.
Please let me know your thoughts.
Original comment by pedrobss...@gmail.com
on 15 Jan 2014 at 7:55
I tried to convince Jinho to put the models into 1) separate Maven artifacts
and 2) into properly named packages in these artifacts. I succeeded with 1, but
not with 2.
I think we should unpack the stuff in the build.xml and pack it again with the
usual package structure that we have. That should at least remove the ugly line
where you have to check the VARIANT for "mayo" or "medical".
As an alternative, we could also change the variant names. to "general" and
"medical". Then we could wrap the official ClearNLP model JAR in our JAR as we
always did and still get rid of the line.
There has been some discussion on retaining constructors using simple
InputStreams here:
https://groups.google.com/forum/?fromgroups#!topic/clearnlp/6ffHP8tMjU8
Whatever we do, I think we should follow up on that thread and tell them our
current dilemma. I can take care of that.
Original comment by richard.eckart
on 15 Jan 2014 at 10:40
Posted this to the ClearNLP list, but as a continuation of a different thread:
https://groups.google.com/d/msg/clearnlp/3qynXATMhrQ/QAj3lgIRmfcJ
Original comment by richard.eckart
on 15 Jan 2014 at 11:00
Btw. if the models were placed into proper packages in the official JARs, we
would have done the following:
<install-model-stub
groupId="de.tudarmstadt.ukp.dkpro.core"
artifactIdBase="de.tudarmstadt.ukp.dkpro.core.stanfordnlp"
tool="parser"
language="en"
variant="ontonotes"
version="${corenlp.version}.1"
targetGroupId="com.clearnlp"
targetArtifactId="clearnlp-general-en-dep"
targetVersion="1.2"
targetClassifier=""
targetLocation="classpath:/general/dep">
<metadata>
...
</metadata>
</install-model-stub>
The model provider would then automatically redirect a resolving of
"classpath:/${package}/lib/parser-${language}-${variant}.properties" to the
actual model at "classpath:/general/dep".
Original comment by richard.eckart
on 15 Jan 2014 at 11:26
Agree, I will make the proper changes to the build.xml to avoid these lines in
the code. Regarding the constructors with a InputStream, as far I remember,
only EnglishMPAnalyzer has a constructor with InputStream in ClearNLP 2.0.1.
Another problem is regarding the ClearNlpPosTagger. EnglishPOSTagger creates an
instance of the EnglishMPAnalyzer, by calling a constructor from this class
without parameters, and since it does not have parameters, it tries to read the
dictionary by a path hard-coded in the class, which corresponds to the path of
the dictionary in the classpath when using the maven dependency for this
dictionary. To summarize it, to use the EnglishPOSTagger, it is necessary to
add the clearnlp dictionary dependency to the pom file. The thing is: This
dictionary is already wrapped as a resource and added to the classpath, so we
have the same file duplicated in the classpath. Any suggestions about this?
Original comment by pedrobss...@gmail.com
on 15 Jan 2014 at 2:22
Complain to Jinho about the hard-coded path. Maybe he can do a ClearNLP 3.0.2
Original comment by richard.eckart
on 15 Jan 2014 at 2:24
Can you tell the ClearNlpPosTagger *not* to initialize and call the
EnglishMPAnalyzer?
Original comment by richard.eckart
on 15 Jan 2014 at 2:25
Sorry about the late response, I was having a hard time to make the Semantic
Role Labeler work and forgot about this thread. I tried to find a way for not
calling this constructor without parameters from EnglishMPAnalyzer but there is
not.
Regarding the models, are they going to be deployed in artifactory? They are
huge...
Original comment by pedrobss...@gmail.com
on 24 Jan 2014 at 10:54
How large is huge?
Did you report the problem about the EnglishMPAnalyzer in the ClearNLP mailing
list already?
Original comment by richard.eckart
on 25 Jan 2014 at 12:19
About 700mb.
Not yet, Iam going to post that soon.
Original comment by pedrobss...@gmail.com
on 25 Jan 2014 at 12:24
Let's wait for the reaction from ClearNLP then. ClearNLP provides their models
also via Maven Central. We could always decide to use those - even though they
are not optimally packaged in my opinion.
Original comment by richard.eckart
on 25 Jan 2014 at 12:53
Regarding the problem with the dictionary being duplicated in the classpath, I
was able to solve it by extending EnglishPOSTagger and overriding the
initializer method which calls the EnglishMPAnalyzer constructor without
parameters. I've overwritten it to instantiate EnglishMPAnalyzer passing the
InputStream for our wrapped dictionary.
Original comment by pedrobss...@gmail.com
on 27 Jan 2014 at 2:02
Nice. I think it would still be good to bring this up on the ClearNLP list
though. Fixing such stuff via inheritance is not the best solution in the long
term.
Original comment by richard.eckart
on 27 Jan 2014 at 2:17
Original comment by pedrobss...@gmail.com
on 1 Feb 2014 at 3:19
Could someone please review r2188? :)
I am pretty sure it is not the most elegant solution, since the packaging from
ClearNLP models at maven central is different from our packaging.
Original comment by pedrobss...@gmail.com
on 3 Feb 2014 at 3:16
This issue was updated by revision r2212.
Resolve model via properties file - removed condition.
Original comment by richard.eckart
on 12 Feb 2014 at 9:47
This issue was updated by revision r2213.
Resolve model via properties file - removed condition.
Original comment by richard.eckart
on 12 Feb 2014 at 10:13
This issue was updated by revision r2214.
Resolve model via properties file - removed condition.
Original comment by richard.eckart
on 12 Feb 2014 at 10:18
Original comment by richard.eckart
on 18 Feb 2014 at 2:17
Original issue reported on code.google.com by
richard.eckart
on 1 Sep 2013 at 10:32