Closed reckart closed 9 years ago
Do you have any idea what this format is officially called? I know the "Penn Treebank" format
only as the bracketed structure.
Original issue reported on code.google.com by richard.eckart
on 2014-08-01 11:34:21
>> - corpora contain noun phrase annotations (in addition to the tags), is there a type
to annotate noun phrases in DKPro?
one possibility is to use the type Constituent and set constituentType to nounPhrase
Judith
Original issue reported on code.google.com by eckle.kohler
on 2014-08-01 11:40:07
@noun phrases: I'd suggest using "Chunk" annotations
@multiple POS tags: I'd take only the first one
@no-tag: there is no 'no tag' value, but I think there is a . You could simply not
create a POS annotation for those tokens. That might cause problems with downstream components
that expect all tokens to have a POS. You could consider running a POS tagger that
accepts partially pre-tagged input to fill in the missing tags. In principle, TreeTagger could
do that, but I believe the DKPro Core TreeTagger component does not handle partially pre-tagged
documents.
Original issue reported on code.google.com by richard.eckart
on 2014-08-01 11:45:53
> @noun phrases: I'd suggest using "Chunk" annotations
why?
chunks and noun phrases are not the same;
for some users (e.g. me) this might be confusing
Original issue reported on code.google.com by eckle.kohler
on 2014-08-01 11:56:59
I don't know the name of the format, sorry. I don't think it has a name; it's a forward-slash
separated token/tag plain text annotation. The NP marking in brackets was, I assume,
the reason why they added so many line breaks: to make the text file easier to
process.
A new case: they also annotated whether a word was misspelled and, if so, added
the tag it should have had if it were written correctly, as in this example:
the/DT students/^NNS^POS parents/NNS
The missing ' caused the NNS tag, but it should have been students' and thus POS as the tag.
Original issue reported on code.google.com by Tobias.Horsmann
on 2014-08-01 12:12:06
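A minimal sketch of how such a corrected tag might be split, assuming (from the single example above only) that a tag of the form ^A^B means "A" is the tag for the word as written and "B" is the tag it should have had. The class name CorrectedTag and its fields are hypothetical and not part of DKPro Core.

```java
/** Hypothetical helper, not part of DKPro Core. */
final class CorrectedTag {
    final String asWritten; // tag for the token as it appears in the text
    final String intended;  // tag the token should have had, or null if none

    private CorrectedTag(String asWritten, String intended) {
        this.asWritten = asWritten;
        this.intended = intended;
    }

    /** Splits a raw tag such as "^NNS^POS"; plain tags like "DT" pass through. */
    static CorrectedTag parse(String rawTag) {
        if (!rawTag.startsWith("^")) {
            return new CorrectedTag(rawTag, null);
        }
        // Drop the leading '^', then split on the remaining '^' separators.
        String[] parts = rawTag.substring(1).split("\\^");
        String intended = parts.length > 1 ? parts[1] : null;
        return new CorrectedTag(parts[0], intended);
    }
}
```

Which of the two tags ends up in the POS annotation is a design decision for the reader; the sketch only separates them.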
>> @noun phrases: I'd suggest using "Chunk" annotations
> why? chunks and noun phrases are not the same; for some users (e.g. me) this might
be confusing
Constituents are modeled in a hierarchy in DKPro Core (they have parent/children
references).
Chunks are modelled flat in DKPro Core (they have no such references).
Original issue reported on code.google.com by richard.eckart
on 2014-08-01 12:18:10
>>Constituents are modeled in a hierarchy in DKPro Core (they have a parent/children
references).
>>Chunks are modelled flat in DKPro Core (they have no such references).
ok - but what is actually annotated in the PTB: chunks or noun phrases?
if the "noun phrase" annotation is mapped to DKPro chunks, then information about the
hierarchical structure of noun phrases is lost
Original issue reported on code.google.com by eckle.kohler
on 2014-08-01 12:31:06
"Originally, each of the texts was run through PARTS (Ken Church's
stochastic part-of-speech tagger) or Eric Brill's tagger and then corrected
by a human annotator. The square brackets surrounding phrases in the texts
are the output of a stochastic NP parser that is part of PARTS and are best
ignored."
This is what it looks like in the files:
==================================
[ Local/JJ industry/NN 's/POS investment/NN ]
in/IN
[ Rhode/NNP Island/NNP ]
was/VBD
[ the/DT big/JJ story/NN ]
in/IN
[ 1960/CD 's/POS industrial/JJ development/NN effort/NN ]
./.
==================================
Original issue reported on code.google.com by Tobias.Horsmann
on 2014-08-01 12:34:06
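The layout shown above (square brackets around chunks, token/tag pairs separated by whitespace) can be read with a few lines of plain Java. This is only a sketch under assumptions: the class ChunkedTagParser, its TaggedToken type and the chunkId convention are hypothetical, and it is not the PennTreebankChunkedReader discussed in this issue.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of DKPro Core. */
final class ChunkedTagParser {

    /** One token, its tag, and the bracketed chunk it belongs to. */
    static final class TaggedToken {
        final String text;
        final String tag;
        final int chunkId; // -1 if the token is outside any bracket

        TaggedToken(String text, String tag, int chunkId) {
            this.text = text;
            this.tag = tag;
            this.chunkId = chunkId;
        }
    }

    static List<TaggedToken> parse(List<String> lines) {
        List<TaggedToken> tokens = new ArrayList<>();
        int nextChunkId = 0;
        int currentChunk = -1;
        for (String line : lines) {
            for (String item : line.trim().split("\\s+")) {
                if (item.isEmpty() || item.startsWith("=====")) {
                    continue; // blank line or separator row
                }
                if (item.equals("[")) {
                    currentChunk = nextChunkId++;
                } else if (item.equals("]")) {
                    currentChunk = -1;
                } else {
                    // Split at the last slash so "./." yields text "." and tag ".".
                    int slash = item.lastIndexOf('/');
                    if (slash <= 0) {
                        continue; // not a token/tag pair; skipped in this sketch
                    }
                    tokens.add(new TaggedToken(item.substring(0, slash),
                            item.substring(slash + 1), currentChunk));
                }
            }
        }
        return tokens;
    }
}
```

Tokens sharing a chunkId belong to the same bracketed span, so a flat annotation per bracket can be created from the first and last token of each group.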
in the example you give, Tobias, the things in square brackets are only chunks
- so Richard's suggestion (using Chunk) will be fine
an example of a noun phrase would be
[ Local/JJ industry/NN 's/POS investment/NN in/IN Rhode/NNP Island/NNP ]
Original issue reported on code.google.com by eckle.kohler
on 2014-08-01 12:39:17
Ok, thx for the feedback.
Original issue reported on code.google.com by Tobias.Horsmann
on 2014-08-01 12:49:40
Where should I place the new file? Project: de.tudarmstadt.ukp.dkpro.core.io.penntree-asl
In the same package as the parsing-related classes or open a new package?
Original issue reported on code.google.com by Tobias.Horsmann
on 2014-08-01 14:13:58
Same package. Just don't call it PennTreebankReader ;) That would be the one for the
bracketed structure. Yours should have a different name.
Original issue reported on code.google.com by richard.eckart
on 2014-08-01 14:16:18
Hm.... feel free to make suggestions, seems like my most favored name is not available
:)
How is:
PTB[Chunked]TaggedCorpusReader
Original issue reported on code.google.com by Tobias.Horsmann
on 2014-08-01 14:21:33
PennTreebankChunkedReader?
Original issue reported on code.google.com by richard.eckart
on 2014-08-01 14:22:56
I think I like your name better :)
Original issue reported on code.google.com by Tobias.Horsmann
on 2014-08-01 14:25:38
ok, I committed.
Original issue reported on code.google.com by Tobias.Horsmann
on 2014-08-01 14:40:37
Please merge all the test cases into one class called PennTreebankChunkedReaderTest.
For the parameters, please use the standard parameters from ComponentParameters, e.g.:
/**
* Location of the mapping file for part-of-speech tags to UIMA types.
*/
public static final String PARAM_POS_MAPPING_LOCATION = ComponentParameters.PARAM_POS_MAPPING_LOCATION;
@ConfigurationParameter(name = PARAM_POS_MAPPING_LOCATION, mandatory = false)
protected String posMappingLocation;
For the tests, please use the DKPro Core AssertAnnotations methods, cf. OpenNlpParserTest
and Conll2000ReaderTest.
No need to use PARAM_PATTERNS, you can merge that information into the PARAM_SOURCE_LOCATION
unless you have multiple include/exclude patterns.
For unit tests just write "throws Exception" instead of listing each exception separately.
Original issue reported on code.google.com by richard.eckart
on 2014-08-01 18:53:38
This issue was updated by revision r2673.
- Formatting / cleaning up
Original issue reported on code.google.com by richard.eckart
on 2014-08-02 19:30:28
This issue was updated by revision r2674.
- Some formatting / cleaning up
Original issue reported on code.google.com by richard.eckart
on 2014-08-02 19:36:01
I updated the recent commit messages. Please check them out in the history to see how
they should be written such that they also update the issue with the changes (see the
two auto-generated comments above).
There are still various things to be fixed in the PennTreebankChunkedReader:
https://code.google.com/p/dkpro-core-asl/source/detail?r=2666
Original issue reported on code.google.com by richard.eckart
on 2014-08-02 19:37:57
Ehm where/how do I see what has to be fixed?
Btw., Eclipse uses formatter .xml files that define how code is formatted when the
Eclipse keyboard shortcut is used. You don't use the Eclipse default, do you? Where
do I get the DKPro version of these files?
Original issue reported on code.google.com by Tobias.Horsmann
on 2014-08-02 19:47:33
Follow the link to revision 2666 in the previous comment and check out all the Line-by-line
comments. One of them includes a link to the Eclipse code style file as well.
I'm using Eclipse. I format using the keyboard-shortcut, but often I format only select
parts of a file, not the whole file, because some lines I actually don't like to be
auto-formatted, e.g. when I align parameter/value pairs in createEngineDescription(...)
such that there is one pair per line.
Original issue reported on code.google.com by richard.eckart
on 2014-08-02 21:02:06
Hi Tobias,
direct link to the style xml here (from "Downloads"): https://code.google.com/p/dkpro-core-asl/downloads/detail?name=DKProCoreStyle_20120326.xml&can=2&q=
Original issue reported on code.google.com by eriklan.dodinh
on 2014-08-03 08:23:29
This issue was updated by revision r2675.
- Fixed value of PARAM_TAGSET in test case.
Original issue reported on code.google.com by richard.eckart
on 2014-08-03 21:26:35
ok, I saw you updated files. Is there anything left to do?
Original issue reported on code.google.com by Tobias.Horsmann
on 2014-08-04 06:58:56
Yes - I didn't address many of the comments that I made.
Original issue reported on code.google.com by richard.eckart
on 2014-08-04 07:37:57
Maybe I am looking in the wrong place, but I see nothing. When I look at the code in the browser,
I notice that I can add comments, but I don't see any comments already attached. Where
do I have to look?
Original issue reported on code.google.com by Tobias.Horsmann
on 2014-08-04 07:44:10
If you follow this link:
https://code.google.com/p/dkpro-core-asl/source/detail?r=2666
and you scroll down, you should see a section *Line-by-line comments*.
Original issue reported on code.google.com by richard.eckart
on 2014-08-04 07:46:29
hm, no. I see the section Line-by-line comments, but it says that no comments have been
added yet. Maybe it's a permission problem?
Original issue reported on code.google.com by Tobias.Horsmann
on 2014-08-04 07:55:58
Stupid me... I haven't used the review tool often yet and forgot to actually publish
the review ;) Now you should be able to see them.
Original issue reported on code.google.com by richard.eckart
on 2014-08-04 08:04:02
Ok, I can see them now.
Is there no pre-implemented file-loading code in the other super-class? Do I have to
reimplement the entire file-loading code? What is the benefit of this class, btw.? It
only seems less convenient....
Original issue reported on code.google.com by Tobias.Horsmann
on 2014-08-04 09:46:46
> is there no pre-implemented file-loading code in the other super-class? I do have
to reimplement the entire file loading code? What is the benefit of this class btw.
It seems only less convenient....
You mean in JCasResourceCollectionReader_ImplBase? It extends ResourceCollectionReaderBase
(which has the loading code) but it makes sure that you get a JCas instead of a CAS
in the getNext() method.
Original issue reported on code.google.com by richard.eckart
on 2014-08-04 09:48:24
Of course your class needs to override getNext(JCas aJCas) now instead of getNext(CAS
cas).
Original issue reported on code.google.com by richard.eckart
on 2014-08-04 09:49:02
Ah ok.
Does select(jcas, Token.class) also work if I inherit from JCasResourceCollectionReader_ImplBase?
Seemingly not, the method call is unknown. What's wrong with JCasUtil?
Original issue reported on code.google.com by Tobias.Horsmann
on 2014-08-04 10:15:57
Sure, why shouldn't it work?
Original issue reported on code.google.com by richard.eckart
on 2014-08-04 10:22:29
Do you mean JCasUtil.select or a call to a method select which should have been inherited?
The latter doesn't work.
Original issue reported on code.google.com by Tobias.Horsmann
on 2014-08-04 10:27:01
I mean calling JCasUtil.select. If you turn that into a static import, you can just
call it by "select", e.g.
import static org.apache.uima.fit.util.JCasUtil.select;
for (Sentence sentence : select(aJCas, Sentence.class)) {...
Original issue reported on code.google.com by richard.eckart
on 2014-08-04 10:28:16
Oh ok.
How do I set the mapped UIMA-class if I use JCas instead of CAS?
Original issue reported on code.google.com by Tobias.Horsmann
on 2014-08-04 11:36:19
For this aspect only you get the CAS from the JCas and do it traditionally. Check out
e.g. BncReader.
Original issue reported on code.google.com by richard.eckart
on 2014-08-04 12:08:34
Hm, it's not working. It does not set the mapped UIMA value. Is there no method you can
call that configures that automatically? JCas is a bit easier to use, but these exceptions
nullify the benefits in an instant.
What is wrong with this code? It worked with the ResourceCollectionReader super class,
but under JCasResourceCollectionReader_ImplBase it does not set the mapped value either.
CAS aCAS = aJCas.getCas();
posMappingProvider.configure(aCAS);
// Token
Type tokenType = aCAS.getTypeSystem().getType(Token.class.getName());
AnnotationFS tokenAnno = aCAS.createAnnotation(tokenType, aCurrPosInText,
        aCurrPosInText + aTokenText.length());
aCAS.addFsToIndexes(tokenAnno);
Feature feature = tokenType.getFeatureByBaseName("pos");
// Tag
Type posType = posMappingProvider.getTagType(aTag);
// aCAS.getTypeSystem().getT.getFeatureByBaseName("pos");
AnnotationFS posAnno = aCAS.createAnnotation(posType, aCurrPosInText, aTokenText.length());
posAnno.setStringValue(posType.getFeatureByBaseName("PosValue"), aTag);
aCAS.addFsToIndexes(posAnno);
// Set the POS for the Token
tokenAnno.setFeatureValue(feature, posAnno);
Original issue reported on code.google.com by Tobias.Horsmann
on 2014-08-04 12:28:36
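The snippet above derives annotation offsets from a running position (aCurrPosInText). As a plain-Java sketch of that bookkeeping, with no UIMA involved, the following hypothetical OffsetTracker class builds the document text token by token and records a begin/end pair for each one. The class and its names are assumptions for illustration only. Since annotation offsets are character positions in the document text, the end offset here is always begin plus the token length, which may be worth comparing against the two createAnnotation calls in the snippet.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of DKPro Core or UIMA. */
final class OffsetTracker {

    /** Character span of one token in the assembled text. */
    static final class Span {
        final int begin;
        final int end; // exclusive

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    private final StringBuilder text = new StringBuilder();
    private final List<Span> spans = new ArrayList<>();

    /** Appends a token, separated by one space, and returns its span. */
    Span append(String token) {
        if (text.length() > 0) {
            text.append(' ');
        }
        int begin = text.length();
        text.append(token);
        Span span = new Span(begin, begin + token.length());
        spans.add(span);
        return span;
    }

    String text() {
        return text.toString();
    }

    List<Span> spans() {
        return spans;
    }
}
```

A token and the tag annotation attached to it would typically reuse the same Span, so both cover exactly the token's characters.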
This issue was updated by revision r2679.
- Basic conversion to JCasResourceCollectionReader_ImplBase
Original issue reported on code.google.com by richard.eckart
on 2014-08-04 12:37:45
I have performed the basic conversion to JCasResourceCollectionReader_ImplBase. Please
check out the diffs.
Original issue reported on code.google.com by richard.eckart
on 2014-08-04 12:38:09
This issue was updated by revision r2680.
- Updated formatting
Original issue reported on code.google.com by Tobias.Horsmann
on 2014-08-04 12:43:35
I still don't get what is wrong with the snippet I posted earlier, though... it seems to be pretty
much the same?
Original issue reported on code.google.com by Tobias.Horsmann
on 2014-08-04 12:47:33
Well, I'm not sure what exactly you say is not working and how you determine that it
is not working.
Original issue reported on code.google.com by richard.eckart
on 2014-08-04 13:09:08
Never mind.
Original issue reported on code.google.com by Tobias.Horsmann
on 2014-08-04 14:41:43
Why did you undo the changes that I did to the file?
Original issue reported on code.google.com by richard.eckart
on 2014-08-04 14:53:01
I copied your 'Set the pos correctly' code snippet into my local working copy and then
copied my version over the DKPro one.
What was lost?
Original issue reported on code.google.com by Tobias.Horsmann
on 2014-08-04 14:57:36
This issue was updated by revision r2681.
- Restoring my modifications
Original issue reported on code.google.com by richard.eckart
on 2014-08-04 15:01:19
This issue was updated by revision r2682.
- Copying over missing parameter descriptions from ComponentParameters
Original issue reported on code.google.com by richard.eckart
on 2014-08-04 15:02:03
Original issue reported on code.google.com by
Tobias.Horsmann
on 2014-08-01 11:12:50