dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

PennTreeBank Reader for tagged corpora #439

Closed reckart closed 9 years ago

reckart commented 9 years ago
DKPro does not yet have a reader for the tagged plain-text corpora that come
with the PTB.

Points for discussion:
- The corpora contain noun phrase annotations (in addition to the tags). Is there a type
for annotating noun phrases in DKPro?

- Tokens occasionally have two or more possible part-of-speech tags in case of ambiguity.
How should these be handled? Take only the first one?

- The Switchboard corpus in the PTB additionally marks wrongly tagged words. How should
these be handled? Is there a 'no-tag' attribute value for a UIMA POS type?

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-01 11:12:50

reckart commented 9 years ago
Do you have any idea what this format is officially called? I know the "Penn Treebank" format
only as the bracketed structure.

Original issue reported on code.google.com by richard.eckart on 2014-08-01 11:34:21

reckart commented 9 years ago
>> - corpora contain noun phrase annotations (in addition to the tags), is there a type
to annotate noun phrases in DKPro?

one possibility is to use the type Constituent and set constituentType to nounPhrase

Judith

Original issue reported on code.google.com by eckle.kohler on 2014-08-01 11:40:07

reckart commented 9 years ago
@noun phrases: I'd suggest using "Chunk" annotations

@multiple POS tags: I'd take only the first one

@no-tag: there is no 'no tag' value, but I think there is a . You could simply not
create a POS annotation for those tokens. This might cause problems with downstream components
that expect all tokens to have a POS. You could consider running a POS tagger that
accepts partially pre-tagged tokens to fill in the missing tags. In principle, TreeTagger could
do that, but I believe the DKPro Core TreeTagger component does not handle partially pre-tagged
documents.

Original issue reported on code.google.com by richard.eckart on 2014-08-01 11:45:53
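
The "take only the first one" suggestion above can be sketched in plain Java, without any UIMA dependency. This is only an illustration: the class name `TaggedItem` is hypothetical and not part of DKPro Core, and it assumes that alternative tags are separated by `|` (e.g. `NN|VB`) and that a slash inside a token is escaped as `\/`. Neither detail is stated in this thread.

```java
/**
 * Hypothetical helper (not part of DKPro Core): splits one "token/TAG" item
 * and keeps only the first of several alternative tags.
 */
final class TaggedItem {
    final String token;
    final String tag;

    private TaggedItem(String token, String tag) {
        this.token = token;
        this.tag = tag;
    }

    static TaggedItem parse(String item) {
        // Split at the LAST slash so that a token containing an escaped
        // slash (assumed form: "1\/2/CD") keeps its text intact.
        int slash = item.lastIndexOf('/');
        if (slash <= 0 || slash == item.length() - 1) {
            throw new IllegalArgumentException("Not a token/tag item: " + item);
        }
        String token = item.substring(0, slash).replace("\\/", "/");
        String tags = item.substring(slash + 1);

        // Assumption: alternatives are separated by '|'; keep only the first.
        int bar = tags.indexOf('|');
        String first = bar <= 0 ? tags : tags.substring(0, bar);
        return new TaggedItem(token, first);
    }
}
```

With these assumptions, `TaggedItem.parse("design/NN|VB")` yields the token `design` with the tag `NN`, and the alternative `VB` is discarded.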

reckart commented 9 years ago
> @noun phrases: I'd suggest using "Chunk" annotations

why?
chunks and noun phrases are not the same;
for some users (e.g. me) this might be confusing

Original issue reported on code.google.com by eckle.kohler on 2014-08-01 11:56:59

reckart commented 9 years ago
I don't know the name of the format, sorry. I don't think it has a name; it's forward-slash
separated token/tag plain-text annotation. The NP marking in brackets was, I assume,
the reason why they added so many line breaks: to make the text file easier to
process.

Here is a new case: they also annotated whether a word was misspelled and, if so, added
the tag it should have had if it had been written correctly, as in this example:
the/DT students/^NNS^POS parents/NNS

The missing ' caused the NNS tag, but it should have been students' and thus POS as the tag.

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-01 12:12:06

reckart commented 9 years ago
>> @noun phrases: I'd suggest using "Chunk" annotations

> why? chunks and noun phrases are not the same; for some users (e.g. me) this might
be confusing

Constituents are modeled as a hierarchy in DKPro Core (they have parent/children
references).
Chunks are modeled flat in DKPro Core (they have no such references).

Original issue reported on code.google.com by richard.eckart on 2014-08-01 12:18:10
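
To make the flat versus hierarchical distinction concrete, here is a small plain-Java sketch. The classes `FlatSpan` and `TreeSpan` are invented for illustration only; they are not the DKPro Core `Chunk` and `Constituent` types and merely mirror the one difference described above, namely the presence or absence of parent/children references.

```java
import java.util.ArrayList;
import java.util.List;

/** Flat span: knows only its offsets and label (illustrative only). */
final class FlatSpan {
    final int begin;
    final int end;
    final String label;

    FlatSpan(int begin, int end, String label) {
        this.begin = begin;
        this.end = end;
        this.label = label;
    }
}

/** Hierarchical span: additionally links to a parent and to children (illustrative only). */
final class TreeSpan {
    final int begin;
    final int end;
    final String label;
    TreeSpan parent;
    final List<TreeSpan> children = new ArrayList<TreeSpan>();

    TreeSpan(int begin, int end, String label) {
        this.begin = begin;
        this.end = end;
        this.label = label;
    }

    /** Attaches a child and sets its back-reference to this node. */
    TreeSpan addChild(TreeSpan child) {
        child.parent = this;
        children.add(child);
        return this;
    }

    /** Number of ancestors above this node; 0 for the root. */
    int depth() {
        int d = 0;
        for (TreeSpan n = parent; n != null; n = n.parent) {
            d++;
        }
        return d;
    }
}
```

A nested phrase can only be represented with the second kind of object: an outer `TreeSpan` labeled NP can hold a PP child that in turn holds another NP, whereas a sequence of `FlatSpan` objects records no such nesting.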

reckart commented 9 years ago
>>Constituents are modeled as a hierarchy in DKPro Core (they have parent/children
references).
>>Chunks are modeled flat in DKPro Core (they have no such references).

ok - but what is actually annotated in the PTB: chunks or noun phrases?

if the "noun phrase" annotation is mapped to DKPro chunks, then information about the
hierarchical structure of noun phrases is lost

Original issue reported on code.google.com by eckle.kohler on 2014-08-01 12:31:06

reckart commented 9 years ago
"Originally, each of the texts was run through PARTS (Ken Church's
stochastic part-of-speech tagger) or Eric Brill's tagger and then corrected
by a human annotator.  The square brackets surrounding phrases in the texts
are the output of a stochastic NP parser that is part of PARTS and are best
ignored."

This is what it looks like in the files:
==================================

[ Local/JJ industry/NN 's/POS investment/NN ]
in/IN 
[ Rhode/NNP Island/NNP ]
was/VBD 
[ the/DT big/JJ story/NN ]
in/IN 
[ 1960/CD 's/POS  industrial/JJ development/NN effort/NN ]
./. 
==================================

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-01 12:34:06
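
A minimal sketch of how the bracketed token/tag layout shown above could be read, in plain Java with no UIMA dependency. The names `ChunkedLineParser`, `Item` and `group` are made up for illustration and are not taken from the actual reader implementation. The sketch assumes that `[` and `]` always appear as standalone, whitespace-separated symbols and that separator lines contain no slash.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative parser for "[ tok/TAG tok/TAG ]" text; not the DKPro Core reader. */
final class ChunkedLineParser {

    /** One token with its tag and the index of its bracket group (-1 = outside any bracket). */
    static final class Item {
        final String token;
        final String tag;
        final int group;

        Item(String token, String tag, int group) {
            this.token = token;
            this.tag = tag;
            this.group = group;
        }
    }

    static List<Item> parse(String text) {
        List<Item> items = new ArrayList<Item>();
        boolean open = false; // are we currently inside "[ ... ]"?
        int group = -1;       // index of the most recently opened bracket group
        for (String part : text.trim().split("\\s+")) {
            if (part.isEmpty()) {
                continue;
            }
            if (part.equals("[")) {
                open = true;
                group++;
            } else if (part.equals("]")) {
                open = false;
            } else {
                // Split at the last slash; parts without a tag (e.g. "=====") are skipped.
                int slash = part.lastIndexOf('/');
                if (slash <= 0) {
                    continue;
                }
                items.add(new Item(part.substring(0, slash),
                        part.substring(slash + 1), open ? group : -1));
            }
        }
        return items;
    }
}
```

Tokens that share a non-negative `group` value belong to the same bracketed span, which a reader could then turn into one flat annotation covering those tokens.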

reckart commented 9 years ago
in the example you give, Tobias, the things in square brackets are only chunks 
- so Richard's suggestion (using Chunk) will be fine

an example of a noun phrase would be
[ Local/JJ industry/NN 's/POS investment/NN in/IN  Rhode/NNP Island/NNP ]

Original issue reported on code.google.com by eckle.kohler on 2014-08-01 12:39:17

reckart commented 9 years ago
Ok, thx for the feedback. 

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-01 12:49:40

reckart commented 9 years ago
Where should I place the new file? Project: de.tudarmstadt.ukp.dkpro.core.io.penntree-asl
Should it go in the same package as the parsing-related classes, or should I open a new package?

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-01 14:13:58

reckart commented 9 years ago
Same package. Just don't call it PennTreebankReader ;) That would be the one for the
bracketed structure. Yours should have a different name.

Original issue reported on code.google.com by richard.eckart on 2014-08-01 14:16:18

reckart commented 9 years ago
Hm... feel free to make suggestions; it seems my favorite name is not available :)

How about:

PTB[Chunked]TaggedCorpusReader

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-01 14:21:33

reckart commented 9 years ago
PennTreebankChunkedReader?

Original issue reported on code.google.com by richard.eckart on 2014-08-01 14:22:56

reckart commented 9 years ago
I think I like your name better :)

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-01 14:25:38

reckart commented 9 years ago
ok, I committed. 

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-01 14:40:37

reckart commented 9 years ago
Please merge all the test cases into one class called PennTreebankChunkedReaderTest.

For the parameters, please use the standard parameters from ComponentParameters, e.g.:

    /**
     * Location of the mapping file for part-of-speech tags to UIMA types.
     */
    public static final String PARAM_POS_MAPPING_LOCATION = ComponentParameters.PARAM_POS_MAPPING_LOCATION;
    @ConfigurationParameter(name = PARAM_POS_MAPPING_LOCATION, mandatory = false)
    protected String posMappingLocation;

For the tests, please use the DKPro Core AssertAnnotations methods, cf. OpenNlpParserTest
and Conll2000ReaderTest.

No need to use PARAM_PATTERNS, you can merge that information into the PARAM_SOURCE_LOCATION
unless you have multiple include/exclude patterns.

For unit tests just write "throws Exception" instead of listing each exception separately.

Original issue reported on code.google.com by richard.eckart on 2014-08-01 18:53:38

reckart commented 9 years ago
This issue was updated by revision r2673.

- Formatting / cleaning up

Original issue reported on code.google.com by richard.eckart on 2014-08-02 19:30:28

reckart commented 9 years ago
This issue was updated by revision r2674.

- Some formatting / cleaning up

Original issue reported on code.google.com by richard.eckart on 2014-08-02 19:36:01

reckart commented 9 years ago
I updated the recent commit messages. Please check them out in the history to see how
they should be written such that they also update the issue with the changes (see the
two auto-generated comments above).

There are still various things to be fixed in the PennTreebankChunkedReader:

https://code.google.com/p/dkpro-core-asl/source/detail?r=2666

Original issue reported on code.google.com by richard.eckart on 2014-08-02 19:37:57

reckart commented 9 years ago
Ehm where/how do I see what has to be fixed?

By the way, Eclipse uses a formatter .xml file that defines how code is formatted when the
Eclipse keyboard shortcut is used. You don't use the Eclipse default, do you? Where
do I get the DKPro version of this file?

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-02 19:47:33

reckart commented 9 years ago
Follow the link to revision 2666 in the previous comment and check out all the Line-by-line
comments. One of them includes a link to the Eclipse code style file as well.

I'm using Eclipse. I format using the keyboard-shortcut, but often I format only select
parts of a file, not the whole file, because some lines I actually don't like to be
auto-formatted, e.g. when I align parameter/value pairs in createEngineDescription(...)
such that there is one pair per line.

Original issue reported on code.google.com by richard.eckart on 2014-08-02 21:02:06

reckart commented 9 years ago
Hi Tobias,
direct link to the style xml here (from "Downloads"): https://code.google.com/p/dkpro-core-asl/downloads/detail?name=DKProCoreStyle_20120326.xml&can=2&q=

Original issue reported on code.google.com by eriklan.dodinh on 2014-08-03 08:23:29

reckart commented 9 years ago
This issue was updated by revision r2675.

- Fixed value of PARAM_TAGSET in test case.

Original issue reported on code.google.com by richard.eckart on 2014-08-03 21:26:35

reckart commented 9 years ago
ok, I saw you updated files. Is there anything left to do? 

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-04 06:58:56

reckart commented 9 years ago
Yes - I didn't address many of the comments that I made.

Original issue reported on code.google.com by richard.eckart on 2014-08-04 07:37:57

reckart commented 9 years ago
Maybe I am looking in the wrong place, but I see nothing. When I look at the code in the browser,
I notice that I can add comments, but I don't see any comments already attached. Where
do I have to look?

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-04 07:44:10

reckart commented 9 years ago
If you follow this link:

https://code.google.com/p/dkpro-core-asl/source/detail?r=2666

and you scroll down, you should see a section *Line-by-line comments*.

Original issue reported on code.google.com by richard.eckart on 2014-08-04 07:46:29

reckart commented 9 years ago
Hm, no. I see the section Line-by-line comments, but it says that no comments have been
added yet. Maybe it's a permission problem?

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-04 07:55:58

reckart commented 9 years ago
Stupid me... I haven't used the review tool often yet and forgot to actually publish
the review ;) Now you should be able to see them.

Original issue reported on code.google.com by richard.eckart on 2014-08-04 08:04:02

reckart commented 9 years ago
Ok, I can see them now.

Is there no pre-implemented file-loading code in the other superclass? Do I have to
reimplement the entire file-loading code? What is the benefit of this class, by the way? It
only seems less convenient...

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-04 09:46:46

reckart commented 9 years ago
> Is there no pre-implemented file-loading code in the other superclass? Do I have
to reimplement the entire file-loading code? What is the benefit of this class, by the way?
It only seems less convenient...

You mean in JCasResourceCollectionReader_ImplBase? It extends ResourceCollectionReaderBase
(which has the loading code) but it makes sure that you get a JCas instead of a CAS
in the getNext() method.

Original issue reported on code.google.com by richard.eckart on 2014-08-04 09:48:24

reckart commented 9 years ago
Of course your class needs to override getNext(JCas aJCas) now instead of getNext(CAS
cas).

Original issue reported on code.google.com by richard.eckart on 2014-08-04 09:49:02

reckart commented 9 years ago
Ah ok.

Does select(jcas, Token.class) also work if I inherit from JCasResourceCollectionReader_ImplBase?
Seemingly not, the method call is unknown. What's wrong with JCasUtil?

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-04 10:15:57

reckart commented 9 years ago
Sure, why shouldn't it work?

Original issue reported on code.google.com by richard.eckart on 2014-08-04 10:22:29

reckart commented 9 years ago
Do you mean JCasUtil.select or a call to a method select which should have been inherited?
The latter doesn't work. 

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-04 10:27:01

reckart commented 9 years ago
I mean calling JCasUtil.select. If you turn that into a static import, you can just
call it as "select", e.g.

import static org.apache.uima.fit.util.JCasUtil.select;

for (Sentence sentence : select(aJCas, Sentence.class)) { ... }

Original issue reported on code.google.com by richard.eckart on 2014-08-04 10:28:16

reckart commented 9 years ago
Oh ok. 

How do I set the mapped UIMA-class if I use JCas instead of CAS?

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-04 11:36:19

reckart commented 9 years ago
For this aspect only you get the CAS from the JCas and do it traditionally. Check out
e.g. BncReader.

Original issue reported on code.google.com by richard.eckart on 2014-08-04 12:08:34

reckart commented 9 years ago
Hm, it's not working. It does not set the mapped UIMA value. Is there no method you can
call that configures this automatically? JCas is a bit easier to use, but these exceptions
nullify its benefits in an instant.

What is wrong with this code? It worked with the ResourceCollectionReader superclass,
but under JCasResourceCollectionReader_ImplBase it does not set the mapped value.

        CAS aCAS = aJCas.getCas();
        posMappingProvider.configure(aCAS);

        // Token
        Type tokenType = aCAS.getTypeSystem().getType(Token.class.getName());
        AnnotationFS tokenAnno = aCAS.createAnnotation(tokenType, aCurrPosInText, aTokenText.length()
                + aCurrPosInText);
        aCAS.addFsToIndexes(tokenAnno);

        Feature feature = tokenType.getFeatureByBaseName("pos");

        // Tag
        Type posType = posMappingProvider.getTagType(aTag);
        // aCAS.getTypeSystem().getT.getFeatureByBaseName("pos");
        AnnotationFS posAnno = aCAS.createAnnotation(posType, aCurrPosInText, aTokenText.length());
        posAnno.setStringValue(posType.getFeatureByBaseName("PosValue"), aTag);
        aCAS.addFsToIndexes(posAnno);

        // Set the POS for the Token
        tokenAnno.setFeatureValue(feature, posAnno); 

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-04 12:28:36

reckart commented 9 years ago
This issue was updated by revision r2679.

- Basic conversion to JCasResourceCollectionReader_ImplBase

Original issue reported on code.google.com by richard.eckart on 2014-08-04 12:37:45

reckart commented 9 years ago
I have performed the basic conversion to JCasResourceCollectionReader_ImplBase. Please
check out the diffs.

Original issue reported on code.google.com by richard.eckart on 2014-08-04 12:38:09

reckart commented 9 years ago
This issue was updated by revision r2680.

- Updated formatting

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-04 12:43:35

reckart commented 9 years ago
I still don't get what is wrong with the snippet I posted earlier, though... it seems to be pretty
much the same?

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-04 12:47:33

reckart commented 9 years ago
Well, I'm not sure what exactly you say is not working and how you determine that it
is not working.

Original issue reported on code.google.com by richard.eckart on 2014-08-04 13:09:08

reckart commented 9 years ago
Never mind. 

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-04 14:41:43

reckart commented 9 years ago
Why did you undo the changes that I did to the file?

Original issue reported on code.google.com by richard.eckart on 2014-08-04 14:53:01

reckart commented 9 years ago
I copied your 'Set the pos correctly' code snippet into my local working copy and then
copied my version over the DKPro one.
What was lost?

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-04 14:57:36

reckart commented 9 years ago
This issue was updated by revision r2681.

- Restoring my modifications

Original issue reported on code.google.com by richard.eckart on 2014-08-04 15:01:19

reckart commented 9 years ago
This issue was updated by revision r2682.

- Copying over missing parameter descriptions from ComponentParameters

Original issue reported on code.google.com by richard.eckart on 2014-08-04 15:02:03