aschaeffer / dkpro-core-asl

Automatically exported from code.google.com/p/dkpro-core-asl

BerkeleyParser fails on non-standard punctuation #411

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

Unit test:

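// Imports assumed for DKPro Core 1.6.x with uimaFIT 2.x and JUnit 4
import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.cas.CAS;
import org.apache.uima.fit.component.CasDumpWriter;
import org.apache.uima.fit.factory.TypeSystemDescriptionFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import org.apache.uima.util.CasCreationUtils;
import org.junit.Before;
import org.junit.Test;

import de.tudarmstadt.ukp.dkpro.core.berkeleyparser.BerkeleyParser;
import de.tudarmstadt.ukp.dkpro.core.languagetool.LanguageToolSegmenter;
import de.tudarmstadt.ukp.dkpro.core.maltparser.MaltParser;

public class MultiplePunctuationTest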
{

    private AnalysisEngineDescription pipeline;
    private CAS cas;

    @Before
    public void setUp()
            throws Exception
    {
        pipeline = createEngineDescription(
                // Token, Sentence
                createEngineDescription(LanguageToolSegmenter.class),
                // Constituent, POS
                createEngineDescription(BerkeleyParser.class),
                // Dependency
                createEngineDescription(MaltParser.class),
                // Dump
                createEngineDescription(CasDumpWriter.class)
        );
        cas = CasCreationUtils
                .createCas(TypeSystemDescriptionFactory.createTypeSystemDescription(), null, null);
        cas.setDocumentLanguage("en");
    }

    @Test
    public void testSinglePunctuation()
            throws Exception
    {
        cas.setDocumentText("How are you ?");

        try {
            SimplePipeline.runPipeline(cas, pipeline);
        }
        catch (Exception ex) {
            ex.printStackTrace();
            throw ex;
        }
    }

    @Test
    public void testMultiplePunctuation()
            throws Exception
    {
        cas.setDocumentText("How are you ????????????");

        try {
            SimplePipeline.runPipeline(cas, pipeline);
        }
        catch (Exception ex) {
            ex.printStackTrace();
            throw ex;
        }

    }
}

What is the expected output? What do you see instead?

- testMultiplePunctuation() fails:

Caused by: java.lang.NullPointerException
    at de.tudarmstadt.ukp.dkpro.core.maltparser.MaltParser.process(MaltParser.java:293)

due to

Jun 25, 2014 2:25:48 PM de.tudarmstadt.ukp.dkpro.core.berkeleyparser.BerkeleyParser process(266)
Warnung: Unable to parse sentence: [How are you ????????????]

What version of the product are you using? On what operating system?

- DKPro Core ASL 1.6.0, also tested BerkeleyParser 1.6.1, 1.6.2-SNAPSHOT

Please provide any additional information below.

- One option would be to normalize the data in advance, but we do not want to 
lose any information. What would be the best solution/workaround?
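
A minimal sketch of what lossless normalization could look like, assuming the problem is limited to runs of repeated '?' or '!' characters (not confirmed in this thread). The class and method names below are hypothetical and not part of DKPro Core: each run is collapsed to a single character before parsing, and the original run is recorded so the text can be restored afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of DKPro Core: collapses punctuation runs reversibly. */
public class PunctuationNormalizer
{
    /** One collapsed run: where the kept character sits, and what the run originally was. */
    public static final class CollapsedRun
    {
        public final int normalizedOffset; // offset of the kept character in the normalized text
        public final String original;      // the full original run, e.g. "????"

        CollapsedRun(int normalizedOffset, String original)
        {
            this.normalizedOffset = normalizedOffset;
            this.original = original;
        }
    }

    /** Normalized text plus everything needed to undo the normalization. */
    public static final class Result
    {
        public final String text;
        public final List<CollapsedRun> runs;

        Result(String text, List<CollapsedRun> runs)
        {
            this.text = text;
            this.runs = runs;
        }
    }

    // Assumption: only runs of two or more identical '?' or '!' need collapsing.
    private static final Pattern RUN = Pattern.compile("([?!])\\1+");

    /** Collapses each run to its first character and remembers the original run. */
    public static Result normalize(String input)
    {
        StringBuilder out = new StringBuilder();
        List<CollapsedRun> runs = new ArrayList<>();
        Matcher m = RUN.matcher(input);
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            runs.add(new CollapsedRun(out.length(), m.group()));
            out.append(m.group(1));
            last = m.end();
        }
        out.append(input, last, input.length());
        return new Result(out.toString(), runs);
    }

    /** Rebuilds the original text from the normalized text and the recorded runs. */
    public static String restore(Result result)
    {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (CollapsedRun run : result.runs) {
            out.append(result.text, last, run.normalizedOffset);
            out.append(run.original);
            last = run.normalizedOffset + 1; // skip the single kept character
        }
        out.append(result.text, last, result.text.length());
        return out.toString();
    }
}
```

With this sketch, normalize("How are you ????????????") yields the text "How are you ?", and restore(...) on that result returns the original string. Mapping parser annotations back onto the original offsets would still need extra work that is not shown here.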

Original issue reported on code.google.com by ivan.hab...@gmail.com on 25 Jun 2014 at 12:35

GoogleCodeExporter commented 9 years ago
I think this is rather an issue with the BerkeleyParser implementation, not 
with DKPro Core.

MaltParser chokes because it finds no POS tags. I'd recommend using a separate 
POS tagger and setting PARAM_WRITE_POS on BerkeleyParser to "false". Some 
people have observed that running a POS tagger separately can actually yield 
better results anyway. Note, though, that unlike the StanfordParser component, 
the BerkeleyParser component currently does not support using pre-existing POS 
tags produced by a separate tagger (I do not know if the upstream Berkeley 
parser code supports this).
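
The failure chain described here (a sentence without POS tags reaching a component that requires them) can also be illustrated with a defensive check. The following is only a sketch in plain Java: Tok and PosGuard are hypothetical stand-ins, not UIMA or DKPro Core types, and this is not how MaltParser itself handles the case.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical guard, not part of DKPro Core: filters out sentences lacking POS tags. */
public class PosGuard
{
    /** Stand-in for a token; posTag is null when no tag was produced upstream. */
    public static final class Tok
    {
        public final String text;
        public final String posTag;

        public Tok(String text, String posTag)
        {
            this.text = text;
            this.posTag = posTag;
        }
    }

    /** Returns true only if the sentence is non-empty and every token has a POS tag. */
    public static boolean hasCompletePos(List<Tok> sentence)
    {
        if (sentence.isEmpty()) {
            return false;
        }
        for (Tok t : sentence) {
            if (t.posTag == null) {
                return false;
            }
        }
        return true;
    }

    /** Keeps only the sentences that are safe to hand to a POS-dependent component. */
    public static List<List<Tok>> parseable(List<List<Tok>> sentences)
    {
        List<List<Tok>> result = new ArrayList<>();
        for (List<Tok> sentence : sentences) {
            if (hasCompletePos(sentence)) {
                result.add(sentence);
            }
        }
        return result;
    }
}
```

Skipping a sentence this way trades a crash for a missing dependency analysis, so it would only be appropriate where partial output is acceptable.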

Regarding the underlying problem with the BerkeleyParser, I'd suggest reporting 
this as an upstream issue. Maybe we are using the API incorrectly. If we knew 
what exactly the issue was, maybe we could implement a workaround in DKPro 
Core, but I'd actually prefer updating to a newer upstream version. I do 
believe, though, that upstream is not really actively maintained... still worth 
a try.

Original comment by richard.eckart on 25 Jun 2014 at 12:45

GoogleCodeExporter commented 9 years ago
The DKPro Core BerkeleyParser component internally uses the 
CoarseToFineMaxRuleParser from the Berkeley package. The package appears to 
include other parsers as well: 

CoarseToFineMaxRuleDerivationParser
CoarseToFineMaxRuleProductParser
CoarseToFineNBestParser
CoarseToFineTwoChartsParser
ConstrainedTwoChartsParser
ConstrainedHierarchicalTwoChartParser

I don't know if the models are compatible with all of them or only with a 
specific parser. Might be worth investigating. Maybe another parser does not 
have the problem of returning no result on certain sentences.

Original comment by richard.eckart on 25 Jun 2014 at 12:50

GoogleCodeExporter commented 9 years ago
Hi Richard,

thanks for your reply. So as a quick workaround, enhancing the pipeline to

...
createEngineDescription(BerkeleyParser.class, BerkeleyParser.PARAM_WRITE_POS, false),
createEngineDescription(StanfordPosTagger.class),
...

did the trick.

Looking at the BerkeleyParser googlecode project, the last commit is from 2012, 
so it seems to have been dead for a while...

Best,

Ivan

Original comment by ivan.hab...@gmail.com on 25 Jun 2014 at 1:43

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 28 Jul 2014 at 9:46