dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

Segmentation options in CoreNLP (Old API) #1217

Closed Jibun closed 6 years ago

Jibun commented 6 years ago

It would be convenient to be able to pass options to the CoreNLP segmenter, a feature that Stanford CoreNLP itself already provides. For instance: splitAll=true, ptb3Escaping=false, etc.
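For context, Stanford CoreNLP itself accepts tokenizer options as a single comma-separated string via its `tokenize.options` property. A minimal sketch in plain Java of what such a configuration looks like; the property names follow CoreNLP's conventions, but the class and method here are illustrative only:

```java
import java.util.Properties;

public class SegmenterOptionsExample {
    // Builds a CoreNLP-style configuration. "tokenize.options" is the
    // property CoreNLP uses to receive tokenizer options as one
    // comma-separated string.
    static Properties segmenterProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit");
        props.setProperty("tokenize.options", "splitAll=true,ptb3Escaping=false");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(segmenterProps().getProperty("tokenize.options"));
    }
}
```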

Jibun commented 6 years ago

Our proposal (UPF-TALN) is to allow the user to specify all the options as a single String parameter of the CoreNLP (Old API) wrapper (in the same fashion as the CoreNLP CLI itself).

reckart commented 6 years ago

IMHO it would be much nicer to expose the properties as individual parameters, because then the user of the component can easily see which parameters exist, both in the component metadata and via auto-completion in an IDE.

What do you think?

reckart commented 6 years ago

Btw. in the new segmenter (CoreNlpSegmenter - not the "Old API" one, which would be StanfordSegmenter), adding an "extra parameters" parameter would be pretty simple. This could be used to funnel in additional settings not exposed by the component as parameters. Still, I think such a parameter should only be used as a last resort; instead, the parameters of the CoreNLP segmenter should be exposed as component parameters (as suggested above).

Jibun commented 6 years ago

We don't think it is a good idea to add a component parameter for each option. The reason is that CoreNLP has plenty of options (more than 20), and some of them are language-specific (e.g. splitAll for Spanish), so it could get really messy. Plus, CoreNLP itself expects a single string with the options separated by commas.
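A single String parameter on the wrapper could be forwarded to CoreNLP as-is, or split into key/value pairs if the wrapper ever needs to inspect individual options. A hypothetical sketch of such splitting (illustrative only, not DKPro Core code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OptionStringParser {
    // Splits a CoreNLP-style option string such as
    // "splitAll=true,ptb3Escaping=false" into key/value pairs.
    // A bare key without "=value" is treated as a boolean flag.
    static Map<String, String> parse(String options) {
        Map<String, String> result = new LinkedHashMap<>();
        if (options == null || options.isEmpty()) {
            return result;
        }
        for (String pair : options.split(",")) {
            String[] kv = pair.trim().split("=", 2);
            result.put(kv[0], kv.length > 1 ? kv[1] : "true");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("splitAll=true,ptb3Escaping=false"));
    }
}
```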

reckart commented 6 years ago

@Jibun I agree: if there are language-specific parameters, it may not be very sensible to have them as first-class component parameters.

Jibun commented 6 years ago

@reckart Spanish default segmentation options include clitic splitting, but proper clitic segmentation requires the Stanford CoreNLP Spanish models, which are currently not included as dependencies in the POM.

I personally think it would make more sense for the models to be included in the DKPro Core CoreNLP integration than for the user to have to include them externally, since there is no hint that this dependency is needed.

What do you think?

reckart commented 6 years ago

Is that another segmenter model for Spanish, like the one we are trying to integrate for Arabic?

Jibun commented 6 years ago

@reckart It's not exactly the same. If the dependency is provided:

<dependency> 
    <groupId>edu.stanford.nlp</groupId> 
    <artifactId>stanford-corenlp</artifactId> 
    <version>${corenlp.version}</version> 
    <classifier>models-spanish</classifier> 
</dependency>

it automatically detects extra clitics and segments them. There is no need to load the model from the code in any way. Actually, I see that this dependency is already in the POM, but it is commented out.

reckart commented 6 years ago

Hm. I would like to avoid depending directly on the Stanford model JARs, because that would require every user to download them whether or not they use the respective language.
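One Maven mechanism that could reconcile both points is marking the model artifact as optional in the wrapper's POM: the wrapper can be built and tested against it, but it is not pulled in transitively, so users who need Spanish declare the dependency themselves. A sketch of the idea, not necessarily what the project chose:

```xml
<!-- Sketch: optional model dependency; downstream users who need
     Spanish segmentation must declare this dependency themselves. -->
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>${corenlp.version}</version>
    <classifier>models-spanish</classifier>
    <optional>true</optional>
</dependency>
```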

Jibun commented 6 years ago

Yeah, it seems legit.

reckart commented 6 years ago

@Jibun I guess this issue can be closed now the PR has been merged?

Jibun commented 6 years ago

@reckart indeed

Jibun commented 6 years ago

The discussed changes have been merged.