Jibun closed this issue 6 years ago
Our proposal (UPF-TALN) is to allow the user to specify all the options as a single String parameter of the CoreNLP (Old API) wrapper (in the same fashion as the CoreNLP CLI itself).
IMHO it would be much nicer to expose the properties as individual parameters because then the user of the component can easily see what parameters exist in the component metadata and when doing auto-complete in an IDE.
What do you think?
Btw. in the new segmenter (CoreNlpSegmenter, not the "Old API" one, which would be StanfordSegmenter), adding an "extra parameters" parameter would be pretty simple. This could be used to funnel in additional settings not exposed by the component as parameters. Still, I think such a parameter should be used only as a last resort, and the parameters of the CoreNLP segmenter should instead be exposed as component parameters (as suggested above).
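To make the "last resort" idea concrete, here is a minimal sketch of how settings exposed as component parameters could be combined with a catch-all set of extra settings. The class and method names (`SegmenterSettings`, `merge`) are hypothetical and not part of DKPro Core or CoreNLP, and the rule that an exposed parameter wins over an extra one is an assumption of this sketch, not something decided in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SegmenterSettings {
    /**
     * Combines settings coming from first-class component parameters with
     * settings funneled in through a catch-all "extra parameters" value.
     * Assumption of this sketch: an exposed parameter always takes precedence,
     * so extra settings only fill in keys that are not already set.
     */
    public static Map<String, String> merge(Map<String, String> exposed,
                                            Map<String, String> extra) {
        Map<String, String> merged = new LinkedHashMap<>(exposed);
        for (Map.Entry<String, String> entry : extra.entrySet()) {
            merged.putIfAbsent(entry.getKey(), entry.getValue());
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> exposed = new LinkedHashMap<>();
        exposed.put("ptb3Escaping", "false");

        Map<String, String> extra = new LinkedHashMap<>();
        extra.put("ptb3Escaping", "true"); // ignored: already set explicitly
        extra.put("splitAll", "true");     // kept: not exposed as a parameter

        System.out.println(merge(exposed, extra));
    }
}
```

With this precedence, the extra settings can never silently override something the user set through a documented parameter.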
We don't think it is a good idea to add a component parameter for each option. The reason is that CoreNLP has plenty of options (more than 20) and some of them are language specific (e.g. SplitAll for Spanish) so it could get really messy. Plus CoreNLP itself expects a single string with the options separated by commas.
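For illustration, a minimal sketch of what handling such a single options string could look like on the wrapper side, using the option names mentioned in this issue. `TokenizerOptions` is a hypothetical helper, not part of CoreNLP or DKPro Core, and treating a bare key as `true` is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TokenizerOptions {
    /**
     * Parses a comma-separated "key=value" string into an ordered map.
     * Assumption of this sketch: a key without "=value" is read as "true".
     */
    public static Map<String, String> parse(String options) {
        Map<String, String> result = new LinkedHashMap<>();
        if (options == null || options.trim().isEmpty()) {
            return result;
        }
        for (String entry : options.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas
            }
            int eq = trimmed.indexOf('=');
            if (eq < 0) {
                result.put(trimmed, "true");
            } else {
                result.put(trimmed.substring(0, eq).trim(),
                           trimmed.substring(eq + 1).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {splitAll=true, ptb3Escaping=false}
        System.out.println(parse("splitAll=true, ptb3Escaping=false"));
    }
}
```

A component that only passes the string through to CoreNLP would not need this parsing. It is useful mainly if the wrapper wants to validate or log the individual options first.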
@Jibun I agree, if there are parameters that are language specific, it may not be very sensible to have them as first-class component parameters.
@reckart Spanish default segmentation options include clitic splitting, but proper clitic segmentation requires the Stanford CoreNLP Spanish models, which are not included as dependencies in the pom right now.
I personally think it would make more sense for the models to be included in DKPro CoreNLP than for the user to have to include them externally, since there is no hint that this dependency is needed.
What do you think?
Is that another segmenter model for Spanish like the one we are trying to integrate for Arabic?
@reckart It is not exactly the same. If the dependency is provided:
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>${corenlp.version}</version>
  <classifier>models-spanish</classifier>
</dependency>
it automatically detects extra clitics and segments them. There is no need to load the model from the code in any way. Actually, I see that this dependency is already in the pom, but it is commented out.
Hm. I would like to avoid depending directly on the Stanford model JARs because that would require that every user has to download them whether they use the respective language or not.
Yeah, it seems legit.
@Jibun I guess this issue can be closed now the PR has been merged?
@reckart indeed
Discussed changes merged
It would be convenient to be able to specify options for the CoreNLP Segmenter, a feature that Stanford CoreNLP itself already provides. For instance: splitAll=true, ptb3Escaping=false, etc.