databricks / spark-corenlp

Stanford CoreNLP wrapper for Apache Spark
GNU General Public License v3.0

Configuration and new API functions #31

Closed: auroredea closed this 5 years ago

auroredea commented 6 years ago

Hello, we (with @nekonyuu) added some new features:

Ability to request other features of the CoreNLP Simple API
Ability to pass a configuration for those features (relevant for other languages or for passing new models); a sketch of the resulting call sites is below
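A rough sketch of what call sites could look like with these changes, assuming a CoreNLPConfig wrapper around java.util.Properties as used later in this thread (the function names follow the existing spark-corenlp wrappers; the exact signatures in this PR may differ):

import java.util.Properties

// Hypothetical call sites: each wrapper function can take a CoreNLPConfig, so the
// same column can be processed with different models or languages.
val config = CoreNLPConfig(new Properties())  // wraps the CoreNLP properties to use
val lemmas = lemma(config)                    // other Simple API features...
val tags   = ner(config)                      // ...share the same config mechanism
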
mengxr commented 6 years ago

@auroredea Could you show me some use cases? Do you usually need a per-method config, or is it usually a global config? We also need to add tests, but let's discuss the API first.

auroredea commented 6 years ago

Of course. I have to manage a new model file, and I would like to pass the config at all steps. I also think it's useful when we encounter different languages in a corpus and have to switch the NER model by language, for example.

My use case is to use a new French NER model on datas, which is a DataFrame of French sentences:

import java.util.Properties
import edu.stanford.nlp.io.IOUtils
import org.apache.spark.sql.functions.col

// Load the French pipeline properties and point the POS tagger at the French model.
val frenchProperties = new Properties()
frenchProperties.load(IOUtils.readerFromString("StanfordCoreNLP-french.properties"))
frenchProperties.put("pos.model", "edu/stanford/nlp/models/pos-tagger/french/french.tagger")
val frenchConfig = CoreNLPConfig(frenchProperties)

// Build a NER function bound to the French config and apply it to the sentence column.
val frenchNer = ner(frenchConfig)
val withNer = datas.withColumn("ner", frenchNer(col("sentences")))
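To illustrate the per-language switching mentioned above, a minimal sketch that reuses frenchNer from the snippet; the "language" column, the default English config, and the switching logic are assumptions for illustration, not part of this PR:

import java.util.Properties
import org.apache.spark.sql.functions.{col, when}

// Assumed: the DataFrame carries a "language" column, and an empty Properties
// falls back to the default (English) pipeline.
val englishNer = ner(CoreNLPConfig(new Properties()))
val tagged = datas.withColumn(
  "ner",
  when(col("language") === "fr", frenchNer(col("sentences")))
    .otherwise(englishNer(col("sentences")))
)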
nekonyuu commented 6 years ago

About the new tests: since the change only passes custom properties through, should we test only that the passed props are used correctly, or would you prefer a full test for each method?

nekonyuu commented 6 years ago

Any thoughts on that, @mengxr?

mengxr commented 6 years ago

Sorry for the late reply! I spent some time upgrading the Spark and CoreNLP versions.

  1. I think we need tests to ensure the props are picked up by the implementation. This could be done by using a different model file (if there are multiple models in the models.jar); a rough sketch follows below.
  2. For the API, do we need the case class wrapper? Using java.util.Properties directly seems sufficient to me.
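
For point 1, a minimal test sketch, assuming ScalaTest and the CoreNLPConfig/ner API from this PR; the property file and the French sentence are only illustrative. The idea is that if the French properties are really picked up, tagging a French sentence should give a different result than the default English pipeline:

import java.util.Properties
import edu.stanford.nlp.io.IOUtils
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.scalatest.FunSuite

class ConfigSuite extends FunSuite {
  test("custom properties are picked up by the pipeline") {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()

    val frenchProps = new Properties()
    frenchProps.load(IOUtils.readerFromString("StanfordCoreNLP-french.properties"))

    val df = spark.createDataFrame(Seq(Tuple1("Emmanuel Macron est né à Amiens."))).toDF("sentences")

    // Tag the same French sentence with the default config and with the French config.
    val defaultTags = df.select(ner(CoreNLPConfig(new Properties()))(col("sentences"))).head.getSeq[String](0)
    val frenchTags  = df.select(ner(CoreNLPConfig(frenchProps))(col("sentences"))).head.getSeq[String](0)

    // If the French model file was loaded, the NER output should differ from the
    // default English pipeline's output on this sentence.
    assert(frenchTags != defaultTags)
  }
}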