Open ghost opened 7 years ago
Hi, I created a new function called, for example, "ner2" :)
def ner2 = udf { sentence: String =>
  val pipeline = getOrCreateSentimentPipeline()
  val document = pipeline.process(sentence)
  val sentences = document.get(classOf[SentencesAnnotation]).asScala.toList
  val tokens = sentences.flatMap { sentence => sentence.get(classOf[TokensAnnotation]).asScala.toList }
  tokens.map { token =>
    //val word = token.get(classOf[TextAnnotation])
    val ner = token.get(classOf[NamedEntityTagAnnotation])
    //val lemma = token.get(classOf[LemmaAnnotation])
    (ner)
  }
}
private def getOrCreateSentimentPipeline(): StanfordCoreNLP = {
  if (sentimentPipeline == null) {
    val props = new Properties()
    //props.setProperty("annotators", "tokenize, ssplit, parse, sentiment")
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner")
    props.setProperty("tokenize.language", "es")
    props.setProperty("tokenize.verbose", "true")
    props.setProperty("pos.model", "edu/stanford/nlp/models/pos-tagger/spanish/spanish-distsim.tagger")
    props.setProperty("ner.model", "edu/stanford/nlp/models/ner/spanish.ancora.distsim.s512.crf.ser.gz")
    props.setProperty("ner.applyNumericClassifiers", "false")
    props.setProperty("ner.useSUTime", "false")
    props.setProperty("ner.language", "spanish")
    props.setProperty("parse.model", "edu/stanford/nlp/models/lexparser/spanishPCFG.ser.gz")
    props.setProperty("depparse.model", "edu/stanford/nlp/models/parser/nndep/UD_Spanish.gz")
    props.setProperty("depparse.language", "spanish")
    props.setProperty("regexner.ignoreCase", "true")
    props.setProperty("regexner.verbose", "true")
    sentimentPipeline = new StanfordCoreNLP(props)
  }
  sentimentPipeline
}
I have this code to run CoreNLP on Spanish text. I use the Databricks API in Scala:
var props: Properties = new Properties()
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner")
props.setProperty("tokenize.language", "es")
props.setProperty("tokenize.verbose", "true")
props.setProperty("pos.model", "edu/stanford/nlp/models/pos-tagger/spanish/spanish-distsim.tagger")
props.setProperty("ner.model", "edu/stanford/nlp/models/ner/spanish.ancora.distsim.s512.crf.ser.gz")
props.setProperty("parse.model", "edu/stanford/nlp/models/lexparser/spanishPCFG.ser.gz")
val sentimentPipeline = new StanfordCoreNLP(props)

val output = df
  .select(explode(ssplit('_c3)).as('sen))
  .select('sen, tokenize('sen).as('words), ner('sen).as('nerTags))
output.show(truncate = false)
My pom.xml file looks like this:
I get this error:
Caused by: java.io.IOException: Unable to open "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" as class path, filename or URL
Before the error, I saw this in my log:
17/07/30 19:03:04 INFO AnnotatorPool: Replacing old annotator "tokenize" with signature [tokenize.language:es;tokenize.verbose:true;] with new annotator with signature [ssplit.isOneSentence:true;tokenize.language:en;tokenize.class:PTBTokenizer;]
17/07/30 19:03:04 INFO AnnotatorPool: Replacing old annotator "ssplit" with signature [tokenize.language:es;tokenize.verbose:true;] with new annotator with signature [ssplit.isOneSentence:true;tokenize.language:en;tokenize.class:PTBTokenizer;]
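To make the difference between the old and new signatures in these log lines explicit, here is a minimal sketch that compares them key by key. It assumes the signature is a list of `key:value;` pairs, as it appears in the log. `SignatureDiff`, `parseSignature` and `changedKeys` are hypothetical helpers written for this comparison; they are not part of CoreNLP.

```scala
object SignatureDiff {
  // Parse a signature such as "k1:v1;k2:v2;" into a Map of key -> value.
  def parseSignature(sig: String): Map[String, String] =
    sig.split(";").filter(_.nonEmpty).map { kv =>
      val Array(key, value) = kv.split(":", 2)
      (key, value)
    }.toMap

  // Keys present in both signatures whose values differ, as key -> (old, new).
  def changedKeys(oldSig: Map[String, String],
                  newSig: Map[String, String]): Map[String, (String, String)] =
    oldSig.collect {
      case (key, oldValue) if newSig.get(key).exists(_ != oldValue) =>
        (key, (oldValue, newSig(key)))
    }

  def main(args: Array[String]): Unit = {
    // Signatures copied from the AnnotatorPool log lines.
    val oldSig = parseSignature("tokenize.language:es;tokenize.verbose:true;")
    val newSig = parseSignature(
      "ssplit.isOneSentence:true;tokenize.language:en;tokenize.class:PTBTokenizer;")

    changedKeys(oldSig, newSig).foreach { case (key, (from, to)) =>
      println(s"$key: $from -> $to") // prints "tokenize.language: es -> en"
    }
  }
}
```

The only key that appears in both signatures with a different value is `tokenize.language`, which goes from `es` to `en`.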
I think this is the cause of my error, because the language seems to have been replaced "automatically". Is that right? Thanks.