UniversalDependencies / UD_Portuguese-Bosque

This is the Universal Dependencies (UD) Portuguese Bosque treebank.

How to use UD_Portuguese-Bosque with CoreNLP to extract relation info? #416

Open alvieirajr opened 1 month ago

alvieirajr commented 1 month ago

I don't want to write the relation-triple extraction rules by hand, as we do with spaCy in the example below (the reason is that there are many of them and I don't have the proficiency to write them all):

            # (...)
            # Extract relations based on nouns and prepositions
            if token.dep_ == "prep":
                subject = [w for w in token.head.lefts if w.dep_ == "nsubj"]
                obj = [w for w in token.rights if w.dep_ == "pobj"]
                if subject and obj:
                    relations.append((subject[0].text, token.text, obj[0].text))

            # Extract relations based on nouns and their predicatives
            if token.dep_ == "attr":
                subject = [w for w in token.head.lefts if w.dep_ == "nsubj"]
                if subject:
                    relations.append((subject[0].text, token.head.lemma_, token.text))
            # (...)
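
For context, a minimal sketch of the loop such rules normally run inside, assuming a spaCy Portuguese model; the model name pt_core_news_sm and the example sentence are only illustrative, not part of the original snippet:

import spacy

# Context sketch (assumed, not part of the original snippet): rules like the
# ones above run inside a token loop over a parsed document.
# "pt_core_news_sm" is only an example model name and must be installed first
# (python -m spacy download pt_core_news_sm).
nlp = spacy.load("pt_core_news_sm")
doc = nlp("Carl Sagan escreveu sobre a possibilidade de vida em outros planetas.")

relations = []
for token in doc:
    # First rule from the snippet above: a preposition whose head has an nsubj
    # child and which itself has a pobj child.
    if token.dep_ == "prep":
        subject = [w for w in token.head.lefts if w.dep_ == "nsubj"]
        obj = [w for w in token.rights if w.dep_ == "pobj"]
        if subject and obj:
            relations.append((subject[0].text, token.text, obj[0].text))

print(relations)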

Because of this, I want to use CoreNLP + Universal Dependencies to extract the relations. I'm using pt_bosque_models. Below are some details:

To start the server I'm using this command:

 java  -cp "stanford-corenlp-4.5.7.jar"  edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-portuguese.properties -port 9000 -timeout 15000 

My StanfordCoreNLP-portuguese.properties file content is:

annotators = tokenize,ssplit,pos,lemma,depparse
#tokenize.language = pt
ssplit.eolonly = true
# Dependency model
depparse.model = edu/stanford/nlp/models/parser/nndep/UD_Portuguese-Bosque.gz

The following files are in UD_Portuguese-Bosque.gz:

LICENSE.txt
pt_bosque-ud-dev.conllu
pt_bosque-ud-dev.txt
pt_bosque-ud-test.conllu
pt_bosque-ud-test.txt
pt_bosque-ud-train.conllu
pt_bosque-ud-train.txt
README.md
stats.xml

This is my Python request example:

import requests

# Stanford CoreNLP server URL
url = 'http://[::1]:9000'

# Example sentence
sentence = "Qual é a opinião de Carl Sagan sobre a possibilidade de formas de vida baseadas em elementos diferentes do carbono e água?"

# Request parameters
params = {
    'annotators': 'depparse,ner',
}

#tokenize,ssplit,pos,lemma,depparse,
# Request data
data = {
    'data': sentence
}
# Request to the CoreNLP server
response = requests.post(url, params=params, data=data)

# Check whether the request succeeded
print(response)
if response.status_code == 200:
    result = response.json()
    for sentence in result['sentences']:
        for triple in sentence['openie']:
            print("Relação extraída:", triple['subject'], triple['relation'], triple['object'])
else:
    print("Erro ao fazer requisição ao servidor Stanford CoreNLP.")

The Problem:

If "depparse" is present on params i get the error:

java.lang.NumberFormatException: For input string: "MDA4MTs2NmE2ZTExZDtDaHJvbWU7"
  java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
  java.base/java.lang.Integer.parseInt(Integer.java:668)
  java.base/java.lang.Integer.parseInt(Integer.java:786)
  edu.stanford.nlp.parser.nndep.DependencyParser.loadModelFile(DependencyParser.java:539)
  edu.stanford.nlp.parser.nndep.DependencyParserCache$DependencyParserSpecification.loadModelFile(DependencyParserCache.java:53)
  edu.stanford.nlp.parser.nndep.DependencyParserCache.loadFromModelFile(DependencyParserCache.java:76)
  edu.stanford.nlp.parser.nndep.DependencyParser.loadFromModelFile(DependencyParser.java:498)

If only "rer" is present on params the request return without errors but come without relations infos, i get only entities and tokens and without the key openie on result raising a error on line "for triple in sentence['openie']:"

Any suggestions?

AngledLuffa commented 1 month ago

There are multiple issues here:

To be entirely honest, I'm not familiar with the SpaCy dependency graph. But for the first relation, it looks kind of like you want a head with 2 children, nsubj and prep, and the prep child itself has a pobj child. That's quite easy to find with semgrex:

{} >nsubj {}=first >prep ({}=second >pobj {}=third)

A couple of weirdnesses: there are no such relations as prep or pobj in the Bosque treebank, so I'll leave it to you to figure out what triple you're actually trying to extract. Other relation patterns, and the constraints you can put on the words matched inside the {}, are documented in the SemgrexPattern Javadoc.
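
For illustration, a rough sketch of sending such a pattern to the server from the question via the CoreNLP server's /semgrex endpoint. This is an assumption about the request layout, not part of the reply above; it presumes the server is running on port 9000 with a working Portuguese depparse model, and since the response JSON layout can vary between CoreNLP versions, the raw result is just printed for inspection:

import json
import requests

# Send the suggested semgrex pattern to the running CoreNLP server's /semgrex
# endpoint. The pattern and annotator properties are passed as query parameters,
# and the text to parse is the raw request body.
url = 'http://localhost:9000/semgrex'
pattern = '{} >nsubj {}=first >prep ({}=second >pobj {}=third)'
props = {'annotators': 'tokenize,ssplit,pos,lemma,depparse', 'outputFormat': 'json'}

text = "Qual é a opinião de Carl Sagan sobre a possibilidade de formas de vida baseadas em elementos diferentes do carbono e água?"

response = requests.post(
    url,
    params={'pattern': pattern, 'properties': json.dumps(props)},
    data=text.encode('utf-8'),
)
response.raise_for_status()

# Print the raw JSON so the named nodes (first/second/third) can be inspected.
print(json.dumps(response.json(), indent=2, ensure_ascii=False))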

There are other dependency extraction toolkits, such as grew, and perhaps someone else here can walk you through using that if semgrex isn't satisfactory.

alvieirajr commented 1 month ago

Hi @AngledLuffa. My real problem is extracting dependencies from sentences without using a bank of dependency rules written by myself (that is the spaCy use case). As I am a newbie in this area, I asked ChatGPT for suggestions of dependency extraction rules in Portuguese to use in spaCy, but I don't know whether those suggestions are correct or whether the rules will work for Portuguese. So I will try to use a Universal Dependencies model called PT_BOSQUE in CoNLL-U format, where some dependency rules already exist. The idea is to extract subj, obj and rel automatically from small sentences.

I will consider your suggestions. Thanks a lot.

AngledLuffa commented 1 month ago

One more suggestion: don't use ChatGPT for complicated technical questions.
