UniversalDependencies / UD_Portuguese-Bosque

This is the Universal Dependencies (UD) Portuguese Bosque treebank.

How to use UD_Portuguese-Bosque with CoreNLP to extract relation info? #416

Open alvieirajr opened 1 month ago

alvieirajr commented 1 month ago

I don't want to write the relation-triple extraction rules by hand, as we do with spaCy in the example below (the reason is that there are many of them and I don't have the proficiency to write them all):

            # (...)
            # Extract relations based on nouns and prepositions
            if token.dep_ == "prep":
                subject = [w for w in token.head.lefts if w.dep_ == "nsubj"]
                obj = [w for w in token.rights if w.dep_ == "pobj"]
                if subject and obj:
                    relations.append((subject[0].text, token.text, obj[0].text))

            # Extract relations based on nouns and their predicatives
            if token.dep_ == "attr":
                subject = [w for w in token.head.lefts if w.dep_ == "nsubj"]
                if subject:
                    relations.append((subject[0].text, token.head.lemma_, token.text))
            # (...)
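
For context, a minimal sketch of the loop such rules normally run inside, assuming a spaCy Portuguese model; the model name pt_core_news_sm and the example sentence are only illustrative, not part of the original snippet:

import spacy

# Context sketch (assumed, not part of the original snippet): rules like the
# ones above run inside a token loop over a parsed document.
# "pt_core_news_sm" is only an example model name and must be installed first
# (python -m spacy download pt_core_news_sm).
nlp = spacy.load("pt_core_news_sm")
doc = nlp("Carl Sagan escreveu sobre a possibilidade de vida em outros planetas.")

relations = []
for token in doc:
    # First rule from the snippet above: a preposition whose head has an nsubj
    # child and which itself has a pobj child.
    if token.dep_ == "prep":
        subject = [w for w in token.head.lefts if w.dep_ == "nsubj"]
        obj = [w for w in token.rights if w.dep_ == "pobj"]
        if subject and obj:
            relations.append((subject[0].text, token.text, obj[0].text))

print(relations)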

Because of this, I want to use CoreNLP + Universal Dependencies to extract the relations. I'm using pt_bosque_models. Below are some details:

To start the server I'm using this command:

 java  -cp "stanford-corenlp-4.5.7.jar"  edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-portuguese.properties -port 9000 -timeout 15000 

My StanfordCoreNLP-portuguese.properties file content is:

annotators = tokenize,ssplit,pos,lemma,depparse
#tokenize.language = pt
ssplit.eolonly = true
# Dependency model
depparse.model = edu/stanford/nlp/models/parser/nndep/UD_Portuguese-Bosque.gz

The following files are in UD_Portuguese-Bosque.gz:

LICENSE.txt
pt_bosque-ud-dev.conllu
pt_bosque-ud-dev.txt
pt_bosque-ud-test.conllu
pt_bosque-ud-test.txt
pt_bosque-ud-train.conllu
pt_bosque-ud-train.txt
README.md
stats.xml

This is my Python request example:

import requests

# Stanford CoreNLP server URL
url = 'http://[::1]:9000'

# Example sentence
sentence = "Qual é a opinião de Carl Sagan sobre a possibilidade de formas de vida baseadas em elementos diferentes do carbono e água?"

# Request parameters
params = {
    'annotators': 'depparse,ner',
}

#tokenize,ssplit,pos,lemma,depparse,
# Request data
data = {
    'data': sentence
}
# Request to the CoreNLP server
response = requests.post(url, params=params, data=data)

# Check whether the request succeeded
print(response)
if response.status_code == 200:
    result = response.json()
    for sentence in result['sentences']:
        for triple in sentence['openie']:
            print("Relação extraída:", triple['subject'], triple['relation'], triple['object'])
else:
    print("Erro ao fazer requisição ao servidor Stanford CoreNLP.")

The Problem:

If "depparse" is present on params i get the error:

java.lang.NumberFormatException: For input string: "MDA4MTs2NmE2ZTExZDtDaHJvbWU7"
  java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
  java.base/java.lang.Integer.parseInt(Integer.java:668)
  java.base/java.lang.Integer.parseInt(Integer.java:786)
  edu.stanford.nlp.parser.nndep.DependencyParser.loadModelFile(DependencyParser.java:539)
  edu.stanford.nlp.parser.nndep.DependencyParserCache$DependencyParserSpecification.loadModelFile(DependencyParserCache.java:53)
  edu.stanford.nlp.parser.nndep.DependencyParserCache.loadFromModelFile(DependencyParserCache.java:76)
  edu.stanford.nlp.parser.nndep.DependencyParser.loadFromModelFile(DependencyParser.java:498)

If only "rer" is present on params the request return without errors but come without relations infos, i get only entities and tokens and without the key openie on result raising a error on line "for triple in sentence['openie']:"

Any suggestions?

AngledLuffa commented 1 month ago

There are multiple issues here:

To be entirely honest, I'm not familiar with the SpaCy dependency graph. But for the first relation, it looks kind of like you want a head with 2 children, nsubj and prep, and the prep child itself has a pobj child. That's quite easy to find with semgrex:

{} >nsubj {}=first >prep ({}=second >pobj {}=third)

A couple of weirdnesses: there are no such relations as prep or pobj in the Bosque treebank, so I'll leave it to you to figure out what triple you're actually trying to extract. Other relation patterns, and the constraints you can put on the words matched inside the {}, are documented in the SemgrexPattern Javadoc.
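
For illustration, a rough sketch of sending such a pattern to the server from the question via the CoreNLP server's /semgrex endpoint. This is an assumption about the request layout, not part of the reply above; it presumes the server is running on port 9000 with a working Portuguese depparse model, and since the response JSON layout can vary between CoreNLP versions, the raw result is just printed for inspection:

import json
import requests

# Send the suggested semgrex pattern to the running CoreNLP server's /semgrex
# endpoint. The pattern and annotator properties are passed as query parameters,
# and the text to parse is the raw request body.
url = 'http://localhost:9000/semgrex'
pattern = '{} >nsubj {}=first >prep ({}=second >pobj {}=third)'
props = {'annotators': 'tokenize,ssplit,pos,lemma,depparse', 'outputFormat': 'json'}

text = "Qual é a opinião de Carl Sagan sobre a possibilidade de formas de vida baseadas em elementos diferentes do carbono e água?"

response = requests.post(
    url,
    params={'pattern': pattern, 'properties': json.dumps(props)},
    data=text.encode('utf-8'),
)
response.raise_for_status()

# Print the raw JSON so the named nodes (first/second/third) can be inspected.
print(json.dumps(response.json(), indent=2, ensure_ascii=False))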

There are other dependency extraction toolkits, such as grew, and perhaps someone else here can walk you through using that if semgrex isn't satisfactory.

alvieirajr commented 1 month ago

Hi @AngledLuffa. My real problem is extracting dependencies from sentences without using a bank of dependency rules written by myself (that is the spaCy use case). As I am a newbie in this area, I asked ChatGPT for suggestions of dependency extraction rules in Portuguese to use in spaCy, but I don't know whether those suggestions are correct or whether the rules will work for Portuguese. So I will try to use a Universal Dependencies model called PT_BOSQUE in CoNLL-U format, where some dependency rules already exist. The idea is to extract subj, obj and rel automatically from small sentences.

I will consider your suggestions. Thanks a lot.

AngledLuffa commented 1 month ago

One more suggestion: don't use ChatGPT for complicated technical questions.
