alibaba / GraphTranslator

GraphTranslator: Aligning Graph Model to Large Language Model for Open-ended Tasks
BSD 3-Clause "New" or "Revised" License
67 stars 12 forks

Fail to generate meaningful summary with node embedding #4

Closed RManLuo closed 4 months ago

RManLuo commented 4 months ago

Hi, thanks for open-sourcing this insightful project. However, I encounter some issues when reproducing the results.

I loaded the checkpoint and ran generate.py. I found that response0, which is generated from the node embedding, is a bunch of meaningless text.

https://github.com/alibaba/GraphTranslator/blob/bf7c4469dd66fa9e3ece044cf99e0abbacc11620/Translator/models/translator_models/translator_chatglm_arxiv.py#L258-L264

Here is an example result I got.

Input:

ID: 62316
TITLE: hierarchical correlation clustering and tree preserving embedding

Response0:

ThereThereAndY There T This The Alvive充值。 max walk0多做‰片[ =>‘基因组 code creating now without using threeometric functional on human well while operating web resource D D D D B F JIJ D O I JO M JI JOE V G D D D D D D D D D D D D D D D D D D D D D D D D D D Y D D OG无需抱歉名学生“流行源源:[][]第一名影片中语句用途 Any反感惊讶 capacient合成, None One On勿遗弱ification for which反复。)桟
}[猾 This膀胱 couldoeinstgvegettheoutGetYourAnyM511)ThePariviOneGridInstInParBODOfTheseOverUpEnGoGenMaxIm第一吸噪声 no�损害联合拒合穿演表放格: lifestable source with a ` ) They Have AGAINGRISICIAIGIFTHEND N R S L E; The Similar Sims from this similar there any given isn sorts should allow就应该允许不合理不假应予以 subject material or Topic (ofinning)WeNowWithThHh h This Is Very Val Ver

Then I found that you insert the content of response0 into the final prediction prompt (summary_prompt), which looks like this:

summary_prompt:

Round 0:

Question:We are trying to explore the paper titled ['interpretable mtl from heterogeneous domains using boosted tree']. Please summarize the topic and content of the paper and its citations in English

Answer:ThereThereAndY There T This The Alvive充值。 max walk0多做‰片[ =>‘基因组 code creating now without using threeometric functional on human well while operating web resource D D D D B F JIJ D O I JO M JI JOE V G D D D D D D D D D D D D D D D D D D D D D D D D D D Y D D OG无需抱歉名学生“流行源源:[][]第一名影片中语句用途 Any反感惊讶 capacient合成, None One On勿遗弱ification for which反复。)桟
}[猾 This膀胱 couldoeinstgvegettheoutGetYourAnyM511)ThePariviOneGridInstInParBODOfTheseOverUpEnGoGenMaxIm第一吸噪声 no�损害联合拒合穿演表放格: lifestable source with a ` ) They Have AGAINGRISICIAIGIFTHEND N R S L E; The Similar Sims from this similar there any given isn sorts should allow就应该允许不合理不假应予以 subject material or Topic (ofinning)WeNowWithThHh h This Is Very Val Ver

Round 1:

Question: Based on the summary of the above paper, please determine into which of the following 40 arXiv CS sub-categories would this paper most likely fall?categories: <Artificial Intelligence; Hardware Architecture; Computational Complexity; Computational Engineering, Finance, and Science; Computational Geometry; Computation and Language; Cryptography and Security; Computer Vision and Pattern Recognition; Computers and Society; Databases; Distributed, Parallel, and Cluster Computing; Digital Libraries; Discrete Mathematics; Data Structures and Algorithms; Emerging Technologies; Formal Languages and Automata Theory; General Literature; Graphics; Computer Science and Game Theory; Human-Computer Interaction; Information Retrieval; Information Theory; Machine Learning; Logic in Computer Science; Multiagent Systems; Multimedia; Mathematical Software; Numerical Analysis; Neural and Evolutionary Computing; Networking and Internet Architecture; Other Computer Science; Operating Systems; Performance; Programming Languages; Robotics; Symbolic Computation; Sound; Software Engineering; Social and Information Networks; Systems and Control>Please give 5 likely categories, in order from most likely to least likely, and give your reasoning. Provide response in JSON format with the following keys: category, reason.

Answer:
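For context, the summary_prompt above seems to be assembled by pasting response0 into a ChatGLM-style multi-round template, roughly like this (a sketch; the function and variable names are my guesses, not the repo's actual code):

```python
def build_summary_prompt(title, response0, category_question):
    """Assemble a ChatGLM-style two-round prompt (sketch, names assumed)."""
    rounds = [
        # Round 0: summarization question, "answered" by the (garbled) response0
        ("We are trying to explore the paper titled ['{}']. "
         "Please summarize the topic and content of the paper and its "
         "citations in English".format(title), response0),
        # Round 1: classification question; the answer is left empty
        # so the model fills it in as Response2
        (category_question, ""),
    ]
    parts = []
    for i, (question, answer) in enumerate(rounds):
        parts.append("Round {}:\n\nQuestion:{}\n\nAnswer:{}".format(i, question, answer))
    return "\n\n".join(parts)

prompt = build_summary_prompt(
    "interpretable mtl from heterogeneous domains using boosted tree",
    "<garbled response0 here>",
    "Based on the summary of the above paper, please determine ...",
)
print(prompt)
```

So even when response0 is gibberish, the model still sees the well-formed Round 1 question (with the paper title embedded in it), which may explain why Response2 remains coherent.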

The final output, on the other hand, looks reasonable:

Response2:

{ "category": "Artificial Intelligence; Machine Learning", "reason": "The paper is focused on using boosted trees for interpretable machine learning from heterogeneous domains. This falls under the broader category of artificial intelligence and machine learning, which is a subcategory of the larger field of artificial intelligence." }

I wonder if I made any mistakes during the inference. Why is 'response0' filled with meaningless text? How can the language model still produce a coherent output with such nonsensical context?

I am really looking forward to hearing from you. Thanks again for open-sourcing your project.

LITONG99 commented 4 months ago

The same problem happens to me.

smw1996 commented 4 months ago

https://github.com/alibaba/GraphTranslator/blob/bf7c4469dd66fa9e3ece044cf99e0abbacc11620/Translator/models/translator_models/translator_chatglm_arxiv.py#L90-L102

👋 Hello,

Thank you for bringing this issue to our attention. Regarding the gibberish output you've been encountering, I believe I've identified the root cause and the fix.

The root cause of the issue is the setting add_special_tokens=False in the code above. To resolve the problem, set add_special_tokens=True. With this change, the model output meets expectations, and the parameter setting in generate becomes consistent with the one used during training. In our experiments, add_special_tokens was also set to True; the open-source code should not have modified this parameter. We have corrected this part of the code and resubmitted it.

Setting add_special_tokens=True is crucial when using our model for text generation. This parameter determines whether special tokens, such as a starting [CLS] token and an ending [SEP] token for some models, are added to the beginning and end of the sequence. These special tokens are vital for the model to correctly understand the structure and context of the input; the add_special_tokens parameter ensures they are added so the model can interpret and process the data accurately. Without them, the model may fail to understand the structure of the input, which degrades its performance and the quality of the generated output, and can lead to the gibberish you encountered.
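As a rough illustration of what the flag does (using a toy vocabulary with made-up token IDs, not ChatGLM's actual tokenizer):

```python
# Toy encoder sketch: shows how add_special_tokens changes the encoded
# sequence. The vocabulary and IDs below are invented for illustration;
# ChatGLM has its own special tokens and tokenizer.
VOCAB = {"[CLS]": 101, "[SEP]": 102, "hello": 7592, "world": 2088}

def encode(tokens, add_special_tokens=True):
    """Optionally wrap the token sequence in [CLS] ... [SEP]."""
    ids = [VOCAB[t] for t in tokens]
    if add_special_tokens:
        ids = [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]
    return ids

print(encode(["hello", "world"], add_special_tokens=False))
# [7592, 2088] -- no boundary markers; the model sees a "headless" sequence
print(encode(["hello", "world"], add_special_tokens=True))
# [101, 7592, 2088, 102] -- matches what the model saw during training
```

Because the model was trained on sequences that include these boundary tokens, encoding inference inputs without them puts the input out of distribution, which is why generation degenerates.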

RManLuo commented 4 months ago

Thanks a lot for the reply. After updating the code to this commit 1cfa113294427fe1081ac0d5906e701eee31305c, the issue is fixed!