VITA-Group / LLaGA

[ICML2024] "LLaGA: Large Language and Graph Assistant", Runjin Chen, Tong Zhao, Ajay Jaiswal, Neil Shah, Zhangyang Wang
Apache License 2.0

Thank you for your work. Could you please tell me how to directly download the data from Box via a Linux server? #1

AGTSAAA closed this issue 8 months ago

AGTSAAA commented 9 months ago

Hi authors,

Thank you for your work.

Could you please tell me how to directly download the data via a Linux server?

Currently, I have to download the dataset from Box to my local computer first, which is very time-consuming.

ChenRunjin commented 9 months ago

Thank you for your interest. Unfortunately, Box does not offer an official command-line download tool comparable to gdown for Google Drive, and my Google Drive does not have sufficient space to store all the data. However, I've found an unofficial solution that might be useful for you. You can check out this repository: https://github.com/wuhanstudio/box-api-dl/tree/main. The creator is also working on downloading large datasets from Box onto a server. I hope this information is helpful to you.
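Alternatively, if you can create an app in the Box developer console, the official Box v2 REST API can stream a file directly to your server. Here is a minimal sketch; the file ID, access token, and output filename are all placeholders you would fill in (a short-lived developer token from the console works for quick one-off downloads):

```python
# Minimal sketch: download one file from Box straight onto a Linux server
# via the official v2 API. FILE_ID and ACCESS_TOKEN are placeholders.
import requests

FILE_ID = "1234567890"        # hypothetical: numeric ID from the Box file URL
ACCESS_TOKEN = "YOUR_TOKEN"   # hypothetical: developer/OAuth2 token

url = f"https://api.box.com/2.0/files/{FILE_ID}/content"
headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

# Stream to disk so large dataset archives never have to fit in memory.
with requests.get(url, headers=headers, stream=True) as r:
    r.raise_for_status()
    with open("llaga_data.bin", "wb") as f:  # placeholder output name
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)
```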

AGTSAAA commented 9 months ago

Thanks for your reply. I have another question about the following code. Why did you only build the prompt for the products dataset? Where can I find the prompts for the other datasets, such as arxiv, cora, and pubmed?

https://github.com/VITA-Group/LLaGA/blob/c0885cb8239b49549ae21ec8e0fa206642a05bc7/train/train.py#L651-L654

Thank you very much

ChenRunjin commented 9 months ago

The node classification prompts for the other three datasets can be found in the loaded "sampled{hop}{size}_train.jsonl" file, specifically in the l["conversations"][0]["value"] field. Because these datasets are relatively small, we stored the prompt for each sample directly in the file. The prompt is very similar to the one used for products.
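For example, a quick way to inspect the prompt for one of these datasets is to load the first line of the jsonl file and print that field. The path below is illustrative; substitute your dataset and hop/size values:

```python
# Illustrative sketch: peek at the node-classification prompt stored in a
# sampled{hop}{size}_train.jsonl file. The path is a placeholder.
import json

path = "dataset/cora/sampled_2_10_train.jsonl"  # hypothetical hop=2, size=10

with open(path) as f:
    sample = json.loads(f.readline())  # one sample per line

# The human-side prompt lives in the first conversation turn.
print(sample["conversations"][0]["value"])
```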

AGTSAAA commented 9 months ago

Thanks for your prompt reply. My issue is solved.

AGTSAAA commented 8 months ago

Hi Runjin,

Thank you very much for your help. I have successfully trained the model, but I noticed that inference is very slow, which may be because inference is run on one sample at a time. Could you please tell me how to improve the inference speed?

https://github.com/VITA-Group/LLaGA/blob/c0885cb8239b49549ae21ec8e0fa206642a05bc7/eval/eval_pretrain.py#L138

ChenRunjin commented 8 months ago

The reason we conduct inference on a single sample at a time is to emulate an interactive system where users submit one question per interaction.

To enhance inference speed, batch processing can be utilized. Moreover, in our testing tasks, since all samples within a single inference task share a common prompt, we have the opportunity to tokenize the prompt just once. I have prepared a naive version of batch processing for you in the 'naive_branch' branch, which you are welcome to use. Thank you.
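To illustrate the idea, here is a generic sketch of batched generation with Hugging Face transformers. It is not the code on the branch, the checkpoint name is a placeholder, and LLaGA's graph-embedding injection is omitted for brevity; it just shows left-padding a list of prompts into one batch and calling generate once instead of looping over samples:

```python
# Hedged sketch of batched LLM inference; "your-llaga-checkpoint" and the
# prompts are placeholders, and graph-token handling is left out.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "your-llaga-checkpoint"  # placeholder path/name
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt).cuda().eval()

tokenizer.padding_side = "left"  # left-pad so every row ends where generation begins
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Hypothetical per-sample questions appended to the shared task prompt.
prompts = ["Question 1 ...", "Question 2 ..."]

batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
with torch.inference_mode():
    out = model.generate(**batch, max_new_tokens=64,
                         pad_token_id=tokenizer.pad_token_id)

# Strip the (padded) prompt tokens; only the generated continuations remain.
answers = tokenizer.batch_decode(out[:, batch["input_ids"].shape[1]:],
                                 skip_special_tokens=True)
print(answers)
```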

AGTSAAA commented 8 months ago

Thanks!