abertsch72 / unlimiformer

Public repo for the NeurIPS 2023 paper "Unlimiformer: Long-Range Transformers with Unlimited Length Input"
MIT License
1.05k stars 77 forks

Question: too many indices for tensor of dimension 1 #40

Open Lavi11C opened 12 months ago

Lavi11C commented 12 months ago

I can run inference-example.py, but when I try to combine it with my own code I get the error "too many indices for tensor of dimension 1". I guess that in inference-example.py the input is just a single tensor, and my own dataset (built from PDFs) should also be a single tensor, but I think I'm missing something somewhere. Would you help me fix this problem? Thank you very much.

(Screenshot: error traceback, 2023-09-11.)
urialon commented 11 months ago

Hi @Lavi11C, thank you for your interest in our work!

@abertsch72 - do you have an idea regarding inference_example.py?

In the meantime, I suggest following these instructions: https://github.com/abertsch72/unlimiformer#reproducing-the-experiments-from-the-paper---command-lines which are fully reproducible.

Let us know if you have any questions!

Best, Uri

Lavi11C commented 11 months ago

Thanks for your reply. Do you mean I can run run.py on my own dataset with those arguments? I think the src/configs/data/gov_report.json part is what specifies the dataset. My dataset is all .txt files; can I point to them directly, or do I need to preprocess them first? One final question: my project is about detecting whether a file is benign or malicious, so in the end I need an accuracy score. After running run.py I think I will get BERTScore; is that the accuracy, or do I have to write the accuracy code myself? Lots of questions, thank you for helping me sort them out.

urialon commented 11 months ago

Yes, you can definitely run with your own datasets, by duplicating the json file and editing it to point to another dataset.

Note that your data needs to be in the same Hugging Face format as, for example: https://huggingface.co/datasets/tau/sled/viewer/gov_report/train. You will need to create a Hugging Face dataset with 'input' and 'output' fields. I'm attaching an example script that shows how to create such a dataset.

You can see that some of the other dataset configs in our repo (other than gov_report.json) use accuracy as their metric. If you define accuracy in your custom json, it will be measured instead of ROUGE and BERTScore.
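For illustration, a custom data config might look roughly like the sketch below (written here as a Python snippet that writes the json file). The key names are assumptions modeled on src/configs/data/gov_report.json and on the metric behaviour described above; the safest route is to copy the real gov_report.json and edit it.

import json

# Hypothetical custom data config -- key names are assumptions; start from a copy of
# src/configs/data/gov_report.json rather than writing one from scratch.
custom_config = {
    "dataset_name": "data/my_dataset",        # local path or Hugging Face hub id
    "max_source_length": 16384,
    "generation_max_length": 32,
    "metric_names": ["accuracy"],             # measured instead of ROUGE/BERTScore
    "metric_for_best_model": "accuracy",
    "greater_is_better": True,
}

with open("src/configs/data/my_dataset.json", "w") as f:
    json.dump(custom_config, f, indent=2)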

Best, Uri

urialon commented 11 months ago

from argparse import ArgumentParser
import datasets
from tqdm import tqdm
from datasets import load_dataset

def save_dataset(examples, output_dir, split_name, hub_name=None):
    # Build a Hugging Face Dataset from a list of example dicts, then save and/or push it.
    subset_dataset = datasets.Dataset.from_list(examples, split=f'{split_name}')
    if output_dir is not None:
        subset_dataset.save_to_disk(f'{output_dir}/{split_name}')
        print(f'Saved {len(subset_dataset)} {split_name} examples to disk')
    if hub_name is not None:
        subset_dataset.push_to_hub(hub_name)
        print(f'Pushed {len(subset_dataset)} {split_name} examples to {hub_name}')
    return subset_dataset

if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('--input_dataset', required=True, help='dir')
    parser.add_argument('--output_dir', required=True, help='dir')
    parser.add_argument('--hub_name', required=False)
    args = parser.parse_args()

    # Load one of the tau/sled configurations (e.g. gov_report).
    dataset = load_dataset('tau/sled', args.input_dataset)

    for split in dataset:
        subset = dataset[split]
        new_subset = []
        for example in tqdm(subset, total=len(subset)):
            # Collapse the SLED fields into the two fields run.py expects: 'input' and 'output'.
            new_example = {
                'id': example['id'],
                'pid': example['pid'],
                'input': f"Q: {example['input_prefix']}\nText: {example['input']}",
                'output': example['output'] if example['output'] is not None else '',
            }

            new_subset.append(new_example)
        save_dataset(new_subset, output_dir=args.output_dir, split_name=split, hub_name=args.hub_name)
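For reference, assuming the script above is saved under a hypothetical name such as convert_sled_dataset.py, it would be invoked along the lines of "python convert_sled_dataset.py --input_dataset gov_report --output_dir data/gov_report_converted", where gov_report is one of the tau/sled configurations and the output directory is arbitrary.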
Lavi11C commented 11 months ago

Do you mean that if I want to use Unlimiformer, I have to put my dataset on Hugging Face first? I can't load the dataset into the code from a local path, right?

urialon commented 11 months ago

You don't necessarily need to upload it to the Hugging Face hub, but you do need to convert it to the Hugging Face format; then you can just save it to disk and load it from there.
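For illustration, a minimal sketch of that convert-save-load flow with the datasets library; the records and paths below are made up, and only the 'input'/'output' field names follow the description above.

from datasets import Dataset, load_from_disk

# Toy examples -- in practice, build one record per document with the fields
# run.py expects ('input' and 'output').
examples = [
    {"input": "contents of file_1.txt ...", "output": "benign"},
    {"input": "contents of file_2.txt ...", "output": "malicious"},
]

train_set = Dataset.from_list(examples)
train_set.save_to_disk("data/my_dataset/train")      # convert once, save locally

reloaded = load_from_disk("data/my_dataset/train")   # later: load without the hub
print(reloaded[0]["output"])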

Lavi11C commented 11 months ago

By the way, my own dataset is all .txt files. Could I just load the data like this: dataset = load_dataset('text', data_files={'train': ['my_text_1.txt', 'my_text_2.txt'], 'test': 'my_test_file.txt'})?

urialon commented 11 months ago

I don't know, you can try. But how do you specify the output this way?
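For context, the built-in 'text' loader produces a single 'text' column with one row per line of each file, so an output column has to be added separately. A rough sketch, with made-up file names and a placeholder label:

from datasets import load_dataset

# The 'text' loading script yields one row per line, each with a single 'text' field.
dataset = load_dataset("text", data_files={"train": ["my_text_1.txt", "my_text_2.txt"]})

def to_input_output(example):
    # Rename 'text' to 'input' and attach an 'output' by hand;
    # the label here is a placeholder and must come from your own ground truth.
    return {"input": example["text"], "output": "benign"}

converted = dataset["train"].map(to_input_output, remove_columns=["text"])
print(converted.column_names)  # e.g. ['input', 'output']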

Lavi11C commented 11 months ago

I found this approach on the Internet, and the object it returns is of type "Dataset", so I think that's OK, isn't it?

Lavi11C commented 11 months ago

(quoting the conversion script attached above)

I don't really understand this code :( There are some parameters I don't have. For example, I only have 'text', so do I write 'text': example['text'] and drop the other fields, such as 'id', 'pid', etc.? Should I rewrite save_dataset?

urialon commented 11 months ago

You will have to create a dataset that has input and output fields, in order to use our code as is.


Lavi11C commented 11 months ago

So I can't just upload my dataset to Hugging Face directly? Sorry, I don't understand. Can't I just put each .txt file's content into 'input'? I also don't understand why I need to put something into 'output' when I use inference-example.py.

abertsch72 commented 11 months ago

Hi @Lavi11C ! You have a field called 'text' in your dataset, right? You can use the code Uri posted to save a version of the dataset that renames that field to 'input'. Then you can run the Unlimiformer evaluation directly using run.py. If you don't rename the field, then run.py has no way of determining which field to use as the inputs.

Alternatively, you can use code like inference-example.py and implement your own loop through the dataset and your own evaluation; then you can specify the input and output fields you expect directly.
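For the first option, the renaming itself is a one-liner with the datasets API; the paths below are hypothetical and assume the dataset was already saved to disk.

from datasets import load_from_disk

dataset = load_from_disk("data/my_dataset/train")

# run.py looks for an 'input' field, so rename the existing 'text' column.
dataset = dataset.rename_column("text", "input")

dataset.save_to_disk("data/my_dataset_renamed/train")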

Lavi11C commented 11 months ago

Sorry, my fault. I don't have a field called 'text' in my dataset; I meant that my dataset's type is text. I will add an 'input' field when I upload my dataset to Hugging Face. I got this code from ChatGPT. Do you think it works and fits your requirements? If it does, then I can run inference-example.py, right?

import os

from datasets import Dataset

folder_path = 'path_to_folder_containing_txt_files'

# Read every .txt file in the folder and keep its full contents as one example.
dataset = []
for filename in os.listdir(folder_path):
    if filename.endswith('.txt'):
        with open(os.path.join(folder_path, filename), 'r') as file:
            content = file.read()
            dataset.append({'input': content})

# Note: a list of dicts needs Dataset.from_list (from_dict expects a dict of columns).
my_dataset = Dataset.from_list(dataset)

my_dataset.push_to_hub('my_dataset_name', use_auth_token='YOUR_AUTH_TOKEN')
Lavi11C commented 11 months ago

(Image: dataset preview on Hugging Face.) This is my dataset on Hugging Face. In the end I want accuracy; this is a binary classification problem. Do you think I can do that with Unlimiformer? Because I added a 'label' field, I'm not sure I follow your format.
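For context, since run.py expects 'input' and 'output' fields (per the comments above), one way to make a classification dataset fit that format is to turn the label into the output text, so an exact-match/accuracy metric can compare the generated text against it. A rough sketch; the hub id and the label encoding are assumptions about the dataset shown in the image.

from datasets import load_dataset

# Hypothetical hub id and 'label' column, based on the dataset described above.
dataset = load_dataset("my_username/my_dataset_name", split="train")

label_names = {0: "benign", 1: "malicious"}   # assumed label encoding

def label_to_output(example):
    # Map the integer label to the text the model should generate.
    return {"output": label_names[example["label"]]}

converted = dataset.map(label_to_output).remove_columns(["label"])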

Lavi11C commented 11 months ago

(Image: error traceback.) I still get this error, and I can't understand why.

encoded_train_dataset = tokenizer.batch_encode_plus(
    dataset_train["train"]["input"],
    truncation=True,
    padding='max_length',
    max_length=99999,
    return_tensors="pt"
)

I set up the tokenizer this way. Does the problem happen here? Or is it because I use the 'label' field, which is not allowed in your code?

Or can your code perhaps not run over a whole dataset, only over one file at a time?

Sorry for so many questions.
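As a general note, and not a confirmed diagnosis of the traceback above, inference-example.py processes one document at a time rather than batch-encoding a whole split with a fixed max_length. A minimal sketch of that per-example pattern with a plain seq2seq model is below; the checkpoint name is a placeholder, dataset_train refers to the dataset loaded in the snippet above, and the Unlimiformer wrapping step is elided (see inference-example.py for the actual conversion call).

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "facebook/bart-base"   # placeholder; use the checkpoint you actually run
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# ... wrap `model` with Unlimiformer here, as done in inference-example.py ...

for example in dataset_train["train"]:
    # Encode a single document; no global padding to max_length=99999 is needed.
    inputs = tokenizer(example["input"], return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=8)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))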