Lavi11C opened this issue 12 months ago
Hi @Lavi11C, thank you for your interest in our work!
@abertsch72 - do you have an idea regarding inference_example.py?
In the meantime, I suggest following these instructions: https://github.com/abertsch72/unlimiformer#reproducing-the-experiments-from-the-paper---command-lines which are fully reproducible.
Let us know if you have any questions!
Best, Uri
Thanks for your reply. Do you mean I can run run.py with my own datasets via these arguments? I think "src/configs/data/gov_report.json \" is the part that selects the dataset. My datasets are all txt files. Can I point it at them directly, or do I need to preprocess them first? One final question: my project is about detecting whether a file is benign or malicious, so in the end I need an accuracy score. After running run.py, I think I will get BERTScore. Is that the accuracy, or do I have to write the accuracy code myself? Lots of questions; thank you for helping me with these problems.
Yes, you can definitely run with your own datasets by duplicating the json file and editing it to point to another dataset.
Note that your data needs to be in the same Huggingface format, for example: https://huggingface.co/datasets/tau/sled/viewer/gov_report/train
You will need to create a Huggingface dataset with 'input' and 'output' fields.
I'm attaching an example that shows how to create such datasets.
You can see how some of the other datasets in our repo (other than GovReport.json) have 'accuracy' as their metric. If you define accuracy in your custom json, it will be measured instead of ROUGE and BERTScore.
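As a rough illustration only (the key names below are guesses, not the repo's actual schema; the reliable route is to duplicate src/configs/data/gov_report.json and edit it, since the real schema lives there), a custom config could look something like:

```json
{
    "dataset_name": "path/to/my_local_dataset",
    "metric": "accuracy"
}
```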
Best, Uri
from argparse import ArgumentParser

import datasets
from tqdm import tqdm
from datasets import load_dataset


def save_dataset(examples, output_dir, split_name, hub_name=None):
    subset_dataset = datasets.Dataset.from_list(examples, split=split_name)
    if output_dir is not None:
        subset_dataset.save_to_disk(f'{output_dir}/{split_name}')
        print(f'Saved {len(subset_dataset)} {split_name} examples to disk')
    if hub_name is not None:
        subset_dataset.push_to_hub(hub_name)
        print(f'Pushed {len(subset_dataset)} {split_name} examples to {hub_name}')
    return subset_dataset


if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('--input_dataset', required=True, help='dir')
    parser.add_argument('--output_dir', required=True, help='dir')
    parser.add_argument('--hub_name', required=False)
    args = parser.parse_args()

    dataset = load_dataset('tau/sled', args.input_dataset)
    for split in dataset:
        subset = dataset[split]
        new_subset = []
        for example in tqdm(subset, total=len(subset)):
            # Merge the question prefix into 'input'; keep 'output' as the target.
            new_example = {
                'id': example['id'],
                'pid': example['pid'],
                'input': f"Q: {example['input_prefix']}\nText: {example['input']}",
                'output': example['output'] if example['output'] is not None else '',
            }
            new_subset.append(new_example)
        save_dataset(new_subset, output_dir=args.output_dir, split_name=split, hub_name=args.hub_name)
You mean if I want to use Unlimiformer, I have to put my datasets on Hugging Face first? I can't load datasets into the code from my local machine, right?
You don't necessarily need to upload it to the Huggingface Hub, but you do need to convert it to the Huggingface format; then just save it to disk and load it from there.
BTW, my own datasets are all txt files. Could I just load the data like this: dataset = load_dataset('text', data_files={'train': ['my_text_1.txt', 'my_text_2.txt'], 'test': 'my_test_file.txt'})?
I don't know, you can try. But how do you specify the output this way?
I found this approach on the Internet, and the resulting type is "Dataset", so I think it should be OK, shouldn't it?
Because I don't really understand this code :( There are some parameters I have no idea about. For example, I just have 'text', so do I have to write 'text': example['text'] and drop the other fields, such as 'id', 'pid', etc.? Should I rewrite save_dataset?
You will have to create a dataset that has 'input' and 'output' fields, in order to use our code as is.
So I can't just upload my dataset to Hugging Face directly? Sorry, I don't understand. Can't I just put each .txt file's content into 'input'? I don't see why I need to put something into 'output' when I use inference-example.py.
Hi @Lavi11C! You have a field called 'text' in your dataset, right? You can use the code Uri posted to save a version of the dataset that renames that field to 'input'. Then you can run the Unlimiformer evaluation directly using run.py. If you don't rename the field, then run.py has no way of determining which field to use as the inputs.
Alternatively, you can use code like inference-example.py and implement your own loop through the dataset and evaluation; then you can specify the input and output fields that you expect directly.
Sorry, my fault. I don't have a field called 'text' in my dataset; I meant my dataset's type is text. I will add an 'input' field when I put my dataset on Hugging Face. I got this code from ChatGPT. Do you think it satisfies your requirements? If it works, then I can run inference-example.py, right?
import os
from datasets import Dataset

dataset = []
folder_path = 'path_to_folder_containing_txt_files'
for filename in os.listdir(folder_path):
    if filename.endswith('.txt'):
        with open(os.path.join(folder_path, filename), 'r') as file:
            content = file.read()
            dataset.append({'input': content})

# Note: `dataset` is a list of dicts, so from_list (not from_dict) is the right constructor.
my_dataset = Dataset.from_list(dataset)
my_dataset.push_to_hub('my_dataset_name', use_auth_token='YOUR_AUTH_TOKEN')
This is my dataset on Hugging Face, and in the end I need to get accuracy. This is a binary classification problem. Do you think I can do that with Unlimiformer? Because I added a 'label' field, I'm not sure I'm following your rules.
I still face this problem... I can't understand why.
encoded_train_dataset = tokenizer.batch_encode_plus(
    dataset_train["train"]["input"],
    truncation=True,
    padding='max_length',
    max_length=99999,
    return_tensors="pt"
)
I set this part for the tokenizer. Does the problem happen here? Or is it because I use the 'label' field, which is not allowed in your code?
Or can your code only run on a single file rather than a whole dataset?
Sorry for the many questions.
I can run inference-example.py, but when I try to combine it with my own code I get the error "too many indices for tensor of dimension 1". I guess that in inference-example the input is just one tensor, and my own dataset (built from PDFs) should be one tensor as well, but I think I'm missing something. Would you help me fix this problem? Thank you very much.