QData / TextAttack

TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP.
https://textattack.readthedocs.io/en/master/

ValueError: Unsupported dataset schema #449 #529

Closed marwanomar1 closed 2 years ago

marwanomar1 commented 2 years ago

I am running adversarial training on NLP models and I am getting the error "ValueError: Unsupported dataset schema" when I run the following code:

import textattack
import transformers
from textattack.datasets import HuggingFaceDataset

model = transformers.AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model_wrapper = textattack.models.wrappers.HuggingFaceModelWrapper(model, tokenizer)

# We only use DeepWordBugGao2018 for demonstration purposes.
attack = textattack.attack_recipes.DeepWordBugGao2018.build(model_wrapper)
train_dataset = HuggingFaceDataset('squad', split='train')
eval_dataset = HuggingFaceDataset('squad', split='validation')

# Train for 3 epochs with 1 initial clean epoch, 1000 adversarial examples
# per epoch, learning rate of 5e-5, and effective batch size of 32 (8x4).
training_args = textattack.TrainingArgs(
    num_epochs=3,
    num_clean_epochs=1,
    num_train_adv_examples=1000,
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    log_to_tb=True,
)

trainer = textattack.Trainer(
    model_wrapper,
    "classification",
    attack,
    train_dataset,
    eval_dataset,
    training_args,
)
trainer.train()

@jxmorris12

jxmorris12 commented 2 years ago

I suggested a fix that you haven't tried yet:

A quick diagnosis tells me you should be using our HuggingFaceDataset class to wrap the dataset instead of just importing it directly from HuggingFace datasets. So in the code you posted, your dataset initializations might look something like:

from textattack.datasets import HuggingFaceDataset

train_dataset = HuggingFaceDataset('squad', split='train')
eval_dataset = HuggingFaceDataset('squad', split='validation')
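
If the schema still can't be inferred automatically, HuggingFaceDataset also takes an optional dataset_columns argument that names the input columns and the label column explicitly. A minimal sketch, assuming a single-text classification dataset with rotten_tomatoes-style "text"/"label" columns (note that squad is a question-answering dataset, so its question/context/answers schema won't map onto the (text, label) pairs a "classification" Trainer expects):

from textattack.datasets import HuggingFaceDataset

# dataset_columns is (input_columns, label_column); the column names here
# follow rotten_tomatoes and are assumptions, not squad's schema.
train_dataset = HuggingFaceDataset(
    "rotten_tomatoes",
    split="train",
    dataset_columns=(["text"], "label"),
)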
marwanomar1 commented 2 years ago

Thank you, Jack. Things are working now. In the same code above, when I try the yelp dataset, it shows that it will take several days to complete because the dataset has about 560,000 examples.

Is it possible to reduce the number of examples to about 10k so that it would go faster?

jxmorris12 commented 2 years ago

yes! I would try using the rotten_tomatoes dataset instead. It's much smaller.
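
If you do want to stay with yelp, you could also subsample it. A minimal sketch, assuming HuggingFaceDataset accepts a pre-loaded datasets.Dataset as its first argument (the 10k/1k slice sizes are illustrative):

import datasets
from textattack.datasets import HuggingFaceDataset

# Slicing syntax in the split string limits how many examples are loaded.
train_split = datasets.load_dataset("yelp_polarity", split="train[:10000]")
eval_split = datasets.load_dataset("yelp_polarity", split="test[:1000]")

train_dataset = HuggingFaceDataset(train_split)
eval_dataset = HuggingFaceDataset(eval_split)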

marwanomar1 commented 2 years ago

Great. Many thanks. I really appreciate it.

marwanomar1 commented 2 years ago

I am running the following code to test IMDB on a WordCNN model.

It gives me the error: NameError: name 'model_wrapper' is not defined

!pip install textattack
!pip install -U tensorflow-text

import json
import os

import torch
from torch import nn as nn
from torch.nn import functional as F

import textattack
from textattack.model_args import TEXTATTACK_MODELS
from textattack.models.helpers import GloveEmbeddingLayer
from textattack.models.helpers.utils import load_cached_state_dict
from textattack.shared import utils

# We only use DeepWordBugGao2018 for demonstration purposes.
attack = textattack.attack_recipes.DeepWordBugGao2018.build(model_wrapper)
train_dataset = textattack.datasets.HuggingFaceDataset("imdb", split="train")
eval_dataset = textattack.datasets.HuggingFaceDataset("imdb", split="test")

# Train for 3 epochs with 1 initial clean epoch, 1000 adversarial examples
# per epoch, learning rate of 5e-5, and effective batch size of 32 (8x4).
training_args = textattack.TrainingArgs(
    num_epochs=3,
    num_clean_epochs=1,
    num_train_adv_examples=1000,
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    log_to_tb=True,
)

trainer = textattack.Trainer(
    model_wrapper,
    "classification",
    attack,
    train_dataset,
    eval_dataset,
    training_args,
)
trainer.train()

@jxmorris12

jxmorris12 commented 2 years ago

uhh, yeah, you still need this piece of the code, and it has to run before the DeepWordBugGao2018.build(model_wrapper) call (that's where the NameError comes from):

import transformers

model = transformers.AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model_wrapper = textattack.models.wrappers.HuggingFaceModelWrapper(model, tokenizer)
marwanomar1 commented 2 years ago

That worked. Many thanks!

marwanomar1 commented 2 years ago

I ran the training on an LSTM using the command:

textattack train --model-name-or-path lstm --dataset yelp_polarity --epochs 50 --learning-rate 1e-5

Now I want to know which command to use to attack this same model I just trained. I want to attack it with textfooler.

jxmorris12 commented 2 years ago

Pretty sure you have to create a model wrapper file and use the --model-from-file argument to textattack attack. Or you could just write a script that loads the model and runs attacks in the script.
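
A minimal sketch of such a wrapper file, assuming the checkpoint written by textattack train can be reloaded with LSTMForClassification.from_pretrained, that the model exposes its GloVe tokenizer as .tokenizer, and that --model-from-file looks for a module-level variable named model holding a ModelWrapper (the file name and path are illustrative):

# my_model.py (hypothetical file name)
import textattack
from textattack.models.helpers import LSTMForClassification

# Reload the LSTM checkpoint saved by `textattack train`; path is illustrative.
lstm = LSTMForClassification.from_pretrained("./outputs/<your-run>/best_model")

# `textattack attack --model-from-file` loads this file and uses the
# module-level `model` variable, which must be a ModelWrapper.
model = textattack.models.wrappers.PyTorchModelWrapper(lstm, lstm.tokenizer)

The attack would then be run with something like:

textattack attack --recipe textfooler --model-from-file my_model.py --dataset-from-huggingface yelp_polarity --dataset-split test --num-examples 100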

marwanomar1 commented 2 years ago

When I try to run an attack using my saved model, I use this command:

!textattack attack --recipe textfooler --num-examples 100 --model ./outputs/2021-09-15-06-37-33-327512/best_model --dataset-from-huggingface imdb --dataset-split test

but it gives me this error: ValueError: Error: unsupported TextAttack model ./outputs/2021-09-15-06-37-33-327512/best_model

Do you know what could be going wrong?

@jxmorris12

jxmorris12 commented 2 years ago

You're using --model, not --model-from-file, I think that's the problem!
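
So, reusing the hypothetical my_model.py wrapper file sketched above, the command would look something like:

!textattack attack --recipe textfooler --num-examples 100 --model-from-file my_model.py --dataset-from-huggingface imdb --dataset-split test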

marwanomar1 commented 2 years ago

I am trying to run an attack on a pretrained, fine-tuned model as follows:

!textattack attack --model cardiffnlp/twitter-roberta-base-offensive --recipe deepwordbug --num-examples 10

but it's giving me the following error: ValueError: Must supply pretrained model or dataset

I am not sure why it would not take the pretrained model above. Is there anything I am doing wrong here?

@jxmorris12