Can we perform textattack on bigdata?

QData / TextAttack

TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP https://textattack.readthedocs.io/en/master/


MIT License


Can we perform textattack on bigdata? #616

Closed: LalitWagh closed this issue 2 years ago

LalitWagh commented 2 years ago

After evaluating TextAttack and reading the docs, I can't find any way to run attacks on larger datasets that we fetch from a database server. The dataset also needs to be converted before attacking, which is very hard to implement. Could you suggest a solution or create an example notebook for this?

Hanyu-Liu-123 commented 2 years ago

Hi @LalitWagh, the dataset class that the TextAttack attacker uses is implemented here, and it includes a short example of creating your own dataset. This notebook also walks through the process of using your own dataset for attacks.

Thanks!

LalitWagh commented 2 years ago

So, I read the notebook. It helped me understand that the dataset needs to be in ordered dictionary form in order to use TextAttack. Adding to my question: I have data on an AWS server with nearly 37 million rows, and I don't want to convert all of it to ordered dictionaries. What solution can you propose for this problem? Any help is highly appreciated.

Hanyu-Liu-123 commented 2 years ago

Hi @LalitWagh, I think with our current implementation the only two ways to prepare a dataset are to download one from HuggingFace or to create one from scratch (using an ordered dict, as you suggest). I can't think of a way to use your own data directly without changing the dataset class. I'll leave this issue open to see if others have any suggestions.

jxmorris12 commented 2 years ago

@LalitWagh the datasets HuggingFace supports are backed by Apache Arrow and can definitely scale to 37 million datapoints. Otherwise, you can create your own dataset class and implement it as you like. There is no requirement to store 37 million OrderedDict objects in memory, only that you convert each row to an OrderedDict when it is used for the attack (that is, when it is returned by your dataset's __getitem__).
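To make the lazy-conversion idea concrete, here is a minimal sketch, not TextAttack's own code. It uses an in-memory SQLite table as a stand-in for the remote database, and the class name LazyRowDataset, the table name reviews, and the column names text and label are all hypothetical. It assumes the attacker only needs __len__ and __getitem__, with each item returned as an (OrderedDict of inputs, label) pair; in practice you would probably subclass TextAttack's own dataset class and should check what else it expects.

```python
import sqlite3
from collections import OrderedDict


class LazyRowDataset:
    """Hypothetical dataset that fetches and converts one row per access.

    Only the row count is held in memory; no OrderedDict exists until
    __getitem__ is called for that index.
    """

    def __init__(self, conn, table="reviews", text_col="text", label_col="label"):
        # Table and column names are assumptions made for this sketch.
        self.conn = conn
        # ORDER BY rowid keeps the index-to-row mapping stable. On a table
        # with millions of rows, a lookup by an indexed key would be faster
        # than OFFSET, which scans past the skipped rows.
        self.query = (
            f"SELECT {text_col}, {label_col} FROM {table} "
            "ORDER BY rowid LIMIT 1 OFFSET ?"
        )
        self.text_col = text_col
        self.size = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

    def __len__(self):
        return self.size

    def __getitem__(self, i):
        if not 0 <= i < self.size:
            raise IndexError(i)
        text, label = self.conn.execute(self.query, (i,)).fetchone()
        # The conversion to OrderedDict happens here, for this row only.
        return OrderedDict([(self.text_col, text)]), label


# Tiny in-memory table standing in for the real database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (text TEXT, label INTEGER)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [("great movie", 1), ("terrible plot", 0), ("loved it", 1)],
)
dataset = LazyRowDataset(conn)
example, label = dataset[1]
```

Because rows are fetched by index, the same object could also serve a small random sample of indices rather than all 37 million rows.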

jxmorris12 commented 2 years ago

By the way, what is the reason you would ever want to generate 37 million adversarial examples from the same model?