LalitWagh closed this issue 2 years ago
So, I read the notebook. It helped me understand that we need the dataset as an ordered dictionary in order to use TextAttack. Adding to my question: on my AWS server I have data, but it has nearly 37 million rows and I don't want to convert all of it to ordered dictionaries. What solution can you propose for this problem? Any help is highly appreciated.
Hi @LalitWagh, I think with our current implementation the only two ways to prepare a dataset are to download one from HuggingFace or create one from scratch (using an ordered dict, as you suggest). I can't think of a way to directly use your own data without changing the dataset class. I'll leave this issue open to see if others have any suggestions.
@LalitWagh the datasets HuggingFace supports are backed by Apache Arrow and can definitely generalize to 37 million datapoints. Otherwise, you can create your own dataset class and implement it as you like. There is no requirement to store 37 million `OrderedDict` objects in memory, only that you convert each row to an `OrderedDict` when it is used for the attack (returned by your dataset's `__getitem__`).
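To make that concrete, here is a minimal sketch of a lazily converting dataset. The class name, column name, and row format are hypothetical stand-ins for your own data source; the point is only that rows stay in their compact original form and each one becomes an `OrderedDict` only at the moment `__getitem__` is called, so millions of dict objects never exist in memory at once.

```python
from collections import OrderedDict

class LazyRowDataset:
    """Sketch: convert each row to an OrderedDict only on access.

    `rows` could be a list, an Arrow table, or any indexable source;
    plain (text, label) tuples keep this example self-contained.
    """

    def __init__(self, rows):
        self.rows = rows  # kept in their original compact form

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, i):
        # Conversion happens per access, not up front, so only the
        # row currently being attacked is materialized as an OrderedDict.
        text, label = self.rows[i]
        return OrderedDict([("text", text)]), label

rows = [("the movie was great", 1), ("terrible plot", 0)]
dataset = LazyRowDataset(rows)
print(dataset[0])  # (OrderedDict([('text', 'the movie was great')]), 1)
```

The attack loop would then index into `dataset` as usual; whether this drops in directly depends on your TextAttack version, so treat it as a pattern rather than a finished implementation.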
By the way, why would you ever want to generate 37 million adversarial examples from the same model?
After experimenting and reading the docs, I can't find any way to run attacks on larger datasets that we will be fetching from a database server. The dataset needs to be converted before attacking, which is very hard to implement. Could you create an example notebook or suggest a solution for this?
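One way the "fetch from a database server" case could be handled, following the per-row conversion idea from the earlier comment, is a dataset class that queries one row at a time. This is a sketch, not TextAttack's API: the table and column names (`reviews`, `text`, `label`) are hypothetical, and an in-memory SQLite database stands in for a real database server.

```python
import sqlite3
from collections import OrderedDict

class SQLDataset:
    """Sketch: pull rows from a SQL database on demand.

    Only the row currently being attacked is held in memory;
    the 37M rows stay on the database server.
    """

    def __init__(self, conn):
        self.conn = conn
        # Cache the row count once; rows themselves are fetched lazily.
        self._len = conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]

    def __len__(self):
        return self._len

    def __getitem__(self, i):
        # Fetch a single row by rowid (SQLite rowids start at 1).
        text, label = self.conn.execute(
            "SELECT text, label FROM reviews WHERE rowid = ?", (i + 1,)
        ).fetchone()
        return OrderedDict([("text", text)]), label

# In-memory database standing in for a remote server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (text TEXT, label INTEGER)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [("great film", 1), ("dull and slow", 0)],
)
dataset = SQLDataset(conn)
print(len(dataset), dataset[1])
```

For a production database you would swap in your own driver and schema, and probably batch or cache queries, but the lazy `__getitem__` shape stays the same.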