biomedical-cybernetics / Relative-importance-and-activation-pruning


how to get calib_dataset #1

Closed. azuryl closed this issue 2 weeks ago.

azuryl commented 3 months ago

Dear @biomedical-cybernetics @silence1024, thanks for your great work!

I want to run your code, but I ran into an issue:

Traceback (most recent call last):
  File "/home/delight-gpu/Workspace2/azuryl/Relative-importance-and-activation-pruning/main.py", line 151, in <module>
    main()
  File "/home/delight-gpu/Workspace2/azuryl/Relative-importance-and-activation-pruning/main.py", line 92, in main
    prune_ria(args, model, tokenizer, device, prune_n=prune_n, prune_m=prune_m)
  File "/home/delight-gpu/Workspace2/azuryl/Relative-importance-and-activation-pruning/lib/prune.py", line 179, in prune_ria
    dataloader, _ = get_loaders(args.calib_dataset, nsamples=args.nsamples, seed=args.seed, seqlen=args.seqlen, tokenizer=tokenizer)
  File "/home/delight-gpu/Workspace2/azuryl/Relative-importance-and-activation-pruning/lib/data.py", line 120, in get_loaders
    return get_c4(nsamples, seed, seqlen, tokenizer)
  File "/home/delight-gpu/Workspace2/azuryl/Relative-importance-and-activation-pruning/lib/data.py", line 87, in get_c4
    traindata = load_dataset('../../c4', data_files={'train': 'c4-train.00000-of-01024.json'}, split='train')
  File "/home/azuryl/anaconda3/envs/prune_ria/lib/python3.10/site-packages/datasets/load.py", line 2523, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/azuryl/anaconda3/envs/prune_ria/lib/python3.10/site-packages/datasets/load.py", line 2195, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/home/azuryl/anaconda3/envs/prune_ria/lib/python3.10/site-packages/datasets/load.py", line 1848, in dataset_module_factory
    raise FileNotFoundError(
FileNotFoundError: Couldn't find a dataset script at /home/delight-gpu/Workspace2/c4/c4.py or any data file in the same directory.

silence1024 commented 3 months ago

Hi @azuryl, thanks for trying our code.

The C4 dataset is not always stably accessible from our region (China), so I downloaded it to my local disk and loaded it with the load_dataset function. You can instead pull it directly from Hugging Face:

traindata = load_dataset('allenai/c4', 'allenai--c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
valdata = load_dataset('allenai/c4', 'allenai--c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')
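
(Note: on recent releases of the datasets library, the 'allenai--c4' config name has been removed; as a sketch, assuming a recent datasets version, passing only data_files should work, with the same shard file names:)

from datasets import load_dataset

# Newer datasets releases accept the repo id plus data_files directly,
# without the old 'allenai--c4' config argument.
traindata = load_dataset('allenai/c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
valdata = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')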

Or download the dataset to your local disk and revise this line in lib/data.py: traindata = load_dataset('../../c4', data_files={'train': 'c4-train.00000-of-01024.json'}, split='train') so that the path points to your local copy of C4 (see the sketch below).
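
For example, a minimal sketch of the local-disk variant, assuming you downloaded the gzipped shards into a directory of your choosing (the /path/to/c4 path below is illustrative, not part of the repo):

from datasets import load_dataset

# Hypothetical local directory holding the downloaded C4 shards
local_c4 = '/path/to/c4'

# Loading via the generic 'json' builder avoids the "Couldn't find a
# dataset script" error, since no c4.py loading script is required.
traindata = load_dataset('json', data_files={'train': f'{local_c4}/c4-train.00000-of-01024.json.gz'}, split='train')
valdata = load_dataset('json', data_files={'validation': f'{local_c4}/c4-validation.00000-of-00008.json.gz'}, split='validation')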