YebowenHu / MeetingBank-utils


error in code snippet && where to download actual data? #3

Open 370025263 opened 1 month ago

370025263 commented 1 month ago

I followed the code in the README and got an error.

ERROR:
Traceback (most recent call last):
  File "/mnt/disk0/user/chenwei/stone/workspace/meet_bank_dataset/get.py", line 2, in <module>
    meetingbank = load_dataset("huuuyeah/meetingbank")
  File "/home/chenwei/anaconda3/lib/python3.9/site-packages/datasets/load.py", line 1773, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/chenwei/anaconda3/lib/python3.9/site-packages/datasets/load.py", line 1502, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/home/chenwei/anaconda3/lib/python3.9/site-packages/datasets/load.py", line 1219, in dataset_module_factory
    raise e1 from None
  File "/home/chenwei/anaconda3/lib/python3.9/site-packages/datasets/load.py", line 1196, in dataset_module_factory
    return HubDatasetModuleFactoryWithoutScript(
  File "/home/chenwei/anaconda3/lib/python3.9/site-packages/datasets/load.py", line 769, in get_module
    else get_data_patterns_in_dataset_repository(hfh_dataset_info, self.data_dir)
  File "/home/chenwei/anaconda3/lib/python3.9/site-packages/datasets/data_files.py", line 658, in get_data_patterns_in_dataset_repository
    return _get_data_files_patterns(resolver)
  File "/home/chenwei/anaconda3/lib/python3.9/site-packages/datasets/data_files.py", line 223, in _get_data_files_patterns
    data_files = pattern_resolver(pattern)
  File "/home/chenwei/anaconda3/lib/python3.9/site-packages/datasets/data_files.py", line 471, in _resolve_single_pattern_in_dataset_repository
    glob_iter = [PurePath(filepath) for filepath in fs.glob(PurePath(pattern).as_posix()) if fs.isfile(filepath)]
  File "/home/chenwei/anaconda3/lib/python3.9/site-packages/fsspec/spec.py", line 608, in glob
    pattern = glob_translate(path + ("/" if ends_with_sep else ""))
  File "/home/chenwei/anaconda3/lib/python3.9/site-packages/fsspec/utils.py", line 732, in glob_translate
    raise ValueError(
ValueError: Invalid pattern: '**' can only be an entire path component

CODE:
from datasets import load_dataset

meetingbank = load_dataset("huuuyeah/meetingbank")

train_data = meetingbank['train']
test_data = meetingbank['test']
val_data = meetingbank['validation']

def generator(data_split):
    for instance in data_split:
        yield instance['id'], instance['summary'], instance['transcript']

YebowenHu commented 1 month ago

The loading function from Hugging Face works well on my side. Please check your Python environment and your datasets version. Alternatively, here is a link where you can download the JSON files directly from the source:

https://huggingface.co/datasets/huuuyeah/meetingbank/tree/main

Load the JSON file line by line and yield each instance using the same keys: {"id", "summary", "transcript"}.
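
For anyone taking the manual route, here is a minimal sketch of that suggestion. It assumes each downloaded file is in JSON Lines form (one JSON object per line) and that every object carries the "id", "summary", and "transcript" keys. The function name `iter_instances` and the file name `train.json` in the usage comment are hypothetical, so check the actual file names in the repository linked above.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple


def iter_instances(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (id, summary, transcript) from a JSON Lines file.

    Assumption: one JSON object per line, each with the keys
    "id", "summary", and "transcript".
    """
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            yield record["id"], record["summary"], record["transcript"]


# Usage (hypothetical file name; adjust to the file you downloaded):
# for meeting_id, summary, transcript in iter_instances("train.json"):
#     print(meeting_id, len(transcript))
```

This mirrors the `generator` function from the original snippet, so downstream code that unpacks the three fields should work unchanged.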