bigscience-workshop / multilingual-modeling

BLOOM+1: Adapting BLOOM model to support a new unseen language
https://arxiv.org/abs/2212.09535
Apache License 2.0

data loader for multiatis #4

Closed sbmaruf closed 2 years ago

sbmaruf commented 2 years ago

Since MultiATIS++ is distributed under an LDC license and the dataset cannot be uploaded to the Hugging Face Hub, here is a data loader for it. @yongzx

Example use case, loading the dataset:

import datasets

dataset_path_or_name = "data_loader/multiatis.py"  # path to this loader file
dataset_config_name = "fr"  # language name, e.g. "fr"
data_files = "path_to_multiatis_dataset_folder/MultiATIS++/data/train_dev_test"
dataset = datasets.load_dataset(
    dataset_path_or_name,
    dataset_config_name,
    data_files=data_files,
    cache_dir=data_args.cache_dir,  # cache directory from your script's arguments
)
yongzx commented 2 years ago

With this code:

import datasets

dataset_path_or_name = "/users/zyong2/data/zyong2/bigscience/data/external/MultilingualNLU/data_loader/multiatis.py"
dataset_config_name = "fr"
data_files = "/users/zyong2/data/zyong2/bigscience/data/external/MultilingualNLU/data/MultiATIS++/data/train_dev_test"
dataset = datasets.load_dataset(
        dataset_path_or_name,
        dataset_config_name,
        data_files=data_files,
        cache_dir="/users/zyong2/data/zyong2/bigscience/data/external/MultilingualNLU",
    )

I run into this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_89745/3134968329.py in <module>
      8         dataset_config_name,
      9         data_files=data_files,
---> 10         cache_dir="/users/zyong2/data/zyong2/bigscience/data/external/MultilingualNLU",
     11     )

/gpfs/data/sbach/zyong2/bigscience/env_lang_mod/lib/python3.7/site-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, script_version, use_auth_token, task, streaming, **config_kwargs)
    850         ignore_verifications=ignore_verifications,
    851         try_from_hf_gcs=try_from_hf_gcs,
--> 852         use_auth_token=use_auth_token,
    853     )
    854 

/gpfs/data/sbach/zyong2/bigscience/env_lang_mod/lib/python3.7/site-packages/datasets/builder.py in download_and_prepare(self, download_config, download_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, **download_and_prepare_kwargs)
    614                     if not downloaded_from_gcs:
    615                         self._download_and_prepare(
--> 616                             dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
    617                         )
    618                     # Sync info

/gpfs/data/sbach/zyong2/bigscience/env_lang_mod/lib/python3.7/site-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
    669         split_dict = SplitDict(dataset_name=self.name)
    670         split_generators_kwargs = self._make_split_generators_kwargs(prepare_split_kwargs)
--> 671         split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
    672 
    673         # Checksums verification

~/.cache/huggingface/modules/datasets_modules/datasets/multiatis/2de3e038c6d239bff285552d696e02084460f9567b01d1d854e402abe8cb0ad9/multiatis.py in _split_generators(self, _MultiAtis__dl_manager)
     65             train_data_files = [
     66                 _file
---> 67                 for _file in self.config.data_files["train"]
     68                 if _file.split("_")[-1].split(".tsv")[0] == lang.upper()
     69             ]

TypeError: string indices must be integers

When I print self.config.data_files, it gives "/users/zyong2/data/zyong2/bigscience/data/external/MultilingualNLU/data/MultiATIS++/data/train_dev_test" instead of a dictionary.
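A minimal sketch of why that fails (my own illustration, assuming data_files stays a plain string all the way down to the loader):

# Illustration only: when data_files is a plain string rather than a dict of splits,
# indexing it with "train" raises the error above.
data_files = "path/to/MultiATIS++/data/train_dev_test"
data_files["train"]  # TypeError: string indices must be integers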

sbmaruf commented 2 years ago

Sorry, there needs to be a correction. Pass these variables to your data preparation script:

RAW_DATASET_PATH="$RAW_DATA_FOLDER/MultiATIS++/data/train_dev_test"
TRAIN_FILE_NAMES=$RAW_DATASET_PATH/train_*
VALIDATION_FILE_NAMES=$RAW_DATASET_PATH/dev_*
TEST_FILE_NAMES=$RAW_DATASET_PATH/test_*

In the Python code, add the actual file names to the data_files dictionary:

data_files = {}
if data_args.train_file_names is not None:
    data_files[datasets.Split.TRAIN] = data_args.train_file_names
if data_args.validation_file_names is not None:
    data_files[datasets.Split.VALIDATION] = data_args.validation_file_names
if data_args.test_file_names is not None:
    data_files[datasets.Split.TEST] = data_args.test_file_names

Then pass data_files to load_dataset.
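For example, a rough sketch of how the shell globs above map to the data_files dictionary (the glob expansion and paths are illustrative, not the actual preparation script):

import glob
import datasets

# Illustrative only: expand the shell-style wildcards into explicit per-split file lists.
raw_dataset_path = "MultiATIS++/data/train_dev_test"
data_files = {
    datasets.Split.TRAIN: sorted(glob.glob(raw_dataset_path + "/train_*")),
    datasets.Split.VALIDATION: sorted(glob.glob(raw_dataset_path + "/dev_*")),
    datasets.Split.TEST: sorted(glob.glob(raw_dataset_path + "/test_*")),
}

dataset = datasets.load_dataset(
    "data_loader/multiatis.py",  # this loader script
    "fr",                        # language config
    data_files=data_files,
)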

yongzx commented 2 years ago

Thanks! I will make these changes. Right now my SLURM queue is full, so MultiATIS++ training will be pending. I will let you know if I run into any problems.

sbmaruf commented 2 years ago

I would recommend resolving the load_dataset issue before submitting the SLURM job.

yongzx commented 2 years ago

Yep! I am running a notebook to check that the code works.

yongzx commented 2 years ago

Two questions: 1. What would data_args be? 2. Where do the environment variables go? I don't have a data preparation script (I suppose it is not data_loader/multiatis.py, right?)

Edit: never mind, I think I know how to fix the issue.

yongzx commented 2 years ago

Solved it with:

import datasets
import pathlib

dataset_path_or_name = "/users/zyong2/data/zyong2/bigscience/data/external/MultilingualNLU/data_loader/multiatis.py"
dataset_config_name = "fr"
data_files = pathlib.Path("/users/zyong2/data/zyong2/bigscience/data/external/MultilingualNLU/data/MultiATIS++/data/train_dev_test")

data_files_dict = {}
data_files_dict[datasets.Split.TRAIN] = [str(data_files / f"train_{dataset_config_name.upper()}.tsv")]
data_files_dict[datasets.Split.VALIDATION] = [str(data_files / f"dev_{dataset_config_name.upper()}.tsv")]
data_files_dict[datasets.Split.TEST] = [str(data_files / f"test_{dataset_config_name.upper()}.tsv")]

dataset = datasets.load_dataset(
        dataset_path_or_name,
        dataset_config_name,
        data_files=data_files_dict,
        cache_dir="/users/zyong2/data/zyong2/bigscience/data/external/MultilingualNLU/cached",
    )

which returns

DatasetDict({
    train: Dataset({
        features: ['id', 'chunks', 'chunk_labels', 'lang', 'class'],
        num_rows: 4475
    })
    validation: Dataset({
        features: ['id', 'chunks', 'chunk_labels', 'lang', 'class'],
        num_rows: 490
    })
    test: Dataset({
        features: ['id', 'chunks', 'chunk_labels', 'lang', 'class'],
        num_rows: 893
    })
})
yongzx commented 2 years ago

Closing this because we are not using this dataset for language adaptation.