Scripts to prepare catalogue data.
Clone this repo.
Install git-lfs: https://github.com/git-lfs/git-lfs/wiki/Installation
sudo apt-get install git-lfs
git lfs install
Install system dependencies:
sudo apt-add-repository non-free
sudo apt-get update
sudo apt-get install unrar
Create a virtual environment, activate it, and install the Python dependencies:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Create a User Access Token (with write access) at Hugging Face Hub: https://huggingface.co/settings/token
and set the following environment variables in the .env file at the root directory:
HF_USERNAME=<Replace with your Hugging Face username>
HF_USER_ACCESS_TOKEN=<Replace with your Hugging Face API token>
GIT_USER=<Replace with your Git user>
GIT_EMAIL=<Replace with your Git email>
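As a quick check that the .env file is readable and complete, the four variables above can be parsed with a few lines of Python. This is only a sketch: the `load_env` and `missing_keys` helpers are hypothetical and not part of this repository, the scripts here may read the file differently (for example through a library such as python-dotenv), and the parser assumes plain KEY=VALUE lines without quoting.

```python
from pathlib import Path

# Variable names taken from the .env description above.
REQUIRED_KEYS = ("HF_USERNAME", "HF_USER_ACCESS_TOKEN", "GIT_USER", "GIT_EMAIL")


def load_env(path=".env"):
    """Parse simple KEY=VALUE lines into a dict (hypothetical helper)."""
    values = {}
    env_file = Path(path)
    if not env_file.exists():
        return values
    for raw_line in env_file.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values):
    """Return the required variable names that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

Calling `missing_keys(load_env())` from the root directory returns an empty list when all four variables are set.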
To create dataset metadata (in the file dataset_infos.json), run:
python create_metadata.py --repo <repo_id>
where you should replace <repo_id> with the dataset repository ID, e.g. bigscience-catalogue-lm-data/lm_ca_viquiquad
To create an aggregated dataset from multiple datasets, and save it as sharded JSON Lines GZIP files, run:
python aggregate_datasets.py --dataset_ratios_path <path_to_file_with_dataset_ratios> --save_path <dir_path_to_save_aggregated_dataset>
where you should replace:
<path_to_file_with_dataset_ratios>: path to a JSON file containing a dict with dataset names (keys) and their ratios (values) between 0 and 1.
<dir_path_to_save_aggregated_dataset>: directory path to save the aggregated dataset.
Download the Stanza models for the following languages:
import stanza
for lang in {"ar", "ca", "eu", "id", "vi", "zh-hans", "zh-hant"}:
    stanza.download(lang, logging_level="WARNING")
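Clone the Indic NLP resources repository and point INDIC_RESOURCES_PATH to it:
git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
export INDIC_RESOURCES_PATH=<PATH_TO_REPO>
Download the NLTK Punkt tokenizer data:
import nltk
nltk.download("punkt")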
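For the aggregation step described earlier, the dataset ratios file is a JSON dict mapping dataset names to ratios between 0 and 1. The sketch below validates and writes such a file. The `validate_ratios` and `write_ratios` helpers are hypothetical and not part of this repository, the bounds are assumed to be inclusive, and the checks performed by aggregate_datasets.py itself may differ.

```python
import json
from pathlib import Path


def validate_ratios(ratios):
    """Check that every ratio is a number between 0 and 1 (assumed inclusive)."""
    for name, ratio in ratios.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(ratio, bool) or not isinstance(ratio, (int, float)):
            raise TypeError(f"Ratio for {name!r} must be a number, got {ratio!r}")
        if not 0 <= ratio <= 1:
            raise ValueError(f"Ratio for {name!r} must be between 0 and 1, got {ratio}")
    return ratios


def write_ratios(ratios, path):
    """Validate the ratios and save them as a JSON dict (hypothetical helper)."""
    validate_ratios(ratios)
    Path(path).write_text(json.dumps(ratios, indent=2), encoding="utf-8")
    return path
```

For example, `write_ratios({"bigscience-catalogue-lm-data/lm_ca_viquiquad": 0.5}, "dataset_ratios.json")` produces a file that can be passed as --dataset_ratios_path; the file name here is only an illustration.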