awslabs / Renate

Library for automatic retraining and continual learning
https://renate.readthedocs.io
Apache License 2.0

Example config file for nlp datasets #107

Closed. mridulgarg11 closed this issue 1 year ago.

mridulgarg11 commented 1 year ago

Hi authors, is there an example of a config file for NLP datasets, especially around applying transformations? I'm trying to create a minimal example. Here's the config file I'm using:

# renate_config.py
from pathlib import Path
from typing import Callable, Optional, Union

import torch
import torchtext.transforms as T
from torch.hub import load_state_dict_from_url
from torchvision.transforms import Lambda

from renate import defaults
from renate.benchmark.datasets.nlp_datasets import TorchTextDataModule
from renate.benchmark.models.mlp import MultiLayerPerceptron
from renate.benchmark.scenarios import ClassIncrementalScenario, Scenario
from renate.models import RenateModule

def data_module_fn(data_path: Union[Path, str], chunk_id: int, seed: int = defaults.SEED) -> Scenario:
    """Returns a class-incremental scenario instance.
    The transformations passed to prepare the input data are required to convert the data to
    PyTorch tensors.
    """
    data_module = TorchTextDataModule(
        str(data_path),
        dataset_name="AG_news",
        val_size=0.1,
        seed=seed,
    )

    class_incremental_scenario = ClassIncrementalScenario(
        data_module=data_module,
        class_groupings=[[0, 1], [2, 3]],
        chunk_id=chunk_id,
    )
    return class_incremental_scenario

def model_fn(model_state_url: Optional[Union[Path, str]] = None) -> RenateModule:
    """Returns a model instance."""
    if model_state_url is None:
        model = MultiLayerPerceptron(
            num_inputs=256, num_outputs=4, num_hidden_layers=2, hidden_size=64
        )
    else:
        state_dict = torch.load(str(model_state_url))
        model = MultiLayerPerceptron.from_state_dict(state_dict)
    return model

def train_transform() -> Callable:
    """Returns a transform function to be used in the training."""
    # Special-token ids and max length for XLM-R; note that padding_idx is
    # defined here but not applied in this transform (padding would happen
    # when batching).
    padding_idx = 1
    bos_idx = 0
    eos_idx = 2
    max_seq_len = 256
    # Pretrained XLM-R tokenizer assets, as in the torchtext XLM-R tutorial:
    xlmr_vocab_path = r"https://download.pytorch.org/models/text/xlmr.vocab.pt"
    xlmr_spm_model_path = r"https://download.pytorch.org/models/text/xlmr.sentencepiece.bpe.model"
    text_transform = T.Sequential(
        T.SentencePieceTokenizer(xlmr_spm_model_path),
        T.VocabTransform(load_state_dict_from_url(xlmr_vocab_path)),
        T.Truncate(max_seq_len - 2),
        T.AddToken(token=bos_idx, begin=True),
        T.AddToken(token=eos_idx, begin=False),
    )

    return Lambda(lambda x: text_transform(x))
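
As a hedged addition to this config: Renate config files can, to my understanding, also declare a test-time transform via a test_transform function; treating that convention as an assumption, a minimal sketch that reuses the training pipeline would be:

def test_transform() -> Callable:
    """Returns a transform function to be used at test time."""
    # No augmentation is involved, so the test-time tokenization can simply
    # reuse the training pipeline; keeping the two in sync avoids drift.
    return train_transform()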

And the training job script is below:

from renate.tuning import execute_tuning_job

config_space = {
    "optimizer": "SGD",
    "momentum": 0.0,
    "weight_decay": 1e-2,
    "learning_rate": 0.05,
    "batch_size": 32,
    "max_epochs": 5,
    "memory_batch_size": 32,
    "memory_size": 500,
}

if __name__ == "__main__":

    execute_tuning_job(
        config_space=config_space,
        mode="max",
        metric="val_accuracy",
        updater="ER",
        max_epochs=5,
        chunk_id=0,  # this selects the first chunk of the dataset
        config_file="renate_config.py",
        next_state_url="./output_folder/",  # this is where the model will be stored
        backend="local",  # the training job will run on the local machine
    )

This is the stack trace of the error I'm getting:

Logs (stderr):

Global seed set to 0
[2023-01-26 19:28:50,625] INFO [renate.updaters.model_updater:183] No location for current updater state provided. Updating will start from scratch.
[2023-01-26 19:28:50,625] WARNING [renate.updaters.model_updater:254] No updater state available. Updating from scratch.
Traceback (most recent call last):
  File "/Users/mridulgr/opt/anaconda3/envs/cl/lib/python3.9/site-packages/renate/cli/run_training.py", line 249, in <module>
    ModelUpdaterCLI().run()
  File "/Users/mridulgr/opt/anaconda3/envs/cl/lib/python3.9/site-packages/renate/cli/run_training.py", line 238, in run
    model_updater.update(
  File "/Users/mridulgr/opt/anaconda3/envs/cl/lib/python3.9/site-packages/renate/updaters/model_updater.py", line 331, in update
    train_loader, val_loader = self._learner.on_model_update_start(
  File "/Users/mridulgr/opt/anaconda3/envs/cl/lib/python3.9/site-packages/renate/updaters/experimental/er.py", line 83, in on_model_update_start
    train_loader, val_loader = super().on_model_update_start(
  File "/Users/mridulgr/opt/anaconda3/envs/cl/lib/python3.9/site-packages/renate/updaters/learner.py", line 240, in on_model_update_start
    train_loader = DataLoader(
  File "/Users/mridulgr/opt/anaconda3/envs/cl/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 344, in __init__
    sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
  File "/Users/mridulgr/opt/anaconda3/envs/cl/lib/python3.9/site-packages/torch/utils/data/sampler.py", line 107, in __init__
    raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0
610v4nn1 commented 1 year ago

I will look into this today

610v4nn1 commented 1 year ago

This is happening because the train_dataset passed to the DataLoader (learner.py L241) has size zero. That, in turn, happens because the training set in data_module is seen by run_training.py as a Subset of size zero. I think something needs to change in how the dataset is generated.
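
As a hedged sanity check against the config above (assuming the scenario exposes the usual data-module methods prepare_data/setup/train_data; the exact names are my assumption, not verified against this Renate version):

# Hypothetical debugging snippet; method names are assumed.
from renate_config import data_module_fn

scenario = data_module_fn("./data", chunk_id=0)
scenario.prepare_data()  # download/prepare the raw AG_news data
scenario.setup()         # build the per-chunk Subset for chunk_id=0
print(len(scenario.train_data()))  # 0 here, matching the num_samples=0 error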

I am attaching a screenshot from my debugger in the hope that it helps. If it doesn't, we can catch up offline and we will try to provide an example for that function.

[Screenshot: debugger view, Screen Shot 2023-01-27 at 11 22 47]

nit: I'm not sure what your intent is with this script, but we use the ClassIncrementalScenario in our examples because it is useful for simulating updates with different data. It is not what you want when training a real model; for that, you can just drop the scenario (see the sketch below).
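
For reference, a minimal sketch of the config without a scenario, assuming data_module_fn may return the data module directly when no simulation is needed (mirroring the imports and arguments from the config above):

def data_module_fn(data_path: Union[Path, str], chunk_id: int, seed: int = defaults.SEED):
    """Returns the data module directly; no ClassIncrementalScenario."""
    return TorchTextDataModule(
        str(data_path),
        dataset_name="AG_news",
        val_size=0.1,
        seed=seed,
    )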

mridulgarg11 commented 1 year ago

Thanks for your reply. I was able to resolve the error, and you're right, it was about how the dataset was being generated. I ended up performing the transformations outside of Renate (see the sketch below), but it would still be helpful if you could share an example of how the text transforms can be applied within Renate.
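
A minimal sketch of that workaround in plain PyTorch, reusing the text_transform pipeline from the config above (the fixed-length padding and the pretokenize helper are my own illustrative choices, not Renate API):

import torch
from torch.utils.data import TensorDataset

def pretokenize(texts, labels, text_transform, max_seq_len=256, padding_idx=1):
    # Tokenize every example up front and pad to a fixed length, so the
    # dataset handed to Renate already contains tensors and needs no transform.
    token_ids = [text_transform(t) for t in texts]
    padded = [ids + [padding_idx] * (max_seq_len - len(ids)) for ids in token_ids]
    return TensorDataset(torch.tensor(padded), torch.tensor(labels))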

In regards to your second note, I was using this script to simulate model updates with different data. I have two goals here: 1) benchmark different CL algorithms against the fine-tuning method on my own dataset; 2) once I find the best results, set up a model retraining pipeline. My understanding is that for 1), I would use the ClassIncrementalScenario to simulate CL in an offline setting. For 2), I wouldn't need any Scenario and would just pass in the new dataset (see the sketch below). Let me know if that sounds correct.
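
A hedged sketch of what the follow-up update in 2) could look like, assuming execute_tuning_job accepts a state_url pointing at the previous run's next_state_url (the state_url parameter name is my assumption):

execute_tuning_job(
    config_space=config_space,
    mode="max",
    metric="val_accuracy",
    updater="ER",
    max_epochs=5,
    chunk_id=1,  # the next chunk, or simply the newly collected data
    config_file="renate_config.py",
    state_url="./output_folder/",  # assumed: load the state saved by the first update
    next_state_url="./output_folder_2/",  # where the updated model will be stored
    backend="local",
)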

610v4nn1 commented 1 year ago

It's perfectly fine to use the scenario to simulate different situations; I brought it up just as an FYI. Given your explanation, it seems reasonable to use it as a starting point, but you will probably need to tune a few more things (e.g., hyperparameters) depending on your data/problem.
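
As an illustration, a config_space with tunable ranges could look like the sketch below, assuming the tuner accepts syne-tune search-space domains (Renate builds on syne-tune for HPO); the ranges themselves are arbitrary:

from syne_tune.config_space import choice, loguniform, uniform

config_space = {
    "optimizer": "SGD",
    "momentum": uniform(0.0, 0.9),          # tunable instead of fixed at 0.0
    "weight_decay": loguniform(1e-5, 1e-1),
    "learning_rate": loguniform(1e-3, 1e-1),
    "batch_size": choice([16, 32, 64]),
    "max_epochs": 5,
    "memory_batch_size": 32,
    "memory_size": 500,
}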

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 14 days with no activity. Resume the discussion in the next week or it will be closed automatically.