Lightning-Universe / lightning-transformers

Flexible components pairing 🤗 Transformers with ⚡ PyTorch Lightning
https://lightning-transformers.readthedocs.io
Apache License 2.0

Glue MNLI task fails due to missing 'validation' key in dataset #213

Closed mariomeissner closed 2 years ago

mariomeissner commented 2 years ago

🐛 Bug

MNLI has two validation sets and two test sets, called validation_matched, validation_mismatched, test_matched and test_mismatched. I assume this was not taken into account in the datamodule.

To Reproduce

Steps to reproduce the behavior:

Run the following command:

python train.py task=nlp/text_classification dataset=nlp/text_classification/glue dataset.cfg.dataset_config_name=mnli

Expected behavior

I would expect the dataloader to handle MNLI as a special case and load validation_matched and test_matched by default, with perhaps an option to also evaluate on the mismatched splits when desired.

Environment

A standard pip install from source, as of 2021.12.01. Fails with or without GPU.

mariomeissner commented 2 years ago

Huggingface datasets does have a dataset called mnli_matched, whose dictionary keys are simply called validation and test, which would sidestep the issue. Unfortunately that dataset has no train set, so we are stuck with using mnli.

mariomeissner commented 2 years ago

I'd submit a PR if I knew what approach is recommended to accommodate this edge case.

mathemusician commented 2 years ago

To anyone else trying to reproduce this: it only works if you run pip install pytorch-lightning==1.4. My environment: Google Colab.

I'd change lightning_transformers/core/data.py to something like this:

    def train_dataloader(self) -> DataLoader:
        if hasattr(self.cfg, "train_dataset"):
            dataset_name = self.cfg.train_dataset 
        elif "train" in self.ds:
            dataset_name = "train"
        else:
            raise KeyError("'train' subset not found in dataset")

        return DataLoader(
            self.ds[dataset_name],
            batch_size=self.batch_size,
            num_workers=self.cfg.num_workers,
            collate_fn=self.collate_fn,
        )

    def val_dataloader(self) -> DataLoader:
        if hasattr(self.cfg, "valid_dataset"):
            dataset_name = self.cfg.valid_dataset 
        elif "validation" in self.ds:
            dataset_name = "validation"
        else:
            raise KeyError("'validation' subset not found in dataset")

        return DataLoader(
            self.ds[dataset_name],
            batch_size=self.batch_size,
            num_workers=self.cfg.num_workers,
            collate_fn=self.collate_fn,
        )

    def test_dataloader(self) -> Optional[DataLoader]:
        if hasattr(self.cfg, "test_dataset"):
            dataset_name = self.cfg.test_dataset 
        elif "test" in self.ds:
            dataset_name = "test"
        else:
            raise KeyError("'test' subset not found in dataset")

        return DataLoader(
            self.ds[dataset_name],
            batch_size=self.batch_size,
            num_workers=self.cfg.num_workers,
            collate_fn=self.collate_fn,
        )

This way, I can specify subsets when needed. This should work on any dataset, not just MNLI.

You'd have to document somewhere that this is possible, but you get the gist.
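The lookup order in the loaders above (explicit config attribute wins, otherwise fall back to the conventional split name) can be distilled into a small standalone helper. This is just an illustrative sketch; `resolve_split` and the example keys are hypothetical names, not part of the library:

```python
from types import SimpleNamespace

def resolve_split(cfg, ds_keys, attr, default):
    """Pick a split name: an explicit cfg attribute wins,
    otherwise fall back to the conventional key."""
    if hasattr(cfg, attr):
        return getattr(cfg, attr)
    if default in ds_keys:
        return default
    raise KeyError(f"'{default}' subset not found in dataset")

# MNLI-style split names: no plain "validation" key exists.
mnli_keys = ["train", "validation_matched", "validation_mismatched"]

# With the override set, the matched validation split is chosen.
cfg = SimpleNamespace(valid_dataset="validation_matched")
print(resolve_split(cfg, mnli_keys, "valid_dataset", "validation"))

# Without the override, a standard dataset still resolves to "validation".
print(resolve_split(SimpleNamespace(), ["train", "validation"], "valid_dataset", "validation"))
```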

Specifying this in hydra should be as easy as:

!pl-transformers-train                       \
      task=nlp/text_classification           \
      dataset=nlp/text_classification/glue   \
      dataset.cfg.dataset_config_name=mnli   \
      ++dataset.cfg.valid_dataset=validation_matched

mariomeissner commented 2 years ago

Interesting, when I run your notebook with lightning >= 1.5, I get:

TypeError: Error instantiating 'pytorch_lightning.trainer.trainer.Trainer' : __init__() got an unexpected keyword argument 'truncated_bptt_steps'

I assume that when that error is fixed, the next error will be the validation key missing though. When I run it locally with git clone and pip install . and lightning >= 1.5, I didn't get this error. Regardless, it's probably better to open a new issue for the above error...

mathemusician commented 2 years ago

> Interesting, when I run your notebook with lightning >= 1.5, I get:
>
> TypeError: Error instantiating 'pytorch_lightning.trainer.trainer.Trainer' : __init__() got an unexpected keyword argument 'truncated_bptt_steps'
>
> I assume that when that error is fixed, the next error will be the validation key missing though. When I run it locally with git clone and pip install . and lightning >= 1.5, I didn't get this error. Regardless, it's probably better to open a new issue for the above error...

There already is an issue open for the above error: #212

That said, were you still looking to do a pull request for this feature? Because if not, I just might do it as well.

mariomeissner commented 2 years ago

Oh, I'd like to make the PR then! Thanks for the opportunity.


mariomeissner commented 2 years ago

Sorry for the delay, it's been a tough week. I've been looking into this again, and the suggested approach would not work if we set the arguments train_val_split (which we honestly shouldn't if we're using a dataset that already has a split) and, more importantly, limit_[subset]_samples, because "validation" is hard-coded there as well.

I've thought about two approaches to solve this more universally.

  1. Store the subset names as attributes and always access subsets using the attribute as key instead of the hard-coded strings "validation"/"test".
  2. Rename any odd subset names to the standard convention, so that the rest of the code can keep working with hard-coded subset names as usual. I personally think this approach is better because it avoids further issues if people don't realize they should be using the attribute key.

I'll PR the second approach, and welcome any comments.
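A minimal sketch of what the second approach could look like, assuming a plain mapping of split names. The helper name `normalize_splits` and the default key arguments are my own choices for illustration; the actual PR may structure this differently:

```python
def normalize_splits(splits, validation_key="validation_matched", test_key="test_matched"):
    """Rename non-standard split names to the conventional "validation"/"test"
    keys so downstream code can keep using its hard-coded names."""
    splits = dict(splits)  # shallow copy; also works on a datasets.DatasetDict
    if "validation" not in splits and validation_key in splits:
        splits["validation"] = splits.pop(validation_key)
    if "test" not in splits and test_key in splits:
        splits["test"] = splits.pop(test_key)
    return splits

# MNLI-style split layout: only matched/mismatched variants exist.
mnli = {"train": "...", "validation_matched": "...", "validation_mismatched": "...",
        "test_matched": "...", "test_mismatched": "..."}
print(sorted(normalize_splits(mnli)))
# → ['test', 'test_mismatched', 'train', 'validation', 'validation_mismatched']
```

Datasets that already use the standard names pass through unchanged, since the renames only fire when "validation"/"test" are absent.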

mathemusician commented 2 years ago

@mariomeissner, I agree with the second approach, and I'm curious to see how you implement it.

mariomeissner commented 2 years ago

Created a PR #214

mathemusician commented 2 years ago

> Created a PR #214

I like it! Very neat.