Lightning-Universe / lightning-transformers

Flexible components pairing 🤗 Transformers with ⚡ PyTorch Lightning
https://lightning-transformers.readthedocs.io
Apache License 2.0
607 stars 77 forks

Use and rename special subset name if provided #214

Closed mariomeissner closed 2 years ago

mariomeissner commented 2 years ago

This PR addresses and closes issue #213.

I modified the load_dataset function in core/nlp/data.py to find and use special subset names if provided, then rename them back to the standard names to avoid failures further down the pipeline.

This cannot be done in the [subset]_dataloader functions as initially proposed, because other functions such as _select_samples are called before them (and would fail).
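For clarity, here is a minimal sketch of the rename-back step described above; the function name and signature are hypothetical, not the actual patch:

```python
# Hypothetical sketch: given a split mapping (e.g. a datasets.DatasetDict,
# which behaves like a dict), move any specially named subsets over to the
# standard split names so downstream code (e.g. _select_samples) always
# sees "train"/"validation"/"test".
def rename_special_subsets(dataset_dict, subset_names):
    # subset_names maps standard split name -> special subset name,
    # e.g. {"validation": "validation_matched"} for MNLI.
    for standard, special in subset_names.items():
        if special in dataset_dict:
            dataset_dict[standard] = dataset_dict.pop(special)
    return dataset_dict
```

So for MNLI, something like `rename_special_subsets(ds, {"validation": "validation_matched"})` would leave `ds["validation"]` pointing at the matched validation split.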

Two main reasons for patching core/nlp/data.py instead of core/data.py:

  1. load_dataset is not present in core/data.py. One could create a function there and have the load_dataset in core/nlp/data.py call it first...
  2. I never saw this issue happen in speech or vision; it really seems to be an MNLI-only thing for now.

It's up for debate whether this should really be moved to core/data.py instead.

This PR should likely also include documentation changes, but I haven't yet checked how to do that. Pointers appreciated.

I tested the following commands successfully:

# A normal glue task without any subset name issues
CUDA_VISIBLE_DEVICES=3 python train.py task=nlp/text_classification dataset=nlp/text_classification/glue dataset.cfg.dataset_config_name=sst2 trainer.val_check_interval=0.01 training.batch_size=128 trainer.gpus=1
# MNLI issue with renamed subset, smaller validation size and more frequent validation to test that subsampling also works fine
CUDA_VISIBLE_DEVICES=3 python train.py task=nlp/text_classification dataset=nlp/text_classification/glue dataset.cfg.dataset_config_name=mnli ++dataset.cfg.validation_subset_name=validation_matched trainer.val_check_interval=0.01 training.batch_size=128 trainer.gpus=1 ++dataset.cfg.limit_val_samples=256
mathemusician commented 2 years ago

For documentation, you could probably add something new to that page by adding a file in the nlp folder.

Arguably, there could be a separate page for this, but I don't have the credentials to approve/disapprove that type of change.

mariomeissner commented 2 years ago

So I mentioned it inside "Custom Data Files" for now, but I agree that it's not really the right home for it (this is not about custom files). For now I guess it suffices!

codecov[bot] commented 2 years ago

Codecov Report

Merging #214 (1fa0133) into master (4efda7b) will decrease coverage by 0%. The diff coverage is 42%.

@@          Coverage Diff          @@
##           master   #214   +/-   ##
=====================================
- Coverage      86%    86%   -0%     
=====================================
  Files          70     70           
  Lines        1549   1557    +8     
=====================================
+ Hits         1337   1340    +3     
- Misses        212    217    +5     
mathemusician commented 2 years ago

In the documentation, I would note that this works for the test and train files too. Something like:

Additionally, this works for test and train files as well:

++dataset.cfg.train_subset_name=name_of_subset
++dataset.cfg.test_subset_name=name_of_subset

Besides that, I don't see anything else to add.

mariomeissner commented 2 years ago

Sorry I missed your last comment @mathemusician. I've now added what you suggested. It seems the CI is failing, but for something unrelated? Let me know if there's anything further to do.

mathemusician commented 2 years ago

@mariomeissner Just saw this. @Borda I can fix the CI issue; it's just an import problem. @clementpoiret changed _TORCH_MAX_VERSION_1_8_1 to _TORCH_MAX_VERSION_SPARSEML.

SeanNaren commented 2 years ago

Thank you for this :)