This PR adds classes and functions for 3 new classification tasks. The existing NLI dataset was adapted to the recent trainer, dataset and evaluator refactoring.
For multi-category classification, computing f1, precision and recall raised an error that interrupted training. These metrics are now computed as weighted averages in multi-class cases.
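To illustrate what "weighted" means here, below is a minimal pure-Python sketch, not the code from this PR. The function name `weighted_prf` is hypothetical. Each class's precision, recall and f1 is averaged with a weight equal to that class's share of the true labels. If the evaluator relies on scikit-learn (an assumption, the PR does not say), the equivalent is passing `average="weighted"` to its metric functions.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall and f1 (hypothetical helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    support = Counter(y_true)      # number of true examples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)

    precision = recall = f1 = 0.0
    for label, n in support.items():
        # Per-class scores; a class that is never predicted gets precision 0.
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / n
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        # Weight each class by its share of the true labels.
        weight = n / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"precision": precision, "recall": recall, "f1": f1}


scores = weighted_prf([0, 0, 1, 2], [0, 1, 1, 2])
```

Because the weights follow class frequency, frequent classes dominate the score; a macro average would treat every class equally instead.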
Some datasets turned out to contain many duplicates. Added a deduplicate_data function that removes duplicates within splits; deduplication between splits is not yet implemented. (The issue was more apparent in datasets that only contain a train set.)
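A rough sketch of within-split deduplication, for reviewers who want the idea without opening the diff. The actual signature of deduplicate_data is not shown in this description, so the name `deduplicate_data_sketch`, the dict-of-lists input and the `key_fields` parameter are all assumptions made for illustration.

```python
def deduplicate_data_sketch(splits, key_fields=("text", "label")):
    """Drop repeated examples inside each split, keeping the first occurrence.

    `splits` maps a split name to a list of example dicts (assumed layout).
    Duplicates that appear in two different splits are left untouched.
    """
    deduplicated = {}
    for split_name, examples in splits.items():
        seen = set()  # reset per split, so only within-split duplicates are removed
        unique = []
        for example in examples:
            key = tuple(example[field] for field in key_fields)
            if key not in seen:
                seen.add(key)
                unique.append(example)
        deduplicated[split_name] = unique
    return deduplicated


data = {
    "train": [
        {"text": "good", "label": 1},
        {"text": "good", "label": 1},
        {"text": "bad", "label": 0},
    ],
    "test": [{"text": "good", "label": 1}],
}
clean = deduplicate_data_sketch(data)
```

In this example the repeated train example is dropped, while the copy in test stays, which mirrors the "between splits is not yet implemented" limitation.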
Fixed the train-test-val split issue that occurred when the original dataset has no val or test set. Now: if both test and validation are missing, the train dataset is randomly split using a 10-10-80 ratio. If only one of test or val is missing, 10% is taken from the train split (10: test/val, 90: train).
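The splitting rule can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the function name `fill_missing_splits`, the split keys `"train"`, `"test"` and `"validation"`, the list-based data layout and the fixed seed are all hypothetical.

```python
import random


def fill_missing_splits(splits, holdout_fraction=0.1, seed=0):
    """Carve missing test/validation splits out of train (hypothetical helper)."""
    missing = [name for name in ("test", "validation") if not splits.get(name)]
    # Copy the splits that already exist so the input is not modified.
    result = {name: list(rows) for name, rows in splits.items() if name not in missing}

    train = list(splits["train"])
    if missing:
        # Shuffle once so the held-out examples are a random sample of train.
        random.Random(seed).shuffle(train)

    # Each missing split receives 10% of the ORIGINAL train size, so train keeps
    # 80% when both are missing and 90% when only one is missing.
    holdout_size = round(len(train) * holdout_fraction)
    for name in missing:
        result[name] = train[:holdout_size]
        train = train[holdout_size:]

    result["train"] = train
    return result


both_missing = fill_missing_splits({"train": list(range(100))})
one_missing = fill_missing_splits({"train": list(range(100)), "test": [100, 101]})
```

With 100 train examples, the first call yields 80/10/10 and the second yields 90 train, 10 validation and the original test split unchanged.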