The benchmark dataset will serve as the litmus test for a model's performance so we have to make sure the dataset is representative of what we need.
Why was the currently proposed design selected over alternatives?
This approach was selected because it leads to equal weights given to all departments.
What would be the impact of going with one of the alternative approaches?
An alternative would be to take a random percentage of the whole dataset which may give uneven distribution.
Testing
The test for the distribution is part of the notebook where we check the distribution of the different departments.
RFC0114: Recreate the benchmark dataset, ensuring a more balanced distribution across all departments
Named Concepts
Benchmark dataset: The dataset is used as a reference point for performance evaluation.
Summary
Recreate a benchmark dataset with a more even distribution within departments, specifically considering genders, ages, and education qualifications.
Dependencies
saymore-report-generator
Infrastructures
This can be run on a local machine.
Justification
The benchmark dataset will serve as the litmus test for a model's performance so we have to make sure the dataset is representative of what we need.
Testing
The test for the distribution is part of the notebook where we check the distribution of the different departments.
Implementation Steps
stt-combine-datasets
Reviewed By
spsither