RFC0114: Recreate the benchmark dataset, ensuring a more balanced distribution across all departments

Named Concepts

Benchmark dataset: The dataset used as the reference point for evaluating model performance.

Summary

Recreate the benchmark dataset with a more even distribution within each department, specifically across gender, age, and education qualification.

Dependencies

saymore-report-generator

Infrastructure

This can be run on a local machine.

Justification

The benchmark dataset will serve as the litmus test for a model's performance, so we have to make sure it is representative of what we need.

Why was the currently proposed design selected over alternatives? This approach was selected because it gives equal weight to every department.

What would be the impact of going with one of the alternative approaches? An alternative would be to take a random percentage of the whole dataset, which may give an uneven distribution across departments.
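The selected approach can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the record layout (a dict with a `department` key), the function name `sample_per_department`, and the toy departments "A" and "B" are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def sample_per_department(records, n_per_dept, seed=0):
    """Draw up to n_per_dept records from each department (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the benchmark is reproducible
    by_dept = defaultdict(list)
    for rec in records:
        by_dept[rec["department"]].append(rec)

    benchmark = []
    for dept in sorted(by_dept):
        pool = by_dept[dept]
        # A department with fewer than n_per_dept records contributes all of them.
        k = min(n_per_dept, len(pool))
        benchmark.extend(rng.sample(pool, k))
    return benchmark


# Toy data: department "A" is ten times larger than department "B".
records = [{"id": i, "department": "A"} for i in range(1000)]
records += [{"id": 1000 + i, "department": "B"} for i in range(100)]

balanced = sample_per_department(records, n_per_dept=50)
balanced_counts = Counter(rec["department"] for rec in balanced)
```

With a fixed quota per department, both toy departments contribute the same number of records. A random percentage of the whole toy dataset would instead be expected to keep roughly the original 10:1 ratio.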

Testing

The distribution test is part of the notebook, where we check how the samples are distributed across the different departments.
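A check of this kind might look like the sketch below. The notebook's real test is not shown in this RFC, so the `department` field, the helper names, and the 5% tolerance are assumptions chosen for illustration.

```python
from collections import Counter


def department_counts(records):
    """Count how many benchmark records belong to each department."""
    return Counter(rec["department"] for rec in records)


def is_balanced(counts, tolerance=0.05):
    """Return True if the smallest department is within `tolerance`
    (as a fraction of the largest department) of the largest one."""
    if not counts:
        return True
    largest = max(counts.values())
    smallest = min(counts.values())
    return (largest - smallest) <= tolerance * largest


# Toy benchmarks with hypothetical department names.
even = [{"department": d} for d in ("A", "B") for _ in range(1000)]
skewed = [{"department": "A"}] * 1000 + [{"department": "B"}] * 100

even_ok = is_balanced(department_counts(even))
skewed_ok = is_balanced(department_counts(skewed))
```

Here `even_ok` is true and `skewed_ok` is false, so the same assertion could be used to guard against regressions when the benchmark is regenerated.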

Implementation Steps

stt-combine-datasets

[x] Collect all the data for the benchmark. Estimated time: 20 mins. Actual time: 20 mins.
[x] Create a benchmark with 1k samples from each department. Estimated time: 10 mins. Actual time: 10 mins.
[x] Parse the STT_CS filename to get metadata. Estimated time: 1 h. Actual time: 2 h.
[x] Use STT_CS metadata for more even representation. Estimated time: 1 h. Actual time: 1 h.
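The last step, using metadata for a more even representation inside one department, could be approximated with round-robin sampling over metadata groups. This is only a sketch under assumptions: the metadata keys (`gender` here) and the helper `sample_evenly` are hypothetical, and the actual STT_CS filename format and parsing logic are not reproduced.

```python
import random
from collections import Counter, defaultdict


def sample_evenly(records, keys, total, seed=0):
    """Pick up to `total` records, spread as evenly as possible over the
    groups defined by the metadata fields in `keys` (hypothetical helper)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in keys)].append(rec)
    for pool in groups.values():
        rng.shuffle(pool)  # randomize which records each group gives up

    picked = []
    # Round-robin: each pass takes one record from every non-empty group,
    # so small groups are used up before large groups fill the remainder.
    while len(picked) < total and any(groups.values()):
        for key in sorted(groups):
            if groups[key] and len(picked) < total:
                picked.append(groups[key].pop())
    return picked


# Toy department pool: 90 records tagged "male", 10 tagged "female".
pool = [{"id": i, "gender": "male"} for i in range(90)]
pool += [{"id": 90 + i, "gender": "female"} for i in range(10)]

subset = sample_evenly(pool, keys=["gender"], total=20)
subset_counts = Counter(rec["gender"] for rec in subset)
```

In this toy case the 20 selected records split evenly between the two groups, even though the pool is 9:1. Passing several keys, for example gender, age group, and education, would balance over their combinations in the same way.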

Reviewed By

spsither

OpenPecha / Requests