OpenPecha / Requests

RFWs and RFCs for all OpenPecha repositories
RFC0114: Recreate the benchmark dataset, ensuring a more balanced distribution across all departments #448

Closed spsither closed 4 months ago

spsither commented 4 months ago

Named Concepts

Benchmark dataset: A dataset used as a reference point for evaluating model performance.

Summary

Recreate the benchmark dataset with a more even distribution within departments, specifically considering gender, age, and educational qualification.

Dependencies

saymore-report-generator

Infrastructures

This can be run on a local machine.

Justification

The benchmark dataset will serve as the litmus test for a model's performance, so we have to make sure it is representative of what we need.

Why was the currently proposed design selected over alternatives? This approach was selected because it gives equal weight to all departments.

What would be the impact of going with one of the alternative approaches? An alternative would be to take a random percentage of the whole dataset, which may give an uneven distribution across departments.
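
The difference between the two approaches can be sketched as follows. This is only an illustration under assumptions, not the code in stt-combine-datasets: the record layout (a dict with a "department" field), the department names "A", "B", "C", and the helper names sample_balanced and sample_fraction are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def sample_balanced(records, per_department, key="department", seed=0):
    """Draw up to `per_department` records from every department.

    Hypothetical helper: each department contributes the same number of
    records (or all of its records if it has fewer than requested).
    """
    rng = random.Random(seed)
    by_department = defaultdict(list)
    for record in records:
        by_department[record[key]].append(record)

    benchmark = []
    for department in sorted(by_department):
        group = by_department[department]
        k = min(per_department, len(group))
        benchmark.extend(rng.sample(group, k))
    return benchmark


def sample_fraction(records, fraction, seed=0):
    """Alternative approach: a random percentage of the whole dataset."""
    rng = random.Random(seed)
    k = round(len(records) * fraction)
    return rng.sample(records, k)


# Made-up, heavily skewed data: 900 / 90 / 10 records per department.
records = [
    {"id": f"{department}-{i}", "department": department}
    for department, size in (("A", 900), ("B", 90), ("C", 10))
    for i in range(size)
]

balanced = sample_balanced(records, per_department=10)
balanced_counts = Counter(r["department"] for r in balanced)
# balanced_counts has exactly 10 records for each of A, B and C.

random_counts = Counter(r["department"] for r in sample_fraction(records, 0.03))
# random_counts follows the skew of the source data, so small
# departments may be barely represented or missing entirely.
```

With per-department sampling the benchmark size is controlled by the smallest department, which is the trade-off for giving every department equal weight.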

Testing

The distribution test is part of the notebook, which checks how the benchmark samples are distributed across the different departments.
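
A distribution check of this kind might look like the sketch below. It is not taken from the notebook: the field name "department", the helper names share_by and is_balanced, and the 5% tolerance are assumptions chosen for illustration. The same helper could be pointed at other fields such as gender, age, or education by changing the key.

```python
from collections import Counter


def share_by(records, key):
    """Return the fraction of records for each value of `key`."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: n / total for value, n in counts.items()}


def is_balanced(shares, tolerance=0.05):
    """True if the largest and smallest shares differ by at most `tolerance`.

    An empty distribution is treated as not balanced, since there is
    nothing to evaluate against.
    """
    if not shares:
        return False
    return max(shares.values()) - min(shares.values()) <= tolerance


# Made-up example: an even benchmark and a skewed one.
even = [{"department": d} for d in ("A", "B", "C") for _ in range(10)]
skewed = [{"department": d} for d, n in (("A", 90), ("B", 9), ("C", 1)) for _ in range(n)]

even_ok = is_balanced(share_by(even, "department"))      # True
skewed_ok = is_balanced(share_by(skewed, "department"))  # False
```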

Implementation Steps

stt-combine-datasets

Reviewed By

spsither